Author Topic: how to extract images embedded in a self-contained html page as files?  (Read 1184 times)

DiTBho (Topic starter)

  • Super Contributor
  • ***
  • Posts: 4018
  • Country: gb
Suppose someone (insanely) gives you a 14 MB self-contained web page (a single HTML file) with a table in which the third column of every row contains an embedded image:

Code:
  </v:shapetype><v:shape id="Picture_x0020_507" o:spid="_x0000_s1518" type="#_x0000_t75"
   style='position:absolute;margin-left:3pt;margin-top:11.25pt;width:201pt;
   height:184.5pt;z-index:494;visibility:visible' o:gfxdata="UEsDBBQABgAIAAAAIQDAV3P7DAEAA
BkCAAATAAAAW0NvbnRlbnRfVHlwZXNdLnhtbJSRwU7DMBBE
70j8g+UrShw4IISa9EDgCBUqH2DZm8QlXlteN7R/j93QS0WQONq7M2/GXq0PdmQTBDIOa35bVpwB
KqcN9jX/2L4UD5xRlKjl6BBqfgTi6+b6arU9eiCW1Eg1H2L0j0KQGsBKKp0HTJPOBStjOoZeeKk+
ZQ/irqruhXIYAWMRswdvVi10cj9G9nxI13OSACNx9jQvZlbNpfejUTKmpGJCfUEpfghlUp52aDCe
blIMLn4l5MkyYFm38/2FztjcbOehz6i39JrBaGAbGeKrtCm50IGENyruAyTj8m907mapcF1nFJRt
oM2sPHdZAmj3hQGm/7q3SfYO09ldnD62+QYAAP//AwBQSwMEFAAGAAgAAAAhAAjDGKTUAAAAkwEA
AAsAAABfcmVscy8ucmVsc6SQwWrDMAyG74O+g9F9cdrDGKNOb4NeSwu7GltJzGLLSG7avv1M2WAZ
...

How can I extract every picture as a file?

If there were just a few, I would do it manually, but there are thousands and thousands of them, so I need a script or something  :-//
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

DiTBho (Topic starter)

(Can I seriously embed a .png image into an HTML page?  :o :o :o
Yup, see here ... )
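
The link above is not preserved, so which embedding mechanism it described is an assumption here. One common mechanism is the `data:` URI, where the image bytes sit base64-encoded inside the `src` attribute. A minimal Python sketch that pulls such images back out of the page text (the function name `decode_data_uris` is made up for illustration):

```python
import base64
import re

# Matches data:image/<subtype>;base64,<payload> in the raw page text.
DATA_URI_RE = re.compile(r"data:image/([A-Za-z0-9.+-]+);base64,([A-Za-z0-9+/=]+)")


def decode_data_uris(html_text):
    """Return a list of (subtype, raw_bytes) for each base64 image data URI."""
    return [
        (subtype.lower(), base64.b64decode(payload))
        for subtype, payload in DATA_URI_RE.findall(html_text)
    ]
```

Each returned pair can be written to disk with the subtype as the file extension. This only covers `data:` URIs; the `o:gfxdata` attribute in the first post is a different mechanism and is not matched by this pattern.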
 

Smokey

  • Super Contributor
  • ***
  • Posts: 2693
  • Country: us
  • Not An Expert
I pasted exactly what you posted into ChatGPT and added that I wanted a Python script.

Code:
import base64
import os

from bs4 import BeautifulSoup

# Read the HTML content from the file
file_path = 'path/to/your/file.html'  # Update this with the path to your HTML file
with open(file_path, 'r', encoding='utf-8') as file:
    html_content = file.read()

# Parse the HTML content
soup = BeautifulSoup(html_content, 'html.parser')

# Find all elements that contain the image data
image_elements = soup.find_all('v:shape', {'o:gfxdata': True})

# Create a directory to save the extracted data
output_dir = 'extracted_images'
os.makedirs(output_dir, exist_ok=True)

# File signatures used to pick an extension. The sample in the first post
# starts with "UEsDB", which decodes to "PK\x03\x04": a ZIP container, not a PNG.
signatures = {b'PK\x03\x04': 'zip', b'\x89PNG': 'png', b'\xff\xd8\xff': 'jpg', b'GIF8': 'gif'}

# Loop through each element and extract its data
for index, element in enumerate(image_elements, start=1):
    # Decode the base64 data (embedded line breaks are ignored by b64decode)
    data = base64.b64decode(element.get('o:gfxdata'))

    # Pick the extension from the leading bytes; fall back to .bin if unknown
    extension = next((ext for magic, ext in signatures.items() if data.startswith(magic)), 'bin')
    image_filename = os.path.join(output_dir, f'image_{index}.{extension}')

    # Save the decoded data to a file
    with open(image_filename, 'wb') as image_file:
        image_file.write(data)

    print(f'Saved {image_filename}')
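
The `o:gfxdata` sample in the first post begins with `UEsDB`, which is base64 for the ZIP signature `PK\x03\x04`, so each decoded payload looks like a ZIP package rather than a bare PNG. Below is a rough sketch that unpacks each payload and keeps only members with an image extension. Several things here are assumptions: that the packages actually contain image members (they may hold only drawing XML), that the attribute is always double-quoted as in the sample, and that the `v:shape` elements may sit inside conditional comments where an HTML parser would not see them, which is why this scans the raw text with a regex instead. The names `iter_gfxdata_payloads` and `extract_images` are made up for illustration.

```python
import base64
import html
import io
import re
import zipfile
from pathlib import Path

# Matches o:gfxdata="..." anywhere in the raw text, including inside
# comments that an HTML parser would treat as opaque.
GFXDATA_RE = re.compile(r'o:gfxdata="([^"]*)"')
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".gif", ".bmp", ".emf", ".wmf", ".tif", ".tiff"}


def iter_gfxdata_payloads(html_text):
    """Yield the decoded bytes of every o:gfxdata attribute in the text."""
    for match in GFXDATA_RE.finditer(html_text):
        # Turn entities such as &#10; into real line breaks first; otherwise
        # their digits would be mistaken for base64 characters.
        yield base64.b64decode(html.unescape(match.group(1)))


def extract_images(html_text, output_dir):
    """Write image members of each ZIP payload to output_dir; return the paths."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for index, payload in enumerate(iter_gfxdata_payloads(html_text), start=1):
        if not zipfile.is_zipfile(io.BytesIO(payload)):
            continue  # not a ZIP container, nothing to unpack
        with zipfile.ZipFile(io.BytesIO(payload)) as archive:
            for member in archive.namelist():
                if Path(member).suffix.lower() in IMAGE_SUFFIXES:
                    target = out / f"shape_{index}_{Path(member).name}"
                    target.write_bytes(archive.read(member))
                    saved.append(target)
    return saved
```

Usage would be along the lines of `extract_images(Path('path/to/your/file.html').read_text(encoding='utf-8', errors='replace'), 'extracted_images')`. If the returned list is empty, it is worth unzipping one payload by hand to see what the package actually contains before going further.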
 

ledtester

  • Super Contributor
  • ***
  • Posts: 3108
  • Country: us
One way is to use a library like puppeteer to control a headless version of Chrome.

Some links:

- headless Chrome: https://developer.chrome.com/docs/puppeteer/ssr/
- puppeteer: https://pptr.dev/
- example puppeteer code to download the image data of an image ("Method 3"):

https://www.webshare.io/academy-article/puppeteer-download-images#:~:text=or%20unique%20selector.-,Best%20methods%20for%20downloading%20a%20single%20image,-Unique%20Selectors%20or

 

