Author Topic: how to extract images embedded in a self-contained html page as files? (Read 1184 times)

DiTBho · « **on:** June 01, 2024, 02:01:11 am »

Suppose someone (insanely) gives you a 14 MB self-contained web page (a single HTML file) with a table in which the third column of every row contains an embedded image:

Code: [Select]

  </v:shapetype><v:shape id="Picture_x0020_507" o:spid="_x0000_s1518" type="#_x0000_t75"
   style='position:absolute;margin-left:3pt;margin-top:11.25pt;width:201pt;
   height:184.5pt;z-index:494;visibility:visible' o:gfxdata="UEsDBBQABgAIAAAAIQDAV3P7DAEAA
BkCAAATAAAAW0NvbnRlbnRfVHlwZXNdLnhtbJSRwU7DMBBE
70j8g+UrShw4IISa9EDgCBUqH2DZm8QlXlteN7R/j93QS0WQONq7M2/GXq0PdmQTBDIOa35bVpwB
KqcN9jX/2L4UD5xRlKjl6BBqfgTi6+b6arU9eiCW1Eg1H2L0j0KQGsBKKp0HTJPOBStjOoZeeKk+
ZQ/irqruhXIYAWMRswdvVi10cj9G9nxI13OSACNx9jQvZlbNpfejUTKmpGJCfUEpfghlUp52aDCe
blIMLn4l5MkyYFm38/2FztjcbOehz6i39JrBaGAbGeKrtCm50IGENyruAyTj8m907mapcF1nFJRt
oM2sPHdZAmj3hQGm/7q3SfYO09ldnD62+QYAAP//AwBQSwMEFAAGAAgAAAAhAAjDGKTUAAAAkwEA
AAsAAABfcmVscy8ucmVsc6SQwWrDMAyG74O+g9F9cdrDGKNOb4NeSwu7GltJzGLLSG7avv1M2WAZ
...

How can I extract every picture as a file?

If there were just a few, I would do it manually, but there are thousands and thousands, so I need a script or something.

DiTBho · « **Reply #1 on:** June 01, 2024, 02:03:23 am »

(Can I seriously embed a .png image into an HTML page?

Yup, see here ... )
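For reference, the usual way to embed a picture directly in HTML is a base64 `data:` URI in the `src` attribute of an `<img>` tag. The snippet above uses `o:gfxdata` instead, so this is only a minimal sketch of the generic case, not of that particular file. The names `embed_png` and `extract_data_uris` are made up for illustration, and the payload is just the 8-byte PNG signature rather than a viewable picture.

```python
import base64
import re

# Hypothetical payload: only the PNG file signature, not a complete image.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

# Matches base64 image data: URIs and captures the subtype and the payload.
DATA_URI = re.compile(r"data:image/([a-z0-9.+-]+);base64,([A-Za-z0-9+/=\s]+)", re.I)


def embed_png(png_bytes: bytes) -> str:
    """Return an <img> tag carrying the bytes inline as a base64 data: URI."""
    encoded = base64.b64encode(png_bytes).decode("ascii")
    return f'<img src="data:image/png;base64,{encoded}">'


def extract_data_uris(html: str) -> list[tuple[str, bytes]]:
    """Return (subtype, raw bytes) for every base64 image data: URI in html."""
    return [
        (m.group(1).lower(), base64.b64decode(m.group(2)))
        for m in DATA_URI.finditer(html)
    ]


page = f"<table><tr><td>row 1</td><td>{embed_png(PNG_SIGNATURE)}</td></tr></table>"
images = extract_data_uris(page)
```

Each tuple in `images` holds the declared subtype (usable as a file extension) and the decoded bytes, ready to be written out in binary mode.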

Smokey · « **Reply #2 on:** June 01, 2024, 02:07:55 am »

I pasted exactly what you posted into ChatGPT and added that I wanted a Python script.

Code: [Select]

import os
from bs4 import BeautifulSoup
import base64

# Read the HTML content from the file
file_path = 'path/to/your/file.html'  # Update this with the path to your HTML file
with open(file_path, 'r', encoding='utf-8') as file:
    html_content = file.read()

# Parse the HTML content
soup = BeautifulSoup(html_content, 'html.parser')

# Find all elements that contain the image data
image_elements = soup.find_all('v:shape', {'o:gfxdata': True})

# Create a directory to save the extracted images
output_dir = 'extracted_images'
os.makedirs(output_dir, exist_ok=True)

# Loop through each image element and extract the image data
for index, element in enumerate(image_elements):
    # Extract the base64 image data
    gfxdata = element.get('o:gfxdata')
    
    # Decode the base64 data
    image_data = base64.b64decode(gfxdata)
    
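    # Pick the extension from the magic bytes rather than assuming PNG; the sample
    # above starts with "UEsDBBQ", whose first bytes decode to "PK\x03\x04" (ZIP)
    if image_data.startswith(b'\x89PNG'):
        ext = 'png'
    elif image_data.startswith(b'PK\x03\x04'):
        ext = 'zip'
    else:
        ext = 'bin'
    image_filename = os.path.join(output_dir, f'image_{index + 1}.{ext}')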
    
    # Save the image to a file
    with open(image_filename, 'wb') as image_file:
        image_file.write(image_data)

    print(f'Saved {image_filename}')
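One caveat about the posted sample: the `o:gfxdata` value begins with `UEsDBBQ`, which base64-decodes to `PK\x03\x04`, the ZIP signature, and a little further on it decodes to `[Content_Types].xml`. So each payload looks like a ZIP package rather than a bare PNG. Below is a minimal sketch that opens each payload as a ZIP and pulls out members that look like pictures. It uses only the standard library (a regex instead of BeautifulSoup). The extension list and the member name `media/image1.png` are assumptions for illustration; whether the real packages carry the picture bytes at all has not been checked against the original file.

```python
import base64
import io
import re
import zipfile

# Matches o:gfxdata="..." and captures the (possibly multi-line) base64 payload.
GFXDATA = re.compile(r'o:gfxdata="([^"]+)"')
# Assumed set of picture extensions; adjust after inspecting a real package.
IMAGE_EXTS = (".png", ".jpg", ".jpeg", ".gif", ".bmp", ".emf", ".wmf")


def extract_from_gfxdata(html: str) -> dict[str, bytes]:
    """Return {output name: bytes} for image-like members of every ZIP payload."""
    found = {}
    for shape_no, match in enumerate(GFXDATA.finditer(html), start=1):
        # b64decode skips the line breaks embedded in the attribute value
        blob = base64.b64decode(match.group(1))
        if not zipfile.is_zipfile(io.BytesIO(blob)):
            continue  # not a ZIP package, skip this shape
        with zipfile.ZipFile(io.BytesIO(blob)) as package:
            for name in package.namelist():
                if name.lower().endswith(IMAGE_EXTS):
                    out_name = f"shape{shape_no}_{name.rsplit('/', 1)[-1]}"
                    found[out_name] = package.read(name)
    return found


# Self-made stand-in for one shape; "media/image1.png" is a hypothetical member.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("[Content_Types].xml", "<Types/>")
    z.writestr("media/image1.png", b"\x89PNG\r\n\x1a\n")
payload = base64.b64encode(buf.getvalue()).decode("ascii")
sample_html = f'<v:shape id="demo" o:gfxdata="{payload[:40]}\n{payload[40:]}"></v:shape>'
images = extract_from_gfxdata(sample_html)
```

On the real file, read the HTML into a string, call `extract_from_gfxdata`, and write each value to disk in binary mode under its key. Printing `package.namelist()` for the first shape is a quick way to see what the packages actually contain before trusting the extension filter.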

ledtester · « **Reply #3 on:** June 01, 2024, 03:08:42 am »

One way is to use a library like puppeteer to control a headless version of Chrome.

Some links:

- headless Chrome: https://developer.chrome.com/docs/puppeteer/ssr/
- puppeteer: https://pptr.dev/
- example puppeteer code to download the image data of an image ("Method 3"):

https://www.webshare.io/academy-article/puppeteer-download-images#:~:text=or%20unique%20selector.-,Best%20methods%20for%20downloading%20a%20single%20image,-Unique%20Selectors%20or

