May-31-2024, 09:04 PM
(This post was last modified: May-31-2024, 09:28 PM by Gribouillis.)
The code below is meant to extract the paragraphs that include an asterisk, together with their associated photos, from a PDF document into a Word document.
The code works in that it exports the paragraphs that include an asterisk into a Word document, but it does not grab only the photos associated with each paragraph; it sometimes exports photos that belong to another paragraph. How can I modify the code so that it ONLY exports the images directly below the paragraph?
import fitz  # PyMuPDF
from docx import Document
from docx.shared import Inches
import io
from PIL import Image

# Load the PDF document
pdf_document = fitz.open("Sample Home.pdf")

# Create a Word document
word_document = Document()

# Iterate through each page of the PDF
for page_num in range(pdf_document.page_count):
    page = pdf_document.load_page(page_num)
    blocks = page.get_text("blocks")
    for block in blocks:
        block_text = block[4]
        # Check if the paragraph includes an asterisk
        if '*' in block_text:
            # Add the paragraph to the Word document
            word_document.add_paragraph(block_text)
            # Extract images associated with this paragraph
            image_list = page.get_images(full=True)
            for image_index, img in enumerate(image_list):
                xref = img[0]
                base_image = pdf_document.extract_image(xref)
                image_bytes = base_image["image"]
                # Load image using PIL
                image = Image.open(io.BytesIO(image_bytes))
                image_filename = f"image_{page_num}_{image_index}.png"
                image.save(image_filename)
                # Add image to the Word document
                word_document.add_picture(image_filename, width=Inches(5))

# Save the Word document
word_document.save("Extracted_Paragraphs_and_Images.docx")
Gribouillis wrote May-31-2024, 09:28 PM:
Please post all code, output and errors (in its entirety) between their respective tags. Refer to BBCode help topic on how to post. Use the "Preview Post" button to make sure the code is presented as you expect before hitting the "Post Reply/Thread" button.
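A likely cause of the mismatch is that page.get_images(full=True) returns every image on the page with no position information, so each asterisk paragraph gets all of the page's photos. One possible direction, offered as a rough sketch rather than a tested fix for this particular PDF, is to compare bounding boxes: keep only the images whose top edge lies below the paragraph and above the next text block. The helper names images_below and paragraphs_with_images, the marker parameter, and the 2-point tolerance are made up for illustration. The sketch assumes a single-column layout and relies on page.get_image_info(xrefs=True), which reports a "bbox" and an "xref" for each image.

```python
def images_below(block_bbox, other_text_bboxes, images, tol=2.0):
    """Return the images placed between a paragraph and the next text block.

    Bounding boxes are (x0, y0, x1, y1) tuples in page coordinates, where y
    grows downward. `images` is a list of dicts that each carry a "bbox" key.
    `tol` is an assumed tolerance (in points) for slightly overlapping boxes.
    """
    bottom = block_bbox[3]
    # Top edge of the nearest text block that starts below this paragraph;
    # infinity if the paragraph is the last text block on the page.
    tops_below = [b[1] for b in other_text_bboxes if b[1] >= bottom]
    limit = min(tops_below, default=float("inf"))
    selected = [
        img for img in images
        if bottom - tol <= img["bbox"][1] < limit
    ]
    # Keep reading order: top to bottom, then left to right.
    return sorted(selected, key=lambda img: (img["bbox"][1], img["bbox"][0]))


def paragraphs_with_images(page, marker="*"):
    """Pair each text block containing `marker` with the images right below it.

    `page` is expected to behave like a PyMuPDF page object.
    """
    # Each block is (x0, y0, x1, y1, text, block_no, block_type); type 0 is text.
    text_blocks = [b for b in page.get_text("blocks") if b[6] == 0]
    images = page.get_image_info(xrefs=True)
    results = []
    for block in text_blocks:
        if marker in block[4]:
            others = [o[:4] for o in text_blocks if o is not block]
            results.append((block[4], images_below(block[:4], others, images)))
    return results
```

In the original loop, the idea would be to call paragraphs_with_images(page) once per page, add each returned text with word_document.add_paragraph, and then pass img["xref"] to pdf_document.extract_image for each image in the matching list, skipping any entry whose xref is 0. If the report uses several columns, the vertical test alone will probably not be enough and a horizontal overlap check would need to be added.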
