Python Forum
Python Code Help - pip install PyMuPDF python-docx pillow
#1
I have the code below. Its purpose is to extract the paragraphs that include an asterisk, along with their associated photos, from a PDF document into a Word document.

The code works and exports the paragraphs that include an asterisk into a Word doc, but it does not grab only the photos associated with each paragraph. It sometimes exports photos that belong to another paragraph. How can I modify the code to make sure it ONLY exports the images directly below the paragraph?
import fitz  # PyMuPDF
from docx import Document
from docx.shared import Inches
import io
from PIL import Image

# Load the PDF document
pdf_document = fitz.open("Sample Home.pdf")

# Create a Word document
word_document = Document()

# Iterate through each page of the PDF
for page_num in range(pdf_document.page_count):
    page = pdf_document.load_page(page_num)
    blocks = page.get_text("blocks")

    for block in blocks:
        block_text = block[4]

        # Check if the paragraph includes an asterisk
        if '*' in block_text:
            # Add the paragraph to the Word document
            word_document.add_paragraph(block_text)

            # Extract images associated with this paragraph
            image_list = page.get_images(full=True)
            for image_index, img in enumerate(image_list):
                xref = img[0]
                base_image = pdf_document.extract_image(xref)
                image_bytes = base_image["image"]

                # Load image using PIL
                image = Image.open(io.BytesIO(image_bytes))
                image_filename = f"image_{page_num}_{image_index}.png"
                image.save(image_filename)

                # Add image to the Word document
                word_document.add_picture(image_filename, width=Inches(5))

# Save the Word document
word_document.save("Extracted_Paragraphs_and_Images.docx")
Gribouillis wrote May-31-2024, 09:28 PM:
Please post all code, output and errors (in their entirety) between their respective tags. Refer to the BBCode help topic on how to post. Use the "Preview Post" button to make sure the code is presented as you expect before hitting the "Post Reply/Thread" button.
#2
Got a sample PDF to experiment on?
#3
I do. I tried uploading the PDF to my post, but the file is too large.
#4
Just a little bit of the PDF will do, say 4 or 5 pages, as long as they contain the type of data you are looking for.

I tried your code on the PDF manual for my new induction cooker. It worked ok!

import fitz  # PyMuPDF

pdf_document = fitz.open("/home/pedro/pdfs/pdfs/user_manual_ce208.pdf")
page = pdf_document.load_page(0)
# this returns tuples of the block coordinates and the text
blocks = page.get_text("blocks")
for block in blocks:
    # block[4] holds the text of the block
    if 'Instruction manual' in block[4]:
        print(type(block), block)
The above prints:

Output:
<class 'tuple'> (320.31451416015625, 132.03631591796875, 445.85235595703125, 147.72772216796875, 'Instruction manual\n', 3, 0)
<class 'tuple'> (118.06179809570312, 468.69854736328125, 195.3158721923828, 478.35479736328125, 'Instruction manual\n', 5, 0)
Not sure what the last two integers in each tuple represent!

Just looked them up. The tuple is:

Quote: (x0, y0, x1, y1, "lines in block", block_no, block_type)
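
Since every block carries its own coordinates, one possible way to tie images to a paragraph is by position rather than by page. The posted loop calls page.get_images(full=True) for every asterisk paragraph, and that call lists all images on the page, which would explain why photos from other paragraphs show up. Below is a minimal sketch of the geometry only, in plain Python so it can be tried without a PDF. The names Block, TEXT_BLOCK and images_below are made up for illustration and are not part of PyMuPDF. It assumes that y grows downward on the page and that a paragraph's photos sit between it and the next text block, which may not hold for every layout.

```python
from collections import namedtuple

# Field names follow the tuple layout quoted above. "Block" is a
# hypothetical helper name, not something PyMuPDF provides.
Block = namedtuple("Block", "x0 y0 x1 y1 text block_no block_type")

TEXT_BLOCK = 0  # block_type 0 is a text block, 1 is an image block


def images_below(paragraph, text_blocks, image_boxes):
    """Return the image boxes between `paragraph` and the next text block.

    paragraph   -- Block for the paragraph of interest
    text_blocks -- all text Blocks on the same page
    image_boxes -- (x0, y0, x1, y1) tuples, one per image on the page
    """
    # Top edge of the nearest text block that starts below this paragraph.
    # If there is none, the band extends to the bottom of the page.
    lower_limit = min(
        (b.y0 for b in text_blocks if b.y0 >= paragraph.y1),
        default=float("inf"),
    )
    # Keep only images whose top edge falls inside that vertical band.
    picked = [box for box in image_boxes if paragraph.y1 <= box[1] < lower_limit]
    # Return them in reading order: top to bottom, then left to right.
    return sorted(picked, key=lambda box: (box[1], box[0]))


# Made-up sample data standing in for one page.
blocks = [
    Block(50, 100, 300, 120, "* Cracked tile\n", 0, 0),
    Block(50, 400, 300, 420, "Roof is fine\n", 1, 0),
]
image_boxes = [(50, 130, 250, 380), (50, 430, 250, 600)]

starred = [b for b in blocks if b.block_type == TEXT_BLOCK and "*" in b.text]
for para in starred:
    print(para.text.strip(), images_below(para, blocks, image_boxes))
```

With a real page, the blocks could be built with Block(*block) from page.get_text("blocks"), and the image boxes could probably be taken from the "bbox" entries of page.get_image_info(xrefs=True), which also reports each image's xref for extract_image. I have not tried that wiring on the original PDF, so treat it as an assumption to check.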