Extracting Data from tables

DataExtrator · Nov-02-2021, 12:24 PM

Hi

I have code that extracts data from tables in a PDF, puts the data into columns, and writes it to a CSV file. The code works, but I have a few problems I need some help with. From the first column in the table I need to create a hierarchy so I can filter the data to find specific items. I have attached a photo.

I have a couple of problems with my code:

1. Level 1 takes any item written in UPPERCASE and puts it into a new column, but it also returns items that contain numbers. How can I disregard numbers when using .isupper()?

2. I need a level 2, but I am finding it difficult to write code that can recognise bold text in the table and split that data into a column. Any ideas what I could use?
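For problem 1, one possible approach is a small helper that rejects digits before applying the uppercase test. str.isupper() only looks at cased characters, so a string like 'SECTION 2' still counts as uppercase. The sketch below assumes that any item containing a digit should not be treated as a level 1 heading; is_level1_heading is a hypothetical name, not part of the original code.

```python
def is_level1_heading(text):
    """Return True for all-uppercase text that contains no digits."""
    # Non-string cells (e.g. NaN from an empty table cell) are never headings.
    if not isinstance(text, str):
        return False
    # str.isupper() ignores digits, so reject them explicitly.
    if any(ch.isdigit() for ch in text):
        return False
    return text.isupper()


print(is_level1_heading("GROUNDWORKS"))    # True
print(is_level1_heading("SECTION 2"))      # False
print(is_level1_heading("Concrete slab"))  # False
```

In the loop, is_level1_heading(df_combine['Item'][i]) would take the place of df_combine['Item'][i].isupper(). If headings are allowed to contain digits and only some numeric items should be skipped, the rule would need to be adjusted.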

# Determine hierarchy
for i, row in df_combine.iterrows():
    # Level 1: if it's all in uppercase it is a new level 1 hierarchy
    if df_combine['Item'][i].isupper():
        df_combine.loc[i, 'Level1'] = df_combine['Item'][i]
    # Otherwise use the previous level 1 hierarchy
    elif i > 0:
        df_combine.loc[i, 'Level1'] = df_combine['Level1'][i-1]
    
    # Future development: logic to determine level 2 hierarchy
    
    # Level 3: if it's not all uppercase but the first character is, it is a level 3 hierarchy
    if (not df_combine['Item'][i].isupper()) and df_combine['Item'][i][0].isupper():
        try:
            # NaN never equals itself, so `not x == x` is True only when the rate is missing.
            # If the next 2 rows don't start with an uppercase letter and have no rate: join both to this row
            if (not df_combine['Item'][i+1][0].isupper()) and (not df_combine['Total Rate£'][i+1] == df_combine['Total Rate£'][i+1]) and (not df_combine['Item'][i+2][0].isupper()) and (not df_combine['Total Rate£'][i+2] == df_combine['Total Rate£'][i+2]):
                df_combine.loc[i, 'Level3'] = df_combine['Item'][i] + ' ' + df_combine['Item'][i+1] + ' ' + df_combine['Item'][i+2]
            # Else if only the next row doesn't start with an uppercase letter and has no rate: join it to this row
            elif (not df_combine['Item'][i+1][0].isupper()) and (not df_combine['Total Rate£'][i+1] == df_combine['Total Rate£'][i+1]):
                df_combine.loc[i, 'Level3'] = df_combine['Item'][i] + ' ' + df_combine['Item'][i+1]
            # Else level 3 is just a one-liner
            else:
                df_combine.loc[i, 'Level3'] = df_combine['Item'][i]
        except Exception:
            # e.g. row i+1 or i+2 does not exist near the end of the table; leave Level3 unset
            pass
# If it doesn't have a level 3, use the one above (forward fill)
df_combine['Level3'] = df_combine['Level3'].ffill()
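
For the level 2 logic, bold text usually cannot be recovered once the table is in a DataFrame, because only the characters survive extraction. One possible approach is to check font names at the character level: PDF extraction libraries such as pdfplumber expose each character as a dict with a fontname key, and bold fonts often (but not always) have "Bold" in that name. The sketch below assumes that dict shape and that naming convention; is_bold_text, its threshold parameter, and the sample data are all hypothetical.

```python
def is_bold_text(chars, threshold=0.8):
    """Guess whether a run of characters is bold from its font names.

    chars: list of dicts with "text" and "fontname" keys (assumed shape,
    similar to the character dicts pdfplumber returns for a page).
    """
    # Ignore whitespace, which may carry an arbitrary font.
    visible = [c for c in chars if c["text"].strip()]
    if not visible:
        return False
    # Assumption: bold fonts have "bold" somewhere in the font name.
    bold = sum("bold" in c["fontname"].lower() for c in visible)
    return bold / len(visible) >= threshold


# Hypothetical sample data standing in for characters read from a PDF.
heading = [{"text": ch, "fontname": "ABCDEF+Arial-BoldMT"} for ch in "Drainage"]
body = [{"text": ch, "fontname": "ABCDEF+ArialMT"} for ch in "Pipe 100mm"]

print(is_bold_text(heading))  # True
print(is_bold_text(body))     # False
```

The characters belonging to each table row would still have to be matched to that row by position (for example by their vertical coordinate). That step depends on the extraction library and is not shown here.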
