Nov-02-2021, 12:24 PM
Hi
I have code which works to extract data from tables in a PDF. The code puts the data into columns and transfers to a CSV file. The code works but I have a few problems I need some help with. From the first column in the table I needed to create a hierarchical system so I can filter the data to find specific items. I have attached a photo
I have a couple of problems with my code:
1. The level 1 data is using any data in UPPERCASE and splitting into a new column but returning items with numbers, how can i disregard numbers when using .isupper()
2. I need a level 2 but finding it difficult to get a code which can recognise bold text in the table and split that data into a column. Any ideas what i could use?
I have code which works to extract data from tables in a PDF. The code puts the data into columns and transfers to a CSV file. The code works but I have a few problems I need some help with. From the first column in the table I needed to create a hierarchical system so I can filter the data to find specific items. I have attached a photo
I have a couple of problems with my code:
1. The level 1 data is using any data in UPPERCASE and splitting into a new column but returning items with numbers, how can i disregard numbers when using .isupper()
2. I need a level 2 but finding it difficult to get a code which can recognise bold text in the table and split that data into a column. Any ideas what i could use?
# Determine hierarchy
for i, row in df_combine.iterrows():
# Level 1: if its all in uppercase it is a new level 1 hierarchy
if df_combine['Item'][i].isupper():
df_combine.loc[i, 'Level1'] = df_combine['Item'][i]
# Otherwise use the previous level 1 heirarchy
elif i>0:
df_combine.loc[i, 'Level1'] = df_combine['Level1'][i-1]
# Future development: logic to determine level 2 heirarchy
# Level 3: If it's not all uppercase, but the first character is it is a level 3 heirarchy
if (not df_combine['Item'][i].isupper()) & (df_combine['Item'][i][0].isupper()):
try:
# If the next 2 rows are all lower, but it doesn't have a rate: join it to the first row above
if (not df_combine['Item'][i+1][0].isupper()) & (not df_combine['Total Rate£'][i+1]==df_combine['Total Rate£'][i+1]) & (not df_combine['Item'][i+2][0].isupper()) & (not df_combine['Total Rate£'][i+2]==df_combine['Total Rate£'][i+2]):
df_combine.loc[i, 'Level3'] = df_combine['Item'][i] + ' ' + df_combine['Item'][i+1]+ ' ' + df_combine['Item'][i+2]
# else if the next row is all lower, but it doesn't have a rate: join it to the row above
elif (not df_combine['Item'][i+1][0].isupper()) & (not df_combine['Total Rate£'][i+1]==df_combine['Total Rate£'][i+1]):
df_combine.loc[i, 'Level3'] = df_combine['Item'][i] + ' ' + df_combine['Item'][i+1]
# else level 3 is just a one-liner
else:
df_combine.loc[i, 'Level3'] = df_combine['Item'][i]
except:
pass
# If it doesn't have a level 3, use the one above
for i, row in df_combine.iterrows():
if (not df_combine['Level3'][i]==df_combine['Level3'][i]) & (i>0):
df_combine.loc[i, 'Level3'] = df_combine['Level3'][i-1]
