Dec-01-2022, 12:48 PM
Using python 3. I need to loop through a folder that contains excel files and each file has multiple sheets. How do I loop through all the files and all the sheets and extract to a dataframe?
I need to include however the sheet name and filename.. Some excel files the tabs names are in the order 11.22 10.22 09.22 etc while other sheets are 09.22 10.22 11.22. Not all sheets have all the columns so i need only sheets that have columns projectid and status. Also the dates in paymentdate columns aren't always in the same format
What I was able to accomplish only returns one excel file and all the worksheets for that file but I need for all files. Please help.
This is what I have so far:
I need to include however the sheet name and filename.. Some excel files the tabs names are in the order 11.22 10.22 09.22 etc while other sheets are 09.22 10.22 11.22. Not all sheets have all the columns so i need only sheets that have columns projectid and status. Also the dates in paymentdate columns aren't always in the same format
What I was able to accomplish only returns one excel file and all the worksheets for that file but I need for all files. Please help.
This is what I have so far:
from xlsxwriter import Workbook
import pandas as pd
import openpyxl
import glob
import os
path = 'filestoimport/*.xlsx'
for filepath in glob.glob(path):
xl = pd.ExcelFile(filepath)
# Define an empty list to store individual DataFrames
list_of_dfs = []
list_of_dferror= []
for sheet_name in xl.sheet_names:
df = xl.parse(sheet_name, usecols='A,D,N,B,C,E,F,G,H,I,J,K,L,M', header=0)
df.columns = df.columns.str.replace(' ', '')
df['sheetname'] = sheet_name # this adds `sheet_name` into the column
# using basename function from os
# module to print file name
file_name = os.path.basename(filepath)
df['sourcefilename'] = file_name
# only add sheets containing columns ['Status', 'ProjectID']
column_names = ['Status', 'ProjectID']
if set(column_names).issubset(df.columns):
df['Status'].fillna('', inplace=True)
df['Addedby'].fillna('', inplace=True)
# And append it to the list
list_of_dfs.append(df)
# Combine all DataFrames into one
data = pd.concat(list_of_dfs, ignore_index=True)
