Dec-27-2021, 11:02 AM
(This post was last modified: Dec-28-2021, 01:03 AM by shantanu97.)
I want to process data in Python that has 2 million rows and more than 100 columns. My code takes 20 minutes to create the output file. I don't know whether a different approach would be faster, or whether I can change something in my code to speed it up. Any help would be greatly appreciated!
import os
from pathlib import Path

import numpy as np
import pandas as pd

# csv_files (the input CSV paths) and path_save (the output folder) are defined elsewhere.
df2 = pd.DataFrame()
for fn in csv_files:  # Looping over CSV files
    all_dfs = pd.read_csv(fn, header=None)
    # Finding non-null columns
    non_null_columns = [col for col in all_dfs.columns if all_dfs.loc[:, col].notna().any()]
    # print(non_null_columns)
    for i in range(0, len(all_dfs)):  # Row loop
        SourceFile = ""
        RowNumber = ""
        ColumnNumber = ""
        Value = ""
        # With header=None the column labels are the 0-based column positions,
        # so each label can be used directly with iloc.
        for col in non_null_columns:  # Column loop
            SourceFile = Path(fn).name
            RowNumber = i + 1
            ColumnNumber = col + 1
            Value = all_dfs.iloc[i, col]
            # DataFrame.append was removed in pandas 2.0; pd.concat does the same job.
            df2 = pd.concat([df2, pd.DataFrame({
                "SourceFile": [SourceFile],
                "RowNumber": [RowNumber],
                "ColumnNumber": [ColumnNumber],
                "Value": [Value]
            })], ignore_index=True)
            # print(df2)
df2['Value'] = df2['Value'].replace('', np.nan)  # Removing null values
df2.dropna(subset=['Value'], inplace=True)
df2.to_csv(os.path.join(path_save, "Compiled.csv"), index=False)
print("Output: Compiled.csv")
The Python code is attached.
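One likely reason for the slow run time is that the output DataFrame is rebuilt once per cell, so the work grows much faster than the number of cells. A minimal sketch of a vectorized alternative follows. The helper name `to_long_format` is hypothetical, and the sketch assumes that ColumnNumber should be the 1-based position of the column in the source file and that rows should be emitted in the same row-by-row order as the loops above.

```python
import numpy as np
import pandas as pd


def to_long_format(df: pd.DataFrame, source_file: str) -> pd.DataFrame:
    """Return one row per non-empty cell (hypothetical helper, not from the post)."""
    # Flatten the whole table at once instead of visiting each cell in Python.
    values = df.to_numpy(dtype=object)
    n_rows, n_cols = values.shape
    long = pd.DataFrame({
        "SourceFile": source_file,
        # Row-major order: row 1 with every column, then row 2, and so on.
        "RowNumber": np.repeat(np.arange(1, n_rows + 1), n_cols),
        "ColumnNumber": np.tile(np.arange(1, n_cols + 1), n_rows),
        "Value": values.ravel(),
    })
    # Drop missing cells and empty strings in a single filtering step.
    keep = long["Value"].notna() & (long["Value"] != "")
    return long[keep].reset_index(drop=True)
```

To combine several files, the per-file results could be collected in a list and joined with a single `pd.concat(frames, ignore_index=True)` before one call to `to_csv`, so the output is built once instead of once per cell.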
