My python code is running very slow on millions of records

shantanu97 · (This post was last modified: Dec-28-2021, 01:03 AM by shantanu97.)

I want to process data in Python that has 2 million rows and more than 100 columns. My code takes 20 minutes to create the output file. I don't know whether there is something else that would make my code faster, or whether I can change something in it. Any help would be greatly appreciated!

df2 = pd.DataFrame()
for fn in csv_files:  # Looping Over CSV Files
    all_dfs = pd.read_csv(fn, header=None)

    # Finding non-null columns
    non_null_columns = [col for col in all_dfs.columns if all_dfs.loc[:, col].notna().any()]

    # print(non_null_columns)
    for i in range(0, len(all_dfs)):  # Row Loop
        SourceFile = ""
        RowNumber = ""
        ColumnNumber = ""
        Value = ""
        for j in range(0, len(non_null_columns)):  # Column Loop
            SourceFile = Path(fn.name)
            RowNumber = i+1
            ColumnNumber = j+1
            Value = all_dfs.iloc[i, j]
            df2 = df2.append(pd.DataFrame({
                "SourceFile": [SourceFile],
                "RowNumber": [RowNumber],
                "ColumnNumber": [ColumnNumber],
                "Value": [Value]
            }), ignore_index=True)
            # print(df2)
df2['Value'].replace('', np.nan, inplace=True)  # Removing Null Value
df2.dropna(subset=['Value'], inplace=True)
df2.to_csv(os.path.join(path_save, f"Compiled.csv"), index=False)
print("Output: Compiled.csv")

Python code attached.

paul18fr · (This post was last modified: Dec-27-2021, 12:23 PM by paul18fr.)

What type of data are you dealing with in the original CSV file? Pure numbers? Strings? Both?

Appending is costly, and maybe loops can be avoided using vectorisation if data are numbers.
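As a rough illustration of the vectorisation idea, here is a minimal sketch that builds the SourceFile / RowNumber / ColumnNumber / Value table in one step instead of appending cell by cell. The function name `to_long_format` and the sample data are hypothetical, and the sketch assumes that column numbers should refer to positions in the original file and that both NaN and empty strings count as empty. It works on an object array, so it is not limited to numeric data.

```python
import numpy as np
import pandas as pd


def to_long_format(frame, source_name):
    """Turn a wide table into one row per non-empty cell, without Python loops."""
    # Object dtype keeps strings, numbers and date strings exactly as they were read.
    values = frame.to_numpy(dtype=object)
    n_rows, n_cols = values.shape
    long = pd.DataFrame({
        "SourceFile": source_name,
        # ravel() flattens row by row, so row numbers repeat and column numbers cycle.
        "RowNumber": np.repeat(np.arange(1, n_rows + 1), n_cols),
        "ColumnNumber": np.tile(np.arange(1, n_cols + 1), n_rows),
        "Value": values.ravel(),
    })
    # Drop missing cells and empty strings in a single filter.
    keep = long["Value"].notna() & (long["Value"] != "")
    return long[keep].reset_index(drop=True)


# Hypothetical sample: mixed strings, numbers and a date string; last column is empty.
sample = pd.DataFrame([["a", 1.5, None], ["", "2021-12-27", None]])
cells = to_long_format(sample, "test.csv")
print(cells)
```

The whole table is created with a single DataFrame constructor call, so the cost grows with the number of cells rather than with the square of it, which is what repeated appending tends to cause.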

**Larz60+** · Dec-27-2021, 11:18 PM

I expect that you are paging memory.
How much memory do you have?
What paul18fr states about appending is true and should be avoided.
Do you need to have everything resident at the same time?
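If everything does not need to be resident at once, one option is to read each file in chunks and write each processed chunk straight to the output. The sketch below is only an illustration: `stream_csv`, the `transform` callback and the tiny in-memory sample are hypothetical names and data, and a chunk size of 100,000 rows is an arbitrary assumption.

```python
import io

import pandas as pd


def stream_csv(source, sink, transform, chunksize=100_000):
    """Read source in chunks, transform each chunk, and append the result to sink."""
    first = True
    rows_written = 0
    for chunk in pd.read_csv(source, header=None, chunksize=chunksize):
        result = transform(chunk)
        # Write the header only once, then keep appending to the same handle.
        result.to_csv(sink, header=first, index=False)
        first = False
        rows_written += len(result)
    return rows_written


# Hypothetical in-memory input; a real run would pass a file path and an open output file.
source = io.StringIO("a,1\n,\nb,2\nc,\n")
sink = io.StringIO()
written = stream_csv(source, sink, lambda c: c.dropna(how="all"), chunksize=2)
print(written, "rows written")
```

Only one chunk is held in memory at a time, so peak memory depends on `chunksize` rather than on the size of the file.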

shantanu97 · Dec-28-2021, 12:50 AM

(Dec-27-2021, 12:22 PM)paul18fr Wrote: What type of data are you dealing with in the original CSV file? Pure numbers? Strings? Both?

Appending is costly, and maybe loops can be avoided using vectorisation if data are numbers.

It consists of strings, numbers, and dates.

shantanu97 · Dec-28-2021, 12:54 AM

(Dec-27-2021, 11:18 PM)Larz60+ Wrote: I expect that you are paging memory.
How much memory do you have?
What paul18fr states about appending is true and should be avoided.
Do you need to have everything resident at the same time?

I use a very powerful PC: 24 GB of RAM, a 250 GB hard disk, and an i7 processor. If appending is costly, what should I use instead? Is there any way to make the loop faster?

**Larz60+** · Dec-28-2021, 02:23 AM

untested, but close:

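import glob
import os

import pandas as pd

path = "Your csv file path"  # placeholder: folder that contains the CSV files
filelist = glob.glob(os.path.join(path, "*.csv"))

# Read every file, then concatenate once instead of appending row by row
df = pd.concat((pd.read_csv(f) for f in filelist))
df = df.fillna('')  # replace NaN with empty strings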
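The snippet above drops the information about which file each row came from, and it treats the first line of every file as a header. A minimal sketch of one way to keep both, assuming the files have no header row (as in the original `pd.read_csv(fn, header=None)` call), might look like this. `load_folder` and the temporary demo files are hypothetical and exist only to make the example self-contained.

```python
import glob
import os
import tempfile

import pandas as pd


def load_folder(folder):
    """Read every CSV in folder without a header row and tag rows with their file name."""
    frames = []
    for csv_path in sorted(glob.glob(os.path.join(folder, "*.csv"))):
        frame = pd.read_csv(csv_path, header=None)
        frame.insert(0, "SourceFile", os.path.basename(csv_path))
        frames.append(frame)
    # One concat at the end; note that pd.concat raises ValueError if no files matched.
    return pd.concat(frames, ignore_index=True)


# Hypothetical demo data written to a temporary folder.
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, "a.csv"), "w") as handle:
        handle.write("x,1\ny,2\n")
    with open(os.path.join(demo_dir, "b.csv"), "w") as handle:
        handle.write("z,3\n")
    combined = load_folder(demo_dir)

print(combined)
```

Collecting the frames in a list and concatenating once avoids copying the growing result on every iteration.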

shantanu97 · Dec-28-2021, 02:34 AM

(Dec-28-2021, 02:23 AM)Larz60+ Wrote: untested, but close:

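import glob
import os

import pandas as pd

path = "Your csv file path"  # placeholder: folder that contains the CSV files
filelist = glob.glob(os.path.join(path, "*.csv"))

# Read every file, then concatenate once instead of appending row by row
df = pd.concat((pd.read_csv(f) for f in filelist))
df = df.fillna('')  # replace NaN with empty strings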

I have attached a test.csv file for testing.

**Larz60+** · Dec-28-2021, 11:02 AM

Please run tests and report results.
