painfully slow runtime when handling data

dadazhu · (This post was last modified: Jan-19-2023, 05:07 AM by dadazhu.)

Hi everyone,

I am new to Python, and just completed some codes as below that take forever to run. It seemed working if I only run the first 10 rows in the main DataFrame, I hope someone can assist me in improving the codes, so that it can run much faster. Many thanks

from datetime import timedelta
import numpy as np
import pandas as pd

# Get data from the given excel file
filepath = r'C:\Users\User\Desktop\Power BI\All.xlsx'

df_CoC = pd.read_excel(filepath, usecols=['ID, 'Is there time loss?', 'MC Capacity Date From',
                                          'MC Capacity Date To'], sheet_name='CoC') #18000 rows
df_WRS = pd.read_excel(filepath, usecols=['ID', 'Date From', 'Date To'], sheet_name='WRS') #22000 rows
df_Open = pd.read_excel(filepath, sheet_name='Open')  #2500 rows
Count_Open = df_Open['ID'].count()
df_Open['CoC_MissingWRS'] = None
df_Open['WRS_MissingCoC'] = None

####### Fill new 'CoC_MissingWRS' column in open query ########
for i in range(0, Count_Open):
    if i > 90: break
    imn = df_Open.loc[i, 'Injury Management: Injury Management Number']

    #Fill WRS date array
    arrWRSDates = np.array([])
    filtered_df = df_WRS[df_WRS['ID'] == imn]
    if not filtered_df.empty:
        Count_WRS = filtered_df['ID'].count()-1
        for w in range(0, Count_WRS):
                #Date_From is always available
                Date_From = filtered_df.iloc[w]['Date From'].date()
                if filtered_df.iloc[w]['Date To'] is None: Date_To = Date_From
                else: Date_To = filtered_df.iloc[w]['Date To'].date()
                if Date_To < Date_From: Date_To = Date_From
                adate = Date_From
                while adate <= Date_To:
                    if adate not in arrWRSDates: arrWRSDates = np.append(arrWRSDates, adate)
                    adate += timedelta(days=1)

    # Fill Timeloss & all CoC date array
    arrTLCoCDates = np.array([])
    arrAllCoCDates = np.array([])
    filtered_df = df_CoC[df_CoC['ID'] == imn]
    if not filtered_df.empty:
        Count_CoC= filtered_df['ID'].count() - 1
        for c in range(0, Count_CoC):
                Date_From = filtered_df.iloc[c]['MC Capacity Date From'].date()
                if filtered_df.iloc[c]['MC Capacity Date To'] is None: Date_To = Date_From
                else: Date_To = filtered_df.iloc[c]['MC Capacity Date To'].date()
                if Date_To < Date_From or Date_To is None: Date_To = Date_From
                adate = Date_From
                while adate <= Date_To:
                    if df_CoC.iloc[c]['Is there time loss?'] == 'Yes':
                        if adate not in arrTLCoCDates: arrTLCoCDates = np.append(arrTLCoCDates, adate)
                    else:
                        if adate not in arrAllCoCDates: arrAllCoCDatesCoCDates = np.append(arrAllCoCDates, adate)
                    adate += timedelta(days=1)


    ##### Check if CoC missing WRS #####
    bMissingWRS = 'No'
    if len(arrTLCoCDates) > 0:
        adate = min(arrTLCoCDates)
        while adate <= max(arrTLCoCDates) and bMissingWRS == 'No':
            if adate not in arrWRSDates and adate.weekday() < 5: bMissingWRS = 'Yes'
    df_Open.loc[i, 'CoC_MissingWRS'] = bMissingWRS

    ##### Check if WRS missing CoC #####
    bMissingCoC = 'No'
    if len(arrWRSDates) > 0:
        adate = min(arrWRSDates)
        while adate <= max(arrWRSDates) and bMissingCoC == 'No':
            if adate not in arrAllCoCDates and adate.weekday() < 5: bMissingCoC = 'Yes'
    df_Open.loc[i, 'WRS_MissingCoC'] = bMissingCoC

jefsummers · Jan-19-2023, 05:06 PM

Where is it slow?
Suggest you put sections into cells in a Jupyter notebook and execute the code segments individually. That should narrow it down. Then we can more likely help

dadazhu · Jan-19-2023, 10:14 PM

Hi jefsummers,

Thank you for your advice, I tried to run the code in Power BI and PyCharm Community, I never got to see the result, unless I run it for the first 10 rows or something. I will look Jupyter notebook up, and will get back to you. Thank you again.

***snippsat*** · (This post was last modified: Jan-20-2023, 07:11 PM by snippsat.)

The way you do it will be very slow,every time write a loop in Pandas it often the wrong approch.
The way Pands is build should try to use Vectorization it will be a lot faster,look at this blog post.

Possibly Related Threads…
Thread		Author	Replies	Views	Last Post
	Extract args=, value that was passed to Multiprocessing.Proc object during runtime?	haihal	1	1,118	Dec-08-2024, 07:04 AM Last Post: Gribouillis
	change dataclass to frozen at runtime	jpanico	2	2,434	Oct-06-2024, 09:38 AM Last Post: snippsat
	class and runtime	akbarza	4	2,451	Mar-16-2024, 01:32 PM Last Post: deanhystad
	Big O runtime nested for loop and append	yarinsh	4	3,999	Dec-31-2022, 11:50 PM Last Post: stevendaprano
	python exception handling handling .... with traceback	mg24	3	4,444	Nov-09-2022, 07:29 PM Last Post: Gribouillis
	Reducing runtime memory usage in Cpython interpreter	david_the_graower	2	3,609	Oct-18-2021, 09:56 PM Last Post: david_the_graower
	Object reference in Dict - not resolved at runtime	benthomson	2	3,193	Apr-02-2020, 08:50 AM Last Post: benthomson
	PyCharm asking for VC++ runtime 14.0 whereas I have already installed VC++ 19.0	SarmadiRizvi	1	2,978	Apr-02-2020, 06:17 AM Last Post: snippsat
	EOFError: EOF when reading a line - Runtime Error	RavCOder	6	12,943	Sep-27-2019, 12:22 PM Last Post: RavCOder
	Different runtime programs	mhvozdesky	1	67,073	Oct-24-2018, 02:50 PM Last Post: ichabod801

painfully slow runtime when handling data

User Panel Messages

Announcements