Slow loop due to row & column consolidation

Etrigan · Nov-27-2018, 05:06 AM

After reading a file I have a DataFrame with 6 columns named 'Spclty1' to 'Spclty6' and 6 corresponding columns named 'StartDt1' to 'StartDt6', along with other identifying info. Any or all of the Spclty/StartDt field pairs can be empty. Each row is one user but many users have more than 1 row (and can have more than 6 specialties). I want to consolidate this DataFrame into 1 row per user with a single column that contains all of each user's specialty codes and dates as key-value pairs.

I know better than to append to a DataFrame within a loop, so I'm appending to a list instead, then converting to a DataFrame at the end. My code is working -- it only takes 4.5 minutes on a 7 MB input file -- but it never completes on a 150 MB input file. It seems to get slower as the output list gets larger, which is contrary to my understanding that lists have little overhead. Am I doing something dumb, or is there an obscure issue at work?

# concatenate 6 fields into 1 for both Spclty* and StartDt*:
df_tmp['Spclty'] = df_tmp[['Spclty1', 'Spclty2', 'Spclty3', 'Spclty4', 'Spclty5', 'Spclty6']].values.tolist()
df_tmp['StartDt'] = df_tmp[['StartDt1', 'StartDt2', 'StartDt3', 'StartDt4', 'StartDt5', 'StartDt6']].values.tolist()

# remove missing values:
df_tmp['Spclty'] = [[x for x in df_tmp['Spclty'][row] if str(x) != 'nan'] for row in range(df_tmp.shape[0])]
df_tmp['StartDt'] = [[x for x in df_tmp['StartDt'][row] if str(x) != 'None'] for row in range(df_tmp.shape[0])]

# convert to a dict-like set:
df_tmp['Spclty'] = [set(zip(df_tmp['Spclty'][row], df_tmp['StartDt'][row])) for row in range(df_tmp.shape[0])]

# drop columns no longer needed:
df_tmp = df_tmp.drop(['Spclty1', 'Spclty2', 'Spclty3', 'Spclty4', 'Spclty5', 'Spclty6', 
                      'StartDt1', 'StartDt2', 'StartDt3', 'StartDt4', 'StartDt5', 'StartDt6'],
                    axis = 1)

# loop over all UserNbr: 
#   consolidate specialty fields into dict-like sets (to remove redundant codes);
#   output one row per user to new data frame
out_rows = list() 
spcltycol = df_tmp.column.get_loc('Spclty')                
all_UserNbr = df_tmp['UserNbr'].unique()
for user in all_UserNbr:
    df_user = df_tmp.loc[df_tmp['UserNbr'] == user]
    # select rows with no RegEndDt:
    df_user_open = df_user.loc[np.isnat(df_user['RegEndDt'])]
    if df_user_open.shape[0] > 0:   
        # have rows with empty RegEndDt, so treat these as current:
        open_combined = df_user_open.iloc[0, spcltycol]   # capture 1st row
        for row in range(1, df_user_open.shape[0]):       # union with any subsequent rows
            open_combined = open_combined.union(df_user_open.iloc[row, spcltycol])
        new_row = df_user_open.drop(['Spclty', 'StartDt'], axis = 1).iloc[0].tolist()
        new_row.append(open_combined)
        out_rows.append(new_row)

# construct new dataframe with no redundant UserID rows:
df_out = pd.DataFrame(out_rows, 
                      columns = ['FamilyName', 'GivenName', 'MiddleName', 
                                 'UserNbr',
                                 'AddressLine1', 'AddressLine2', 'AddressLine3',
                                 'Postcode', 'State', 
                                 'RegStartDt', 'RegEndDt', 'Title', 'Closed', 'Spclty'])
# convert Spclty sets to dicts:
df_out['Spclty'] = [dict(df_out['Spclty'][row]) for row in range(df_out.shape[0])]

Possibly Related Threads…
Thread		Author	Replies	Views	Last Post
	UART & I2C slow down a loop	trix	4	2,307	Dec-28-2023, 05:14 PM Last Post: trix
	Loop row and column + 1	Scott	3	2,687	May-22-2022, 11:41 AM Last Post: Axel_Erfurt
	Need help with For Loop logic for reading docx Table column data	vintysaw	2	6,779	Jan-10-2020, 06:36 AM Last Post: vintysaw
	Loop through folder of Excel Files and extract single column	fioranosnake	2	6,129	Oct-28-2019, 05:19 PM Last Post: fioranosnake
	change array column values without loop	khalidreemy	2	6,504	May-05-2019, 09:05 AM Last Post: DeaD_EyE

Slow loop due to row & column consolidation

User Panel Messages

Announcements