Nov-27-2018, 05:06 AM
After reading a file I have a DataFrame with 6 columns named 'Spclty1' to 'Spclty6' and 6 corresponding columns named 'StartDt1' to 'StartDt6', along with other identifying info. Any or all of the Spclty/StartDt field pairs can be empty. Each row is one user but many users have more than 1 row (and can have more than 6 specialties). I want to consolidate this DataFrame into 1 row per user with a single column that contains all of each user's specialty codes and dates as key-value pairs.
I know better than to append to a DataFrame within a loop, so I'm appending to a list instead, then converting to a DataFrame at the end. My code is working -- it only takes 4.5 minutes on a 7 MB input file -- but it never completes on a 150 MB input file. It seems to get slower as the output list gets larger, which is contrary to my understanding that lists have little overhead. Am I doing something dumb, or is there an obscure issue at work?
I know better than to append to a DataFrame within a loop, so I'm appending to a list instead, then converting to a DataFrame at the end. My code is working -- it only takes 4.5 minutes on a 7 MB input file -- but it never completes on a 150 MB input file. It seems to get slower as the output list gets larger, which is contrary to my understanding that lists have little overhead. Am I doing something dumb, or is there an obscure issue at work?
# concatenate 6 fields into 1 for both Spclty* and StartDt*:
df_tmp['Spclty'] = df_tmp[['Spclty1', 'Spclty2', 'Spclty3', 'Spclty4', 'Spclty5', 'Spclty6']].values.tolist()
df_tmp['StartDt'] = df_tmp[['StartDt1', 'StartDt2', 'StartDt3', 'StartDt4', 'StartDt5', 'StartDt6']].values.tolist()
# remove missing values:
df_tmp['Spclty'] = [[x for x in df_tmp['Spclty'][row] if str(x) != 'nan'] for row in range(df_tmp.shape[0])]
df_tmp['StartDt'] = [[x for x in df_tmp['StartDt'][row] if str(x) != 'None'] for row in range(df_tmp.shape[0])]
# convert to a dict-like set:
df_tmp['Spclty'] = [set(zip(df_tmp['Spclty'][row], df_tmp['StartDt'][row])) for row in range(df_tmp.shape[0])]
# drop columns no longer needed:
df_tmp = df_tmp.drop(['Spclty1', 'Spclty2', 'Spclty3', 'Spclty4', 'Spclty5', 'Spclty6',
'StartDt1', 'StartDt2', 'StartDt3', 'StartDt4', 'StartDt5', 'StartDt6'],
axis = 1)
# loop over all UserNbr:
# consolidate specialty fields into dict-like sets (to remove redundant codes);
# output one row per user to new data frame
out_rows = list()
spcltycol = df_tmp.column.get_loc('Spclty')
all_UserNbr = df_tmp['UserNbr'].unique()
for user in all_UserNbr:
df_user = df_tmp.loc[df_tmp['UserNbr'] == user]
# select rows with no RegEndDt:
df_user_open = df_user.loc[np.isnat(df_user['RegEndDt'])]
if df_user_open.shape[0] > 0:
# have rows with empty RegEndDt, so treat these as current:
open_combined = df_user_open.iloc[0, spcltycol] # capture 1st row
for row in range(1, df_user_open.shape[0]): # union with any subsequent rows
open_combined = open_combined.union(df_user_open.iloc[row, spcltycol])
new_row = df_user_open.drop(['Spclty', 'StartDt'], axis = 1).iloc[0].tolist()
new_row.append(open_combined)
out_rows.append(new_row)
# construct new dataframe with no redundant UserID rows:
df_out = pd.DataFrame(out_rows,
columns = ['FamilyName', 'GivenName', 'MiddleName',
'UserNbr',
'AddressLine1', 'AddressLine2', 'AddressLine3',
'Postcode', 'State',
'RegStartDt', 'RegEndDt', 'Title', 'Closed', 'Spclty'])
# convert Spclty sets to dicts:
df_out['Spclty'] = [dict(df_out['Spclty'][row]) for row in range(df_out.shape[0])]
