Python Forum
How can I multithread to optimize a groupby task?
#1
I wrote a program that aggregates a large dataframe by one variable, userid. It runs a groupby to calculate the mean, min and max of 10 variables for each userid. Below is a proxy for that code: first it builds the dataframe, then it aggregates over userid. The code takes about 20 minutes to run. I would like to speed it up with multithreading.

from datetime import datetime

import numpy as np
import random
import pandas as pd
 
print('Initial time:',datetime.now().strftime("%H:%M:%S"))

def fun_user_id(start, end, step):
    num = np.linspace(start, end,(end-start)
                      *int(1/step)+1).tolist()
    return [round(i, 0) for i in num]

def fun_rand_num():
    return list(map(lambda x: random.randint(300,800), range(1, 100000001)))

userid=fun_user_id(1,100000001,.5)
var1=fun_rand_num()
var2=fun_rand_num()
var3=fun_rand_num()
var4=fun_rand_num()
var5=fun_rand_num()
var6=fun_rand_num()
var7=fun_rand_num()
var8=fun_rand_num()
var9=fun_rand_num()
var10=fun_rand_num()


df = pd.DataFrame(list(zip(userid,var1, var2,var3,var4,var5,var6,var7,var8,var9,var10)),
               columns =['userid','var1', 'var2','var3','var4','var5','var6', 'var7','var8','var9','var10'])

# Same three statistics for each of var1..var10
varlistdic = {f"var{i}": ["mean", "max", "min"] for i in range(1, 11)}

gr=df.groupby(['userid'])
df_sum=gr.agg(varlistdic)
df_sum=df_sum.pipe(lambda x: x.set_axis(x.columns.map('_'.join),axis=1))
df_sum.reset_index(inplace=True)

print('End Time:',datetime.now().strftime("%H:%M:%S"))
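
One possible direction, offered as a sketch rather than a measured fix for the 20-minute run: because each userid is aggregated independently, the rows can be split into chunks so that no userid spans two chunks, each chunk can be aggregated in its own thread, and the partial results can simply be concatenated. The helper names `aggregate_chunk` and `parallel_groupby`, the worker count, and the small synthetic frame are all assumptions made for illustration. Whether threads actually help depends on how much of the pandas aggregation releases the GIL, so it is worth timing this against the plain groupby.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pandas as pd

STATS = ["mean", "max", "min"]


def aggregate_chunk(chunk, value_cols):
    """Aggregate one chunk; its userids must not appear in any other chunk."""
    out = chunk.groupby("userid").agg({col: STATS for col in value_cols})
    # Flatten the (column, statistic) MultiIndex into names like 'var1_mean'.
    out.columns = ["_".join(pair) for pair in out.columns]
    return out


def parallel_groupby(df, value_cols, n_workers=4):
    """Hypothetical helper: chunk by userid, aggregate per thread, concatenate."""
    # Map every userid to an integer code, then to a chunk number, so that
    # all rows of a given userid end up in exactly one chunk.
    codes, _ = pd.factorize(df["userid"])
    chunks = [df[codes % n_workers == i] for i in range(n_workers)]
    chunks = [c for c in chunks if not c.empty]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = list(pool.map(lambda c: aggregate_chunk(c, value_cols), chunks))
    # Chunks share no userid, so concatenation is enough to combine them.
    return pd.concat(parts).sort_index().reset_index()


# Small synthetic stand-in for the real data (sizes chosen arbitrarily).
rng = np.random.default_rng(0)
n_rows = 10_000
value_cols = [f"var{i}" for i in range(1, 11)]
small_df = pd.DataFrame(rng.integers(300, 801, size=(n_rows, 10)), columns=value_cols)
small_df.insert(0, "userid", rng.integers(1, 501, size=n_rows))

result = parallel_groupby(small_df, value_cols)
```

If threads show no gain, the same chunking can be tried with `concurrent.futures.ProcessPoolExecutor`, using a module-level function instead of the lambda and accepting the cost of pickling each chunk to the workers.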
Gribouillis wrote Jun-30-2023, 03:43 PM:
Please post all code, output and errors (in their entirety) between their respective tags. Refer to the BBCode help topic on how to post. Use the "Preview Post" button to make sure the code is presented as you expect before hitting the "Post Reply/Thread" button.