Oct-16-2019, 12:39 PM
(This post was last modified: Oct-16-2019, 12:39 PM by AlekseyPython.)
Python 3.7.3, pandas 0.25.1
I wrote a program that uses a pandas DataFrame, but it runs tens (or even a hundred) times slower than the same logic written with dict and tuple. The index is built on the lightest data type, an 8-bit unsigned int ('u1'). When adding new data, I always sort the index (although without sorting, the performance is just as bad). Moreover, the file contains about 6,000,000 lines but only 2 different clients, so the index is rebuilt very rarely and the DataFrame holds very few rows.
Why?
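import csv
import numpy as np
import pandas as pd

# create an empty DataFrame with typed columns, indexed by client
dtype = np.dtype([('day_begin', 'u4'), ('day_end', 'u4'), ('price_begin', 'f4'), ('price_end', 'f4'), ('Client', 'u1')])
auxiliary_array = np.empty(0, dtype=dtype)
periods_clients = pd.DataFrame(auxiliary_array)
periods_clients.set_index(['Client'], inplace=True)

# fill the DataFrame from the file (path_file is defined elsewhere)
with open(path_file, newline='') as csv_file:
    fieldnames = ['Date', 'Client', 'Price']
    reader = csv.DictReader(csv_file, fieldnames=fieldnames, delimiter=';')
    for dict_str in reader:
        # assumption: the values are parsed from the row as numbers;
        # the original snippet did not show where current_date and current_price come from
        Client = int(dict_str['Client'])
        current_date = int(dict_str['Date'])
        current_price = float(dict_str['Price'])
        if Client not in periods_clients.index:
            # first row for this client: the period begins and ends here
            periods_clients.loc[Client] = [current_date, current_date, current_price, current_price]
            periods_clients.sort_index(level=0, inplace=True)
        else:
            # write through .loc[row, column]; chained attribute assignment may modify a copy
            periods_clients.loc[Client, 'day_end'] = current_date
            periods_clients.loc[Client, 'price_end'] = current_price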
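
For comparison, here is a minimal sketch of the dict-and-tuple variant I am measuring against. It is not the exact program: the function name `collect_periods`, the sample data, and the assumption that dates are stored as integers are made up for illustration. The per-row work is a plain dict lookup and a tuple rebuild, with no pandas calls inside the loop.

```python
import csv
import io


def collect_periods(csv_file):
    """Return {client: (day_begin, day_end, price_begin, price_end)}."""
    fieldnames = ['Date', 'Client', 'Price']
    reader = csv.DictReader(csv_file, fieldnames=fieldnames, delimiter=';')
    periods = {}
    for row in reader:
        client = int(row['Client'])
        date = int(row['Date'])        # assumption: dates are stored as integers
        price = float(row['Price'])
        if client not in periods:
            # first row for this client: the period begins and ends here
            periods[client] = (date, date, price, price)
        else:
            # keep the beginning values, replace the end values
            day_begin, _, price_begin, _ = periods[client]
            periods[client] = (day_begin, date, price_begin, price)
    return periods


# hypothetical sample data in the same 'Date;Client;Price' layout
sample = io.StringIO("20190101;1;10.0\n20190102;2;20.0\n20190103;1;11.5\n")
periods = collect_periods(sample)
print(periods)
```

If a DataFrame is still needed afterwards, it could probably be built once at the end, for example with `pd.DataFrame.from_dict(periods, orient='index', columns=['day_begin', 'day_end', 'price_begin', 'price_end'])`, instead of being updated on every row.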
