Jul-28-2020, 08:30 PM
Hi Everyone,
I am trying to read and sort a large text file (10 GB) in chunks. The aim is to sort the data by column 2. The code below reads the (huge) file successfully, but I am struggling to sort it. Can anyone help, please? I can sort the individual chunks (via argsort), but I don't know how to merge them into the final sorted Nx4 array (which I plan to store in an HDF5 file).
import numpy as np
import pandas as pd

filename = "file.txt"
with open(filename) as f:
    nrows = sum(1 for _ in f)  # number of rows in the file
ncols = 4  # number of columns
idata = np.empty((nrows, ncols), dtype=np.float32)  # array that will hold all the data
i = 0
# with chunksize set, read_csv returns an iterator of DataFrames (up to 10,000 x 4 each)
chunks = pd.read_csv(filename, chunksize=10000,
                     names=['ch', 'tmstp', 'lt', 'rt'])
for chunk in chunks:
    m = chunk.shape[0]  # rows in this chunk (10,000, fewer for the last one)
    idata[i:i+m, :] = chunk.to_numpy()  # copy the chunk into idata
    i += m
print(idata)  # contains all data read from file.txt
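
For reference, the "sort each chunk, then merge" idea is usually called an external merge sort. Below is a minimal sketch of it using only the standard library. It is not tested on the real 10 GB file. It assumes the file is comma-separated, has no header or blank lines, and that column 2 is numeric. The names external_sort, sort_key and lines_per_chunk are made up for illustration.

```python
import heapq
import os
import tempfile
from itertools import islice


def sort_key(line, col=1):
    # Column 2 of the file is index 1; assumes comma-separated numeric fields.
    return float(line.split(",")[col])


def external_sort(src, dst, lines_per_chunk=1_000_000, col=1):
    """Sort the lines of src by one numeric column and write them to dst."""
    key = lambda line: sort_key(line, col)
    run_paths = []
    try:
        # Pass 1: sort each chunk in memory and write it out as a sorted run.
        with open(src) as f:
            while True:
                lines = list(islice(f, lines_per_chunk))
                if not lines:
                    break
                if not lines[-1].endswith("\n"):
                    lines[-1] += "\n"  # last line of the file may lack a newline
                lines.sort(key=key)
                fd, path = tempfile.mkstemp(suffix=".run", text=True)
                with os.fdopen(fd, "w") as run:
                    run.writelines(lines)
                run_paths.append(path)
        # Pass 2: k-way merge of the sorted runs; heapq.merge is lazy,
        # so only about one line per run is held in memory at a time.
        runs = [open(p) for p in run_paths]
        try:
            with open(dst, "w") as out:
                out.writelines(heapq.merge(*runs, key=key))
        finally:
            for r in runs:
                r.close()
    finally:
        for p in run_paths:
            os.remove(p)
```

Calling external_sort("file.txt", "file_sorted.txt") would produce a sorted copy of the file, which the chunked reading loop above could then load in order. Every run file is open at once during the merge, so lines_per_chunk should be large enough to keep the number of runs below the operating system's open-file limit.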
