Dec-23-2018, 08:08 AM
I have a dataframe (over 1m rows) where one of the columns contains a different sentence in each row.
I would like to create a 2-column array where the first column contains every word that appears in any sentence and the second column is a count of the number of times it appears in total.
I've written the following functions (pulling the relevant column in as 'data'), which do work but are very slow if I take in more than about 100,000 rows. Is there a more efficient way to do what I want?
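import numpy as np

def TextToArray(Q):
    # Split a string into a list of whitespace-separated words
    return Q.split()

def CreateWordList(data):
    m = len(data)
    temp_string = ''
    # Build one long string from every sentence (this loop is the slow part)
    for i in range(m):
        temp_string = temp_string + ' ' + data[i]
    words = TextToArray(temp_string)
    words, counts = np.unique(words, return_counts=True)
    # Stack into a 2-column array: word, count (counts stored as strings)
    result = np.column_stack((words, counts.astype(str)))
    return result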
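
One possible alternative, offered only as a rough sketch: count words with collections.Counter from the standard library instead of building one huge string and calling np.unique. The function name count_words is made up for illustration, and the sketch assumes 'data' is any iterable of strings (a pandas Series or a plain list should both do). I have not timed it against the code above on 1m rows.

```python
from collections import Counter

def count_words(sentences):
    # Hypothetical helper: tally whitespace-separated words across all sentences
    counter = Counter()
    for sentence in sentences:
        # update() adds one to the tally for each word in this sentence
        counter.update(sentence.split())
    # List of (word, count) pairs, highest counts first
    return counter.most_common()

pairs = count_words(["the cat sat", "the dog sat down"])
```

This avoids the repeated string concatenation in the loop, which copies the growing string on every pass, and it never sorts the full word list the way np.unique does. If a 2-column array is still needed, np.array(pairs) should give one, with the counts stored as strings.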
