Creating matrix counting words in list of strings

jazmad · Dec-23-2018, 08:08 AM

I have a dataframe (over 1m rows) where one of the columns contains a different sentence in each row.

I would like to create a 2-column array where the first column contains every word that appears in any sentence and the second column is a count of the number of times it appears in total.

I've written the following functions (pulling the relevant column in as 'data'), which do work but are very slow if I take in more than about 100,000 rows. Is there a more efficient way to do what I want?

def TextToArray(Q):
    return Q.split()

def CreateWordList(data):
    
    m=len(data)
    
    import numpy as np
    words = np.empty(shape=0,dtype=str)
    temp_string = ''
    for i in range(m):
        temp_string = temp_string + ' ' + data[i]
    words=TextToArray(temp_string)

    words, counts = np.unique(words,return_counts=True)
    
    result = np.column_stack((words, counts.astype(str)))

    return words

stullis · Dec-23-2018, 01:01 PM

First, you have some extra stuff going on that doesn't actually do anything: the words = np.empty(...) line, for instance, creates an empty container that is overwritten before use. The lines that build temp_string in a loop and then call TextToArray() can be replaced by a single line using str.join(). Here's a rewritten version:

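import numpy as np

def CreateWordList(data):
    words, counts = np.unique(' '.join(data).split(), return_counts=True)
    # Pair each word with its count; a NumPy array holds one dtype, so counts become strings
    result = np.column_stack((words, counts.astype(str)))

    return words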

The function returns words, but I believe it should be returning result. If that's correct, then we can do this:

import numpy as np

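def CreateWordList(data):
    words, counts = np.unique(' '.join(data).split(), return_counts=True)
    # Pair each word with its count; a NumPy array holds one dtype, so counts become strings
    return np.column_stack((words, counts.astype(str)))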
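
To see what the NumPy route produces, here is a minimal sketch on a toy list of sentences. The function name count_words_numpy and the sample data are made up for illustration, and the snippet is written to run on its own. It assumes the words and counts are paired with np.column_stack, with the counts cast to strings so both columns share one dtype.

```python
import numpy as np

def count_words_numpy(sentences):
    # Join all sentences into one string, split on whitespace,
    # then let np.unique return the sorted unique words and their counts.
    words, counts = np.unique(' '.join(sentences).split(), return_counts=True)
    # A NumPy array holds a single dtype, so the counts are cast to strings.
    return np.column_stack((words, counts.astype(str)))

# Hypothetical sample data, standing in for the dataframe column.
sample = ["the cat sat", "the dog sat down", "a cat"]
table = count_words_numpy(sample)
print(table)
```

Each row of table is a [word, count] pair, sorted alphabetically by word. Because the counts end up as strings, it may be simpler to keep words and counts as two separate arrays if you need to do arithmetic on the counts.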

Now, this approach will still struggle at scale, because it joins 1M+ strings into one huge string and then has to split and sort every word in it. You may get better performance with collections.Counter instead of numpy, since each string can be passed through the Counter on its own and the result is the same:

import collections

def create_word_list(data):
    count = collections.Counter()
    for words in data:
        count.update(words.split())

    return count
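
A Counter is a dict of word to count, so to get the two-column layout asked for in the question you still need one conversion step. A minimal sketch, assuming a plain list of sentences as input (the name word_table and the sample data are made up for illustration):

```python
import collections

def word_table(sentences):
    # Count words one sentence at a time, without building one giant string.
    counter = collections.Counter()
    for sentence in sentences:
        counter.update(sentence.split())
    # Turn the Counter into (word, count) rows: highest count first,
    # ties broken alphabetically so the order is deterministic.
    return sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical sample data, standing in for the dataframe column.
sample = ["the cat sat", "the dog sat down", "a cat"]
rows = word_table(sample)
for word, n in rows:
    print(word, n)
```

The rows keep the counts as integers, which is one reason a list of tuples (or a two-column dataframe built from it) can be more convenient here than a single NumPy array.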

You could also write the script to employ multithreading and update a master counter with each return from create_word_list(). That master count would need a lock added to it for thread safety.
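
Here is a rough sketch of that idea using concurrent.futures. Note the swap: instead of having threads update a shared counter under a lock, each worker returns its own Counter and the merge happens in the calling thread, so no lock is needed. The names count_chunk and count_words_parallel, the n_chunks parameter and the sample data are all assumptions for illustration, and data is assumed to be a plain list of strings.

```python
import collections
from concurrent.futures import ThreadPoolExecutor

def count_chunk(sentences):
    # Count the words in one chunk of sentences.
    counter = collections.Counter()
    for sentence in sentences:
        counter.update(sentence.split())
    return counter

def count_words_parallel(data, n_chunks=4):
    # Split the data into roughly equal slices (n_chunks is an arbitrary choice).
    size = max(1, -(-len(data) // n_chunks))  # ceiling division
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    master = collections.Counter()
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        # Results are merged here, in the calling thread only.
        for partial in pool.map(count_chunk, chunks):
            master.update(partial)
    return master

# Hypothetical sample data, standing in for the dataframe column.
sample = ["the cat sat", "the dog sat down", "a cat"] * 1000
totals = count_words_parallel(sample)
print(totals.most_common(3))
```

One caveat: in standard CPython, threads generally will not speed up CPU-bound work like this because of the global interpreter lock. ProcessPoolExecutor offers the same map interface if you want to try real parallelism, though the cost of sending the chunks to worker processes may eat into the gain, so it is worth timing both against the plain loop.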

jazmad · Dec-23-2018, 05:47 PM

Thank you.

Yes, I did mean to return result. To be honest, this was an edit of the actual code for the purposes of this question and I just overlooked changing this.

Merry Christmas!
