Aug-01-2020, 08:54 PM
(This post was last modified: Aug-01-2020, 08:54 PM by ateestructural.)
I have the following code:
import nltk
nltk.download('stopwords')
import nltk.corpus
import re
import string
from load_file_with_function import load_doc

# turn a doc into clean tokens
def clean_doc(doc):
    # split the tokens by white space
    tokens = doc.split()
    # prepare regex for char filtering
    re_punc = re.compile('[%s]' % re.escape(string.punctuation))
    # remove punctuation from each word
    tokens = [re_punc.sub('', w) for w in tokens]
    # remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # filter out stop-words
    stop_words = set(nltk.corpus.stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]
    # filter out short tokens
    tokens = [word for word in tokens if len(word) > 1]
    print(tokens)

The code works, but it is someone else's and I have to build on it.

I'm unable to understand how the statement below filters non-alphabetic tokens out of my list of words (tokens):

    tokens = [word for word in tokens if word.isalpha()]

I know about the string method isalpha(), but I do not follow how the "new" tokens list gets rid of the non-alphabetic entries in a single statement like this. Can anyone please explain?
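For reference, here is how I would expect that statement to read if written out as an ordinary loop. This is only a minimal sketch: the sample list is made up for illustration and does not come from the real document, and the names filtered and filtered_loop are my own.

```python
# Hypothetical sample tokens (made up for illustration, not from the real doc)
tokens = ["hello", "world42", "", "2020", "Python", "its"]

# The list comprehension from the post: build a new list that keeps
# only the words for which word.isalpha() returns True
filtered = [word for word in tokens if word.isalpha()]

# The same logic as an explicit loop
filtered_loop = []
for word in tokens:
    # isalpha() is True only for non-empty strings made entirely of letters
    if word.isalpha():
        filtered_loop.append(word)

print(filtered)       # ['hello', 'Python', 'its']
print(filtered_loop)  # ['hello', 'Python', 'its']
```

If that reading is right, nothing is removed from the old list. A new list is built from the words that pass the test, and the name tokens is then rebound to that new list.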
