Dec-24-2017, 04:46 PM
I am trying to tokenize the words of a Word document, Doc.docx, which contains the sentence "This is a doc file." Unfortunately, each token is printed with the letter 'u' in front of it.

from nltk.tokenize import word_tokenize
import docx
def getText(filename):
    # Read every paragraph of the .docx file and join them into one string.
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)

Text = getText('Doc.docx')
words = word_tokenize(Text)
print(words)

Output: [u'This', u'is', u'a', u'doc', u'file']
Expected output: ['This', 'is', 'a', 'doc', 'file']
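
For reference, here is a minimal sketch of what the prefix seems to be, assuming the output above comes from Python 2. There, printing a list shows the repr() of each element, and the repr of a unicode string starts with u, so the letter is not part of the token text. The list literal below is a hypothetical stand-in for the word_tokenize() result, not the actual data read from Doc.docx.

```python
# Hypothetical stand-in for the list returned by word_tokenize().
words = [u'This', u'is', u'a', u'doc', u'file']

# Printing the list shows each element's repr(); on Python 2 that repr
# carries the u prefix for unicode strings. Printing the strings
# themselves shows only their text.
joined = ' '.join(words)
print(joined)

for word in words:
    print(word)
```

If this assumption holds, the tokens already compare equal to the plain strings in the expected output, and only the way the list is displayed differs.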
