Trouble importing text from a .docx file

atinesh922 · Dec-24-2017, 04:46 PM

I am trying to tokenize the words of a Word document, Doc.docx, which contains the sentence "This is a doc file". Unfortunately, each token is prefixed with the letter 'u'.

from nltk.tokenize import word_tokenize
import docx

def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)

Text = getText('Doc.docx')
words = word_tokenize(Text)
print(words)

Output : [u'This', u'is', u'a', u'doc', u'file']

Expected Output : ['This', 'is', 'a', 'doc', 'file']

hshivaraj · (This post was last modified: Dec-24-2017, 11:56 PM by hshivaraj.)

"But unfortunately, each token is getting prefixed with a letter 'u'"

The prefix just indicates that each token is a Unicode string. Try this to get rid of it:

Text = getText('Doc.docx')
words = word_tokenize(Text)
words = map(str, words)
print(words)

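In Python 3 every string is Unicode, so you won't see this prefix (in fact, it's not even an issue). Use Python 3, or the trick above if you are on Python 2. Note that on Python 2, str() raises UnicodeEncodeError for tokens containing non-ASCII characters.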
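For reference, here is a minimal sketch of how the same tokens behave on Python 3 (3.3 or later, where the u'' literal prefix is accepted but has no effect). The hard-coded list is only a stand-in for the result of word_tokenize(Text), so the snippet runs without nltk or python-docx installed.

```python
# Assumes Python 3.3+; the list below stands in for word_tokenize(Text).
tokens = [u'This', u'is', u'a', u'doc', u'file']

# Every token is a plain str, and a u'' literal equals the unprefixed one.
all_str = all(type(tok) is str for tok in tokens)
same_value = (u'This' == 'This')

# map() is lazy in Python 3, so wrap it in list() before printing.
words = list(map(str, tokens))
print(words)
```

On Python 3 the map(str, ...) step is redundant, since the tokens are already str; without the list() call, print would show a map object rather than the words.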

**nilamo** · Jan-03-2018, 05:29 PM

That shouldn't be an issue. When you actually do something with those tokens/words, the 'u' won't be there anyway.

***snippsat*** · (This post was last modified: Jan-03-2018, 06:35 PM by snippsat.)

As mentioned, you do not need to do anything with those Unicode strings.
All string methods work the same on Unicode strings, and when you use print the u won't be there.

>>> lst = [u'This', u'is', u'a', u'doc', u'file']
>>> for item in lst:
...     print(item)
...     
This
is
a
doc
file

>>> lst[0].upper()
u'THIS'
>>> # print and u is gone
>>> print(lst[0].upper())
THIS

You should be using Python 3; as mentioned, all strings are Unicode there.
Unicode was one of the biggest changes in Python 3.
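The reason the prefix shows up at all is that printing a list displays the repr() of each element, while printing a single string displays only its text. A small sketch of that difference, using the same example list (the variable names are just for illustration):

```python
tokens = [u'This', u'is', u'a', u'doc', u'file']

# Printing a list uses repr() of each element: quotes, plus a u prefix on Python 2.
list_view = repr(tokens)

# Joining (or printing) the strings themselves uses only their text.
text_view = ' '.join(tokens)

print(list_view)
print(text_view)  # This is a doc file
```

So if the goal is just readable output, joining or looping over the tokens is enough on either Python version; no conversion is needed.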
