Feb-10-2022, 03:54 PM
Hi,
I am trying to create an inverted index, but I can't seem to get it working.
So far I have read in XML files that contain ID, DESC and TEXT, and have done pre-processing on them, e.g. removing stop words.
See my code so far below.
First, my pre-processing:
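
import os
import xml.etree.ElementTree as ET
from collections import Counter
from nltk.corpus import wordnet

def preprocess(document):
    document = document.lower()  # Lowercase
    words = tokenizer.tokenize(document)  # Tokenize
    words = [w for w in words if w not in stop_words]  # Remove stop words
    # Lemmatize every word as a noun, verb, adjective and adverb in turn
    for pos in [wordnet.NOUN, wordnet.VERB, wordnet.ADJ, wordnet.ADV]:
        words = [wordnet_lemmatizer.lemmatize(x, pos) for x in words]
    return Counter(words)

Then I read in the files and pre-process them:

path = 'C:/my_files/'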
files = os.listdir(path)
print(len(files))
collection = {}
for file in files:
    file_path = path + file
    tree = ET.parse(file_path)
    root = tree.getroot()
    doc_id = root.find('DOCID').text
    header = root.find('HEADLINE').text
    text = root.find('TEXT').text
    if header is None:
        header = ''
    if text is not None:
        # If there is text, concatenate the header and the text
        final_text = header + text
    else:
        # Otherwise, just take the header
        final_text = header
    collection[doc_id] = preprocess(final_text)

Then below is my attempt at creating an inverted index.

def inverted_index(data):
    all_words = collection
    index = {}
    for word in all_words:
        for doc, tokens in data.items():
            if word in tokens:
                if word in index.keys():
                    index[word].append(doc)
                else:
                    index[word] = [doc]
    return index

Now I believe that I need the data to come out something like below:

Inverted_Index = {'cat': ['doc_1', 'doc_5'], 'cow': ['doc_4', 'doc_20']}
Any help would be great,
Thanks
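
One possible way to get that output shape, as a minimal sketch rather than a definitive fix: the attempt above sets all_words = collection, so the outer loop walks over document IDs instead of words, and "word in tokens" then almost never matches. The sketch below assumes collection maps each document ID to a Counter of tokens, as produced by preprocess(). The function name build_inverted_index and the sample data are hypothetical and only there for illustration.

```python
from collections import Counter, defaultdict


def build_inverted_index(collection):
    """Map each token to the sorted list of document IDs that contain it."""
    index = defaultdict(list)
    for doc_id, tokens in collection.items():
        # Iterating over a Counter yields each distinct token once,
        # so a document is never appended twice for the same word.
        for word in tokens:
            index[word].append(doc_id)
    # Return a plain dict with the document IDs in a stable (sorted) order.
    return {word: sorted(doc_ids) for word, doc_ids in index.items()}


# Hypothetical sample data standing in for the real preprocessed collection.
sample = {
    'doc_1': Counter(['cat', 'cat', 'dog']),
    'doc_4': Counter(['cow']),
    'doc_5': Counter(['cat', 'cow']),
}

inverted = build_inverted_index(sample)
print(inverted['cat'])  # ['doc_1', 'doc_5']
print(inverted['cow'])  # ['doc_4', 'doc_5']
```

Looping over the documents once and over each document's own tokens avoids having to build a separate vocabulary first. Since the Counter values are already term frequencies, the same loop could store (doc_id, count) pairs instead of bare IDs if the counts are needed later.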
