TidyWord

What is TidyWord?

TidyWord is free software, written in C, that cleans large text files so they can be used for further word processing and information retrieval. It is fast and efficient and can process 70 million words in seconds. The cleaned output file can be passed to a stemmer or used for tasks such as word frequency counting, document distance, similarity and clustering, and tagging.

Usage

Removes special characters and insignificant words.
Removes stop words (a custom list may be supplied).
Uses a base dictionary to cluster most words into roughly 25,000 common English words (uses the 6of12 dictionary).
The output file can be fed to Porter's stemming algorithm. This reduces the size of the word cloud (i.e. the number of distinct words), since almost 95% of the words are reduced to the base dictionary.
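
The first two steps above can be illustrated with a minimal sketch. This is not the code in tidyword.c; the function names (normalize_word, is_stop_word, clean_text), the 64-byte word buffer and the linear stop-word search are all assumptions made for illustration, and the base-dictionary clustering step is left out.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: keep only ASCII letters in `word`, lowercased,
   rewriting the string in place. Returns the new length. */
size_t normalize_word(char *word)
{
    size_t out = 0;
    for (size_t i = 0; word[i] != '\0'; i++) {
        unsigned char c = (unsigned char)word[i];
        if (isalpha(c))
            word[out++] = (char)tolower(c);
    }
    word[out] = '\0';
    return out;
}

/* Hypothetical helper: linear search of a small stop-word list. */
int is_stop_word(const char *word, const char *const *stops, size_t n_stops)
{
    for (size_t i = 0; i < n_stops; i++)
        if (strcmp(word, stops[i]) == 0)
            return 1;
    return 0;
}

/* Copy the cleaned words of `in` into `out` (capacity `out_size`),
   separated by single spaces. Returns the number of words kept. */
size_t clean_text(const char *in, char *out, size_t out_size,
                  const char *const *stops, size_t n_stops)
{
    char word[64];              /* assumed maximum word length */
    size_t kept = 0, used = 0;

    if (out_size == 0)
        return 0;
    out[0] = '\0';

    while (*in != '\0') {
        /* Skip whitespace between tokens. */
        while (*in != '\0' && isspace((unsigned char)*in))
            in++;

        /* Copy one token, truncating anything longer than the buffer. */
        size_t len = 0;
        while (*in != '\0' && !isspace((unsigned char)*in)) {
            if (len + 1 < sizeof word)
                word[len++] = *in;
            in++;
        }
        word[len] = '\0';

        /* Strip special characters, then drop empty and stop words. */
        len = normalize_word(word);
        if (len == 0 || is_stop_word(word, stops, n_stops))
            continue;

        /* Stop when the separator, word and terminator no longer fit. */
        size_t need = len + (kept > 0 ? 1 : 0);
        if (used + need + 1 > out_size)
            break;
        if (kept > 0)
            out[used++] = ' ';
        memcpy(out + used, word, len);
        used += len;
        out[used] = '\0';
        kept++;
    }
    return kept;
}
```

With the stop list {"the", "a"}, calling clean_text on "The quick, brown fox's 42 jumps!" keeps four words and writes "quick brown foxs jumps". A linear search is fine for a short list; a sorted array with bsearch or a hash table would suit a longer stop list better.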

What is New?

Porter's Stemmer in ANSI C (porter.c)
Document similarity code that computes Euclidean distance and the cosine and Jaccard similarity coefficients (docsim.c)

To compile and run TidyWord using gcc:

gcc -std=c99 -o tidyword tidyword.c
./tidyword input.txt stop.txt base.txt > output.txt

To directly calculate Document Similarity:

./tidyword.sh inputfile1.txt inputfile2.txt

For suggestions and bug reports: zafarullahmahmmod@gmail.com
