big-data

Naive Bayes Classifier for Text Documents

Built a Naive Bayes classifier to categorize a pool of approximately 20,000 newsgroup documents.
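
As an illustration of the technique (not the project's actual code), here is a minimal sketch of a multinomial Naive Bayes text classifier. It assumes documents are already tokenized into word lists and uses add-one (Laplace) smoothing; the function names `train_naive_bayes` and `predict`, the `alpha` parameter, and the toy data are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels, alpha=1.0):
    """Fit a multinomial Naive Bayes model.

    docs   : list of token lists (assumed already preprocessed)
    labels : parallel list of class names
    alpha  : additive smoothing constant (assumption: Laplace smoothing)
    """
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)  # class -> word -> count
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)

    n_docs = len(docs)
    return {
        "alpha": alpha,
        "vocab": vocab,
        "word_counts": word_counts,
        # log P(class), estimated from class frequencies
        "log_prior": {c: math.log(n / n_docs) for c, n in class_counts.items()},
        # total number of word occurrences seen in each class
        "total": {c: sum(word_counts[c].values()) for c in class_counts},
    }


def predict(model, tokens):
    """Return the class with the highest log-posterior for a token list."""
    alpha = model["alpha"]
    vocab_size = len(model["vocab"])
    best_label, best_score = None, float("-inf")
    for label, log_prior in model["log_prior"].items():
        denom = model["total"][label] + alpha * vocab_size
        score = log_prior
        for tok in tokens:
            if tok not in model["vocab"]:
                continue  # words never seen in training are ignored
            count = model["word_counts"][label][tok]
            score += math.log((count + alpha) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy data, purely illustrative
toy_docs = [
    ["hockey", "game", "team"],
    ["space", "orbit", "nasa"],
    ["team", "score", "game"],
    ["nasa", "launch", "orbit"],
]
toy_labels = ["sport", "space", "sport", "space"]
toy_model = train_naive_bayes(toy_docs, toy_labels)
```

Summing log-probabilities instead of multiplying raw probabilities avoids floating-point underflow on long documents, which matters once the vocabulary grows to newsgroup scale.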

Nov 5, 2017

Near-Duplicate Detection Using Simhash

Given millions of documents, the task is to rank, for each file, its most similar documents. I preprocessed the documents, built weighted word vectors, and implemented Simhash (a locality-sensitive hashing algorithm that approximates cosine similarity) to generate a 64-bit fingerprint for each document. Finally, I implemented a block-permuted Hamming search over the fingerprint space to find near-duplicates.
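
The pipeline above can be sketched roughly as follows. This is an illustrative simplification, not the project's code: it assumes MD5 as the per-token hash, takes the weighted word vector as a plain dict, and replaces the sorted, permuted tables of a full block-permuted search with hash tables keyed by block. All function names and the defaults `k=3` and `n_blocks=4` are hypothetical.

```python
import hashlib
from collections import defaultdict

BITS = 64


def token_hash(token):
    """64-bit token hash: first 8 bytes of MD5 (an assumed choice)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash(weights):
    """Compute a 64-bit Simhash fingerprint from a dict of token -> weight."""
    acc = [0.0] * BITS
    for token, w in weights.items():
        h = token_hash(token)
        for i in range(BITS):
            # add the weight where the hash bit is 1, subtract where it is 0
            acc[i] += w if (h >> i) & 1 else -w
    fp = 0
    for i in range(BITS):
        if acc[i] > 0:
            fp |= 1 << i
    return fp


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def blocks(fp, n_blocks=4):
    """Split a fingerprint into (block_index, block_value) keys."""
    width = BITS // n_blocks
    mask = (1 << width) - 1
    return [(i, (fp >> (i * width)) & mask) for i in range(n_blocks)]


def build_index(fingerprints, n_blocks=4):
    """Map each block key to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, fp in fingerprints.items():
        for key in blocks(fp, n_blocks):
            index[key].add(doc_id)
    return index


def near_duplicates(query_id, fingerprints, index, k=3, n_blocks=4):
    """Return sorted (distance, doc_id) pairs within Hamming distance k.

    Requires n_blocks > k: if two fingerprints differ in at most k bits,
    at least one of the n_blocks blocks must match exactly (pigeonhole),
    so only documents sharing a block need a full comparison.
    """
    fp = fingerprints[query_id]
    candidates = set()
    for key in blocks(fp, n_blocks):
        candidates |= index.get(key, set())
    candidates.discard(query_id)
    scored = ((hamming(fp, fingerprints[c]), c) for c in candidates)
    return sorted((d, c) for d, c in scored if d <= k)
```

A call would look like `near_duplicates("doc1", fps, build_index(fps))`, where `fps` maps document ids to their `simhash` fingerprints. The block lookup prunes most of the collection, so the full 64-bit comparison runs only on a small candidate set rather than on every pair.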

Oct 10, 2017