Given millions of documents, the task is to rank, for each document, the documents most similar to it. I preprocessed the documents, built weighted word vectors, and implemented Simhash (a locality-sensitive hashing algorithm that approximates cosine similarity) to generate a 64-bit fingerprint for each document. Finally, I implemented a block-permuted Hamming search over the fingerprint space to find near-duplicates.
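
The pipeline can be illustrated with a minimal sketch. This is not the project's actual code: the function names (`token_hash`, `simhash`, `build_index`, `rank_similar`), the use of MD5 as the per-token hash, the choice of 4 blocks of 16 bits, and the distance threshold `k = 3` are all assumptions made for illustration. The search shown is a simplified form of the block-based idea: if two 64-bit fingerprints differ in at most `k` bits and the fingerprint is split into more than `k` blocks, at least one block must match exactly, so only documents sharing a block need a full Hamming comparison.

```python
import hashlib
from collections import defaultdict

BITS = 64  # fingerprint width


def token_hash(token: str) -> int:
    # Assumption: first 8 bytes of MD5 as a stable 64-bit token hash.
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash(weights: dict) -> int:
    # weights maps token -> weight (e.g. TF-IDF); returns a 64-bit fingerprint.
    acc = [0.0] * BITS
    for token, w in weights.items():
        h = token_hash(token)
        for i in range(BITS):
            # Add the weight where the hash bit is 1, subtract it where it is 0.
            acc[i] += w if (h >> i) & 1 else -w
    fp = 0
    for i, v in enumerate(acc):
        if v > 0:
            fp |= 1 << i
    return fp


def hamming(a: int, b: int) -> int:
    # Number of differing bits between two fingerprints.
    return bin(a ^ b).count("1")


def blocks(fp: int, n_blocks: int = 4):
    # Split the fingerprint into (block position, block value) keys.
    width = BITS // n_blocks
    mask = (1 << width) - 1
    return [(i, (fp >> (i * width)) & mask) for i in range(n_blocks)]


def build_index(fps: dict, n_blocks: int = 4):
    # fps maps doc_id -> fingerprint; index maps block key -> set of doc_ids.
    index = defaultdict(set)
    for doc_id, fp in fps.items():
        for key in blocks(fp, n_blocks):
            index[key].add(doc_id)
    return index


def rank_similar(doc_id, fps, index, k: int = 3, n_blocks: int = 4):
    # Requires n_blocks > k so that the pigeonhole argument holds.
    fp = fps[doc_id]
    candidates = set()
    for key in blocks(fp, n_blocks):
        candidates |= index.get(key, set())
    candidates.discard(doc_id)
    # Verify candidates with the exact Hamming distance, closest first.
    scored = ((hamming(fp, fps[c]), c) for c in candidates)
    return sorted((d, c) for d, c in scored if d <= k)
```

A typical use would compute `fps = {doc_id: simhash(weights)}` for every document, call `build_index(fps)` once, and then call `rank_similar(doc_id, fps, index)` per document. At the scale of millions of documents the index would normally live in sorted tables or an external store rather than an in-memory dictionary; that choice is outside the scope of this sketch.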