15 October 2004

Automatic Document Classification

I've been having trouble organizing and finding almost anything in my 'Docs' directory, and I've been looking for something to do in Python ever since I started learning it. Automatic document classification using naive Bayesian classification seems like the ideal project. A tool to solve this problem will need to read TXT, PDF, PS, HTML (single and multiple pages), CHM, MS Office, OpenOffice, and MS Reader files, and probably a few others. Being able to search the web (and local drives) for related documents would also be cool; Google's "related" search is a little less than perfect.
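
To make the idea concrete, here is a rough sketch of what the classification core might look like once the text has already been extracted from a file. It is only an illustration, not a finished design: the `NaiveBayes` class, the `tokenize` helper, and the category names are all made up for this example, and it assumes a multinomial model with add-one smoothing, which is just one common way to do naive Bayes on word counts.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and return its alphabetic words (hypothetical helper)."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing.

    Illustrative sketch only; names and structure are assumptions.
    """

    def __init__(self):
        self.doc_counts = Counter()              # documents seen per category
        self.word_counts = defaultdict(Counter)  # word frequencies per category
        self.vocabulary = set()                  # every word seen in training

    def train(self, text, category):
        words = tokenize(text)
        self.doc_counts[category] += 1
        self.word_counts[category].update(words)
        self.vocabulary.update(words)

    def classify(self, text):
        """Return the most probable category, or None if nothing was trained."""
        words = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_category, best_score = None, float("-inf")
        for category, n_docs in self.doc_counts.items():
            counts = self.word_counts[category]
            denom = sum(counts.values()) + len(self.vocabulary)
            # Work in log space: log prior plus the log likelihood of each word.
            score = math.log(n_docs / total_docs)
            for word in words:
                score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_category, best_score = category, score
        return best_category


# Toy training data, invented for this example.
classifier = NaiveBayes()
classifier.train("python scripts modules and interpreter notes", "programming")
classifier.train("invoice payment bank statement tax", "finance")
print(classifier.classify("notes on the python interpreter"))
```

Summing logarithms instead of multiplying probabilities keeps long documents from underflowing to zero, and the add-one smoothing stops a single unseen word from ruling out a category entirely. Reading the various file formats would sit in front of this as a separate step.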