Thursday, March 5, 2015

Are there any good techniques or libraries for extracting English-like text from unknown file formats?


My goal is to extract English-like text from files in an unknown format. The files consist of seemingly random data (mixed alphanumerics and other characters) that contains blocks of English text. I want to extract as much of that text as possible and pass it to Lucene.Net for indexing.


Is there a library, or even a Lucene.Net Analyzer, that is suited to this task?
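
To make the goal concrete, here is a rough baseline sketch in Python, similar in spirit to the Unix `strings` tool: find runs of printable ASCII bytes, then keep only the runs that look like prose. The function names, the minimum run length, and the 0.8 letter ratio are all assumptions made up for illustration, not part of any library, and the heuristic would need tuning against real files.

```python
import re


def printable_runs(data, min_len=6):
    """Return runs of printable ASCII (space through tilde) of at least min_len bytes."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(data)]


def looks_english(run, min_ratio=0.8):
    """Crude heuristic (an assumption): mostly letters and spaces, and at least two words."""
    letters_and_spaces = sum(c.isalpha() or c == " " for c in run)
    words = re.findall(r"[A-Za-z]{2,}", run)
    return letters_and_spaces / len(run) >= min_ratio and len(words) >= 2


def extract_text(data, min_len=6, min_ratio=0.8):
    """Return the English-like runs found in a blob of bytes, in file order."""
    return [
        run.strip()
        for run in printable_runs(data, min_len)
        if looks_english(run, min_ratio)
    ]


if __name__ == "__main__":
    sample = (
        b"\x00\x01\xff" + b"The quick brown fox jumps."
        + b"\x00\x9c" + b"x9$k#2!q@7"
        + b"\x03\x04" + b"Hello world" + b"\xfe"
    )
    for block in extract_text(sample):
        print(block)
```

Each extracted block could then be handed to the indexer as ordinary text. This only handles ASCII; text stored as UTF-16 or in a compressed container would need a different first pass.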




