My goal is to extract the English-like text from files in an unknown format. The files consist of seemingly random data (mixed alphanumerics and other characters) that contains blocks of English text. I want to extract as much of that text as possible and pass it to Lucene.Net for indexing.
Is there a library, or even a Lucene.Net Analyzer, that is suited to this task?
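To make the idea concrete, here is a rough sketch of the kind of extraction I have in mind, written in Python only for illustration. It works like the Unix `strings` tool (pull out runs of printable ASCII) followed by a crude "does this look like English?" filter. The function names, the thresholds (`min_words`, `min_ratio`, a minimum run of 6 characters) and the vowel-based word test are all my own assumptions, not anything from Lucene.Net, and it assumes the embedded text is plain ASCII.

```python
import re

# Runs of at least 6 printable ASCII bytes (space through tilde) or tabs.
# The minimum run length is an arbitrary assumption; tune it for your files.
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e\t]{6,}")

# A "word" here is simply two or more consecutive ASCII letters.
WORD = re.compile(r"[A-Za-z]{2,}")
VOWELS = set("aeiouAEIOU")


def looks_english(text, min_words=3, min_ratio=0.7):
    """Crude heuristic: enough words, most of them containing a vowel,
    and most characters being letters or spaces."""
    words = WORD.findall(text)
    if len(words) < min_words:
        return False
    wordlike = [w for w in words if any(c in VOWELS for c in w)]
    letters_and_spaces = sum(c.isalpha() or c == " " for c in text)
    return (len(wordlike) / len(words) >= min_ratio
            and letters_and_spaces / len(text) >= min_ratio)


def extract_text_blocks(data):
    """Return the printable runs in `data` (bytes) that pass the heuristic."""
    blocks = []
    for match in PRINTABLE_RUN.finditer(data):
        candidate = match.group().decode("ascii").strip()
        if candidate and looks_english(candidate):
            blocks.append(candidate)
    return blocks


if __name__ == "__main__":
    sample = (b"\x00\x01x9$Q\xff"
              b"The quick brown fox jumps over the lazy dog."
              b"\x02\x03zq9#kk2@@x7!pL\x00")
    for block in extract_text_blocks(sample):
        print(block)
```

Each surviving block could then be added to the index as a field value. A heuristic this simple will miss text in other encodings (UTF-16, for example) and will let some junk through, which is why I am asking whether something more robust already exists.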