Programmers: Are there any good techniques or libraries for extracting English-like text from unknown file formats?

Thursday, 5 March 2015

My goal is to extract the English-like text from files in an unknown format. The files consist of seemingly random data (mixed alphanumerics and other characters) that contains blocks of English text. I want to extract as much of that text as possible and pass it to Lucene.Net for indexing.

Is there a library, or even a Lucene.Net Analyzer, that is suited to the task?
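
To make the problem concrete, here is a rough Python sketch of the kind of approach I have in mind: pull out runs of printable ASCII (similar in spirit to the Unix `strings` tool) and keep only the runs that look like English. The names `extract_text` and `looks_english`, the minimum run length of 6, and the 0.6 ratio are all assumptions made up for illustration, and the vowel test is a crude stand-in for a real dictionary or language model.

```python
import re

# Runs of at least 6 printable ASCII bytes (space through tilde).
# The minimum length is an arbitrary assumption; tune it for your data.
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{6,}")
WORD = re.compile(r"[A-Za-z]+")
VOWELS = set("aeiouAEIOU")


def looks_english(text, min_word_ratio=0.6):
    """Crude heuristic: most of the text is alphabetic words containing a vowel."""
    words = WORD.findall(text)
    # Ignore single letters and very long runs; require at least one vowel.
    plausible = [w for w in words if 2 <= len(w) <= 20 and VOWELS & set(w)]
    covered = sum(len(w) for w in plausible)
    return len(plausible) >= 2 and covered / len(text) >= min_word_ratio


def extract_text(data):
    """Return the printable runs in `data` (bytes) that pass the heuristic."""
    results = []
    for match in PRINTABLE_RUN.finditer(data):
        candidate = match.group().decode("ascii").strip()
        if candidate and looks_english(candidate):
            results.append(candidate)
    return results
```

The strings returned by `extract_text` could then be joined and handed to the indexer as ordinary text. A heuristic like this will miss text in other encodings (for example UTF-16) and may let through some noise, which is why I am asking whether something more robust already exists.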
