A Python library that facilitates the extraction of text sentences from multilingual PDF documents
Data collection is an important part of research and can be time-consuming, so effective tools for this task are valuable.
This software contains Python scripts that facilitate the extraction of sentences from multilingual PDF documents. It lets you configure sentence extraction for a list of documents, specifying each document's language or marking it as unknown in the multilingual case. The scripts then convert the documents to text, filter and prepare the sentences by means of user-defined regular expressions, and classify the sentences by language before writing them to separate files.
Language classification currently supports English, French, German and Italian.
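The processing steps described above (clean with regular expressions, classify by language, write one file per language) can be illustrated with a minimal sketch. It is not this library's API: every name in it (`CLEANUP_RULES`, `guess_language`, `classify`, `write_outputs`, the `sentences_<language>.txt` file names) is hypothetical, and the stopword-counting classifier is a simple stand-in chosen for illustration, not necessarily the method the scripts use. The PDF-to-text conversion step is left out, so the input is assumed to be already extracted text.

```python
import re
from collections import defaultdict
from pathlib import Path

# Tiny illustrative stopword lists (assumption: a real classifier needs far more data).
STOPWORDS = {
    "english": {"the", "and", "is", "of", "to", "in", "that", "it"},
    "french": {"le", "la", "les", "et", "est", "des", "un", "une", "dans"},
    "german": {"der", "die", "das", "und", "ist", "nicht", "ein", "eine", "mit"},
    "italian": {"il", "lo", "gli", "che", "di", "è", "una", "per", "non"},
}

# Hypothetical user-defined cleanup rules, applied in order: (pattern, replacement).
CLEANUP_RULES = [
    (re.compile(r"\[\d+\]"), ""),  # drop citation markers such as [12]
    (re.compile(r"\s+"), " "),     # collapse runs of whitespace
]


def split_sentences(text):
    """Naive split on whitespace that follows '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def clean(sentence):
    """Apply each cleanup rule in turn, then trim the result."""
    for pattern, replacement in CLEANUP_RULES:
        sentence = pattern.sub(replacement, sentence)
    return sentence.strip()


def guess_language(sentence):
    """Return the language whose stopwords occur most often, or 'unknown'."""
    words = re.findall(r"\w+", sentence.lower())
    scores = {lang: sum(w in stops for w in words) for lang, stops in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def classify(text):
    """Group the cleaned sentences of `text` by guessed language."""
    by_language = defaultdict(list)
    for sentence in split_sentences(text):
        sentence = clean(sentence)
        if sentence:
            by_language[guess_language(sentence)].append(sentence)
    return dict(by_language)


def write_outputs(by_language, out_dir):
    """Write one UTF-8 text file per language, one sentence per line."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    for lang, sentences in by_language.items():
        target = out_path / f"sentences_{lang}.txt"
        target.write_text("\n".join(sentences) + "\n", encoding="utf-8")
```

Calling `write_outputs(classify(text), "output")` would then produce files such as `output/sentences_french.txt`. Counting stopwords is easy to follow but unreliable on very short sentences, so a production setup would more likely rely on a trained language identifier.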
Applications of this software include preparing statistical language models for speech recognition, natural language processing, and document indexing.