asrt



A Python library that facilitates the extraction of text sentences from multilingual PDF documents

Data collection is an important part of research and can be time-consuming, so effective tools for this task are valuable.

This software contains Python scripts that facilitate the extraction of sentences from multilingual PDF documents. Sentence extraction is configured from a list of documents, each tagged with its language, or marked as unknown in the multilingual case. The scripts then convert the documents to text, filter and prepare the sentences by means of user-defined regular expressions, and classify the sentences by language before outputting them to separate files.

Currently English, French, German and Italian classification is supported.
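The filter, classify and output steps described above can be illustrated with a minimal sketch. It is not the library's actual code or API: the function names, the cleanup rules, the stopword lists and the output file names are all assumptions made for illustration, and the PDF-to-text conversion step is left out. Language detection is done here with a naive stopword count, which is only a stand-in for whatever classifier the software really uses.

```python
import re
from collections import defaultdict
from pathlib import Path

# Hypothetical stopword lists (assumption): a naive stand-in for a real
# language classifier, covering the four supported languages.
STOPWORDS = {
    "english": {"the", "and", "is", "of", "to", "in", "that", "it"},
    "french": {"le", "la", "les", "et", "est", "des", "une", "dans"},
    "german": {"der", "die", "das", "und", "ist", "nicht", "ein", "mit"},
    "italian": {"il", "che", "di", "è", "per", "una", "sono", "non"},
}

# Hypothetical user-defined rules (assumption): (pattern, replacement) pairs
# applied in order to each sentence.
CLEANUP_RULES = [
    (re.compile(r"\[\d+\]"), ""),   # drop citation markers such as [12]
    (re.compile(r"\s+"), " "),      # collapse runs of whitespace
]


def split_sentences(text):
    """Split text after sentence-ending punctuation followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def clean(sentence):
    """Apply every cleanup rule in order and trim the result."""
    for pattern, replacement in CLEANUP_RULES:
        sentence = pattern.sub(replacement, sentence)
    return sentence.strip()


def classify(sentence):
    """Return the language with the most stopword hits, or 'unknown'."""
    words = re.findall(r"\w+", sentence.lower())
    scores = {
        lang: sum(word in stopwords for word in words)
        for lang, stopwords in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def group_by_language(text):
    """Clean each sentence and group the sentences by detected language."""
    groups = defaultdict(list)
    for sentence in split_sentences(text):
        cleaned = clean(sentence)
        if cleaned:
            groups[classify(cleaned)].append(cleaned)
    return dict(groups)


def write_groups(groups, out_dir):
    """Write one UTF-8 text file per language, one sentence per line."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for lang, sentences in groups.items():
        path = out / f"sentences_{lang}.txt"
        path.write_text("\n".join(sentences) + "\n", encoding="utf-8")
```

Calling group_by_language on text already extracted from a document returns a dictionary mapping each detected language to its sentences, and write_groups then saves each list to its own file. A stopword count is fragile on short sentences, so a statistical classifier would be the more realistic choice in practice.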

Applications of this software can be found in speech recognition, where it helps prepare statistical language models, as well as in natural language processing and document indexing.

Resource Information
Resource type: software
Date: May 12, 2015
Size: 3.9M
Audience: Research
Ownership: Idiap Research Institute
Distribution: Web
Contact: Alexandre NANCHEN
+41 277 217 791