Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text (via OCR if necessary), page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages...).
Docsplit is currently at version 0.7.6.
Docsplit is an open-source component of DocumentCloud.
[aptitude | port | brew] install graphicsmagick
Note: the gem will take a minute to download — the JODConverter jar file tips the scales at 2MB.
The Docsplit gem includes both the docsplit command-line utility
and a Ruby API. The available commands and options are identical in both.
--output or -o can be passed to any command in order to
store the generated files in a directory of your choosing.
images --size --format --pages --density
Ruby: extract_images
Generates an image for each page in the document at the specified resolution
and format. Pass --pages or -p to choose the specific pages to
image. Passing
--size or -s will specify the desired
image resolution, --density or -d will specify the DPI to rasterize the images
at during conversion by GraphicsMagick, and --format or -f
will select the format of the final images.
docsplit images example.pdf
docsplit images docs/*.pdf --size 700x,50x50 --format gif --pages 3,10-15,42
Docsplit.extract_images('example.doc', :size => '1000x', :format => [:png, :jpg])
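The --pages value in the command-line example mixes single pages and ranges. As a rough illustration of how such a spec maps to page numbers, here is a minimal sketch. The expand_pages helper is hypothetical and written only for this example; it is not part of Docsplit's API and may not match how Docsplit parses the flag internally.

```ruby
# Hypothetical helper (not part of Docsplit): expand a page spec such as
# "3,10-15,42" into a sorted list of unique page numbers.
def expand_pages(spec)
  spec.split(',').flat_map do |part|
    part = part.strip
    if part.include?('-')
      # A range like "10-15" covers both endpoints.
      first, last = part.split('-', 2).map { |n| Integer(n) }
      (first..last).to_a
    else
      [Integer(part)]
    end
  end.uniq.sort
end

p expand_pages('3,10-15,42')
# prints [3, 10, 11, 12, 13, 14, 15, 42]
```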
text --pages --ocr --no-ocr --no-clean --language --no-orientation-detection
Ruby: extract_text
Extract the complete UTF-8-encoded plain text of a document to a
single file. If you'd like to extract the text for each page separately,
pass --pages all. You can use the --ocr and --no-ocr
flags to force OCR, or disable it, respectively. By default (if Tesseract is installed)
Docsplit will OCR the text of each page for which it fails to extract text
directly from the document. Docsplit will also attempt to clean up garbage
characters in the OCR'd text — to disable this, pass the
--no-clean flag.
By default, Tesseract ships with only English language data. If any additional language models are installed, you can select one using the --language flag. If Tesseract's orientation detection model is installed, Docsplit will use it automatically unless you pass --no-orientation-detection.
docsplit text path/to/doc.pdf --pages all --language deu
docs = Dir['storage/originals/*.doc']
Docsplit.extract_text(docs, :ocr => false, :output => 'storage/text')
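The default OCR behavior described above works page by page: OCR runs only for pages where direct extraction yields no text. The sketch below models that decision under stated assumptions. The text_for_pages method, its mode values, and the two lambdas are all hypothetical stand-ins for direct extraction and Tesseract; this is not Docsplit's actual implementation.

```ruby
# Sketch of the per-page fallback. `direct` and `ocr` are callables that
# take a page number and return text (stand-ins, not Docsplit APIs).
def text_for_pages(page_numbers, direct:, ocr:, mode: :auto)
  page_numbers.map do |page|
    case mode
    when :force_ocr then ocr.call(page)     # comparable to --ocr
    when :no_ocr    then direct.call(page)  # comparable to --no-ocr
    else
      # Default: only OCR pages with no directly extractable text.
      text = direct.call(page)
      text.strip.empty? ? ocr.call(page) : text
    end
  end
end

# Page 1 has embedded text; page 2 is a scan with only whitespace.
embedded = { 1 => "Hello", 2 => "  " }
direct = ->(page) { embedded.fetch(page, '') }
ocr    = ->(page) { "ocr text for page #{page}" }

p text_for_pages([1, 2], direct: direct, ocr: ocr)
# prints ["Hello", "ocr text for page 2"]
```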
pages --pages
Ruby: extract_pages
Burst apart a document into single-page PDFs. Use --pages to
specify the individual pages (or ranges of pages) you'd like to generate.
docsplit pages path/to/doc.pdf --pages 1-10
Docsplit.extract_pages('path/to/presentation.ppt')
Docsplit.extract_pages('doc.pdf', :pages => 1..10)
pdf
Ruby: extract_pdf
Convert documents into PDFs. Any type of document that LibreOffice can read
may be converted. These include the Microsoft Office formats: doc, docx, ppt,
xls and so on, as well as html, odf, rtf, swf, svg, and wpd.
The first time that you convert a new file type, LibreOffice will lazy-load
the code that processes it — subsequent conversions will be much faster.
docsplit pdf documentation/*.html
Docsplit.extract_pdf('expense_report.xls')
author, date, creator, keywords, producer, subject, title, length
Ruby: extract_...
Retrieve a piece of metadata about the document. The docsplit
utility prints the value to stdout; the Ruby API returns it.
docsplit title path/to/stooges.pdf => Disorder in the Court
Docsplit.extract_length('path/to/stooges.pdf') => 36
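Since each metadata key has its own extract_... method, gathering all of them for one document is a short loop. The following is a minimal sketch: collect_metadata and METADATA_KEYS are names introduced for this example, and the extractor parameter is assumed to be the Docsplit module (or any object that responds to the same methods), which keeps the sketch runnable without the gem.

```ruby
# Keys taken from the list of metadata commands above.
METADATA_KEYS = %w[author date creator keywords producer subject title length].freeze

# Collect every metadata field into a hash keyed by symbol.
# `extractor` is any object responding to extract_author, extract_title,
# and so on; normally that would be the Docsplit module.
def collect_metadata(path, extractor)
  METADATA_KEYS.each_with_object({}) do |key, info|
    info[key.to_sym] = extractor.public_send("extract_#{key}", path)
  end
end

# With the gem installed:
#   require 'docsplit'
#   collect_metadata('path/to/stooges.pdf', Docsplit)
```

Each call inspects the document separately, so this favors convenience over speed.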
Under the hood, Docsplit is a thin wrapper around the excellent GraphicsMagick, Poppler, PDFTK, Tesseract, and LibreOffice libraries. Poppler is used to extract text and metadata from PDF documents, PDFTK is used to split them apart into pages, and GraphicsMagick is used to generate the page images (internally, it's rendering them with Ghostscript). LibreOffice and GraphicsMagick convert documents and images to PDF. Tesseract provides the transparent OCR fallback when the document is a simple scan and the file doesn't contain any embedded text.
Because documents need to be in PDF format before any metadata, text, or images are extracted, it's faster to use docsplit pdf to convert the document up front if you're planning to run more than one extraction. Otherwise, Docsplit will write out the PDF version to a temporary file before proceeding with each command.
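The convert-once pattern might look like the sketch below. The split_document method is hypothetical, and the splitter parameter is assumed to be the Docsplit module; it is passed in so the sketch can run without the gem. The path of the converted file is also an assumption (the original basename with a .pdf extension inside the output directory), so check it against your setup.

```ruby
# Convert once, then run every extraction against the resulting PDF, so the
# conversion to PDF is not repeated for each command.
def split_document(doc, out_dir, splitter)
  splitter.extract_pdf(doc, :output => out_dir)
  # Assumption: the converted file lands in out_dir as <basename>.pdf.
  pdf = File.join(out_dir, File.basename(doc, '.*') + '.pdf')
  splitter.extract_text(pdf, :output => out_dir)
  splitter.extract_images(pdf, :size => '700x', :format => :png, :output => out_dir)
  pdf
end

# With the gem installed:
#   require 'docsplit'
#   split_document('expense_report.xls', 'storage/split', Docsplit)
```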
0.7.6 – Nov. 16, 2014
Docsplit will now automatically use Tesseract's orientation detection model
if it is installed.
0.7.5 – May 28, 2014
Docsplit will detect PDFs regardless of extension using magic number-based
detection.
0.7.2 – Feb. 23, 2013
Bug fixes for LibreOffice support.
0.7.0 – Feb. 23, 2013
Docsplit now expresses a preference for LibreOffice over OpenOffice, with
an eye to removing JODConverter and OpenOffice support in future versions
(direct LibreOffice support is substantially faster than JODConverter).
Improved Unicode support now correctly collects non-ASCII characters from
pdfinfo.
0.6.4 – Nov. 12, 2012
Added a language flag for the Docsplit command line, fixed several bugs,
and began preparations for the deprecation of pdftk.
0.6.2 – Nov. 22, 2011
Bugfix to escape document names during file type detection.
0.6.1 – Nov. 18, 2011
Docsplit now supports converting documents using LibreOffice
as well as OpenOffice, through JODConverter 3.0 beta4.
0.6.0 – Sept. 13, 2011
Docsplit should now handle shelling out for documents with arbitrary
characters in their filenames correctly, thanks to a series of
epic patches from Vladimir Rybas.
A --density option was added for specifying the resolution of
rasterization when generating images from documents.
The image resolution for OCR has been doubled from 200 to 400 DPI —
this shouldn't make a noticeable difference for normal docs, but will make
a world of difference for the fine print.
Docsplit now uses GraphicsMagick's --despeckle before OCR.
0.5.2 – May 13, 2011
For transparent conversion to PDF, made Docsplit prefer GraphicsMagick
over OpenOffice, when the file format is one that GraphicsMagick is able
to read: (png, gif, jpg, jpeg, tif, tiff, bmp, pnm, ppm, svg, eps).
0.5.1 – April 26, 2011
Minor tweaks to the TextCleaner to be more lenient about acronyms
with hyphens, and words with four vowels in a row.
0.5.0
Added a Docsplit::TextCleaner class which is used to post-process
OCR'd text, and remove garbage characters that are created when Tesseract
encounters non-English text. To disable the cleanup, pass --no-clean.
0.4.1
Upgraded the JODConverter dependency for PDF conversion via OpenOffice to
3.0 beta. Added PNG, GIF, TIF, JPG, and BMP to the list of supported
formats.
0.3.4
Adding a suggested optimization from the GraphicsMagick list -- only ever
generate one page image per GraphicsMagick call. Saves large amounts of
disk space for tempfiles on long documents.
0.3.3
Start using the MAGICK_TMPDIR environment variable to prevent parallel
Docsplit runs from having the potential to clobber each other's temporary
image files.
0.3.1
Added a memory limit to GraphicsMagick while generating the TIFFs for
Tesseract OCR -- prevents gm from gobbling up all available memory
on large files.
0.3.0
OCR support added via Tesseract, and the --ocr and --no-ocr
flags. PDFBox is no longer a dependency, and the gem is many megabytes
lighter for it.
0.2.0
Moving to Poppler's pdftotext. PDFBox had issues with Unicode in PDFs
and incorrectly split individual pages of text.
0.1.3
Fixing a bug with specifying explicit page ranges for image extraction.
0.1.2
Limiting the memory usage of GraphicsMagick to avoid out of memory errors
on very large PDFs.
0.1.1
Upgraded for compatibility with GraphicsMagick 1.3.11.
0.1.0
Initial Docsplit release.