PdfIndexer
OsProject | |
---|---|
id | pdfindexer |
state | |
owner | WolfgangFahl |
title | Java Library and Tool to Index and search PDF files using Apache Lucene and PDF Box |
url | https://github.com/WolfgangFahl/pdfindexer |
version | 0.0.11 |
description | |
date | 2018/08/22 |
since | |
until |
Motivation
In one of our projects we were asked to check a few dozen PDF documents for consistency, so we needed a way to cross-reference the documents and find keywords. At the time there was no SimpleGraph project yet, so we created a special-purpose solution and made it available as open source.
Using in Docker
In issue #4, peebles asked how to run the example in a Docker container.
Start an OpenJDK 8 container that mounts the pdfindexer directory:
# get a fresh version of the PDF Indexer
git clone https://github.com/WolfgangFahl/pdfindexer
# change to the directory
cd pdfindexer
# run a Docker container with OpenJDK 8, mounting the current directory as /deploy
docker run --rm -it -v "$(pwd)":/deploy -w /deploy openjdk:8 bash
Trying things out
root@d6113050b3c6:/deploy# mkdir test/html
root@d6113050b3c6:/deploy# java -jar pdfindex.jar --sourceFileList test/pdffiles.lst --idxfile test/index2 --outputfile test/html/pdfindex.html --searchKeyWordList test/searchwords.txt --root test/
adding test/pdfsource1/LoremIpsum.pdf to index
Aug 22, 2018 2:43:03 PM org.apache.pdfbox.pdfparser.NonSequentialPDFParser checkXrefOffsets
SEVERE: Can't find the object 8 0 (origin offset 0)
creating output test/html/pdfindex.html
root@d6113050b3c6:/deploy# wc -l test/html/pdfindex.html
473 test/html/pdfindex.html
Character set handling
Issue #7 (https://github.com/WolfgangFahl/pdfindexer/issues/7) reports that UTF-8 handling might not work from the command line in certain cases. I assume this is because Java's default character set depends on the environment (e.g. macOS or Windows). It is possible to override this setting; see https://stackoverflow.com/a/10890594/1497139.
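To check which character set a given JVM actually picks up, a small diagnostic like the following may help. This is only a sketch and not part of pdfindexer: the class name `CharsetCheck` and the helper `isUtf8` are made up for illustration. It uses only the standard `java.nio.charset.Charset` API and the `file.encoding` system property.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of pdfindexer.
public class CharsetCheck {

    // Returns true if the given charset is UTF-8.
    static boolean isUtf8(Charset cs) {
        return StandardCharsets.UTF_8.equals(cs);
    }

    public static void main(String[] args) {
        // The charset the JVM uses when no explicit charset is given.
        Charset defaultCharset = Charset.defaultCharset();
        System.out.println("default charset: " + defaultCharset.name());
        // The system property that the default is commonly derived from.
        System.out.println("file.encoding:   " + System.getProperty("file.encoding"));
        if (!isUtf8(defaultCharset)) {
            System.out.println("The default charset is not UTF-8; "
                + "consider starting the JVM with -Dfile.encoding=UTF-8");
        }
    }
}
```

Running it once as `java CharsetCheck` and once as `java -Dfile.encoding=UTF-8 CharsetCheck` should show whether the override takes effect in your environment. If it does, the same option can presumably be placed before `-jar pdfindex.jar` on the pdfindexer command line; whether that resolves the behaviour reported in the issue is an assumption that has not been verified here.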