PdfIndexer

From BITPlan Wiki
Revision as of 07:14, 1 September 2018 by Wf (talk | contribs) (→‎Motivation)
Jump to navigation Jump to search
OsProject
edit
id  pdfindexer
state  
owner  WolfgangFahl
title  Java Library and Tool to Index and search PDF files using Apache Lucene and PDF Box
url  https://github.com/WolfgangFahl/pdfindexer
version  0.0.11
description  
date  2018/08/22
since  
until  

Motivation

In one of our project we were asked to check a few dozen PDF documents for consistency. So we needed a way to cross-reference the documents and find keywords. At the time there was no SimpleGraph project yet and we created a special solution and made it available as OpenSource.

Using in Docker

In Issue #4 peebles asked how the example would be run in a docker container.

Start an OpenJDK8 container mounting the pdfindexer directory

# get a fresh version of the PDF Indexer
git clone https://github.com/WolfgangFahl/pdfindexer
# change to the directory
cd pdfindexer
# run a docker Container with OpenJDK Java 8
docker run --rm -it -v $(pwd):/deploy -w /deploy openjdk:8 bash

Trying things out

root@d6113050b3c6:/deploy# mkdir test/html
root@d6113050b3c6:/deploy# java -jar pdfindex.jar --sourceFileList test/pdffiles.lst --idxfile test/index2 --outputfile test/html/pdfindex.html --searchKeyWordList test/searchwords.txt --root test/
adding test/pdfsource1/LoremIpsum.pdf to index
Aug 22, 2018 2:43:03 PM org.apache.pdfbox.pdfparser.NonSequentialPDFParser checkXrefOffsets
SEVERE: Can't find the object 8 0 (origin offset 0)
creating output test/html/pdfindex.html
root@d6113050b3c6:/deploy# wc -l test/html/pdfindex.html
473 test/html/pdfindex.html