Difference between revisions of "PdfIndexer"
Revision as of 07:14, 1 September 2018
OsProject | |
---|---|
id | pdfindexer |
state | |
owner | WolfgangFahl |
title | Java Library and Tool to Index and search PDF files using Apache Lucene and PDF Box |
url | https://github.com/WolfgangFahl/pdfindexer |
version | 0.0.11 |
description | |
date | 2018/08/22 |
since | |
until |
Motivation
In one of our projects we were asked to check a few dozen PDF documents for consistency, so we needed a way to cross-reference the documents and find keywords. At the time there was no SimpleGraph project yet, so we created a special-purpose solution and made it available as open source.
Using in Docker
In Issue #4 (https://github.com/WolfgangFahl/pdfindexer/issues/4), peebles asked how the example could be run in a Docker container.
Start an OpenJDK8 container mounting the pdfindexer directory
# get a fresh version of the PDF Indexer
git clone https://github.com/WolfgangFahl/pdfindexer
# change to the directory
cd pdfindexer
# run a Docker container with OpenJDK Java 8, mounting the current directory at /deploy
docker run --rm -it -v "$(pwd)":/deploy -w /deploy openjdk:8 bash
Trying things out
root@d6113050b3c6:/deploy# mkdir test/html
root@d6113050b3c6:/deploy# java -jar pdfindex.jar --sourceFileList test/pdffiles.lst --idxfile test/index2 --outputfile test/html/pdfindex.html --searchKeyWordList test/searchwords.txt --root test/
adding test/pdfsource1/LoremIpsum.pdf to index
Aug 22, 2018 2:43:03 PM org.apache.pdfbox.pdfparser.NonSequentialPDFParser checkXrefOffsets
SEVERE: Can't find the object 8 0 (origin offset 0)
creating output test/html/pdfindex.html
root@d6113050b3c6:/deploy# wc -l test/html/pdfindex.html
473 test/html/pdfindex.html
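The interactive session above could also be run as a single non-interactive command. The following is only a sketch, assuming it is started from the cloned pdfindexer directory and that pdfindex.jar and the test files are present there as in the session above. The function name run_pdfindexer_example and the DOCKER variable are hypothetical helpers introduced for illustration; they are not part of pdfindexer. The indexer flags are copied unchanged from the session, and mkdir -p is used instead of mkdir so that the command can be repeated.

```shell
# Sketch: run the pdfindexer example in a throwaway OpenJDK 8 container.
# Hypothetical helper, not part of pdfindexer itself.
run_pdfindexer_example() {
  # Command line executed inside the container (flags taken from the session above).
  inner='mkdir -p test/html && java -jar pdfindex.jar --sourceFileList test/pdffiles.lst --idxfile test/index2 --outputfile test/html/pdfindex.html --searchKeyWordList test/searchwords.txt --root test/ && wc -l test/html/pdfindex.html'

  # DOCKER is an assumed override hook, e.g. DOCKER=echo prints the command
  # instead of running it. No -it is needed because nothing is interactive.
  "${DOCKER:-docker}" run --rm -v "$(pwd)":/deploy -w /deploy openjdk:8 sh -c "$inner"
}
```

Calling run_pdfindexer_example from the pdfindexer directory should leave the generated HTML in test/html/pdfindex.html on the host, since the directory is mounted into the container at /deploy.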