Difference between revisions of "PdfIndexer"
Latest revision as of 12:48, 23 October 2025
| OsProject | |
|---|---|
| id | pdfindexer |
| state | |
| owner | WolfgangFahl |
| title | Java Library and Tool to Index and search PDF files using Apache Lucene and PDF Box |
| url | https://github.com/WolfgangFahl/pdfindexer |
| version | 0.0.12 |
| description | |
| date | 2025-10-23 |
| since | 2013 |
| until | |
Motivation
In one of our projects we were asked to check a few dozen PDF documents for consistency, so we needed a way to cross-reference the documents and find keywords. At the time there was no SimpleGraph project yet, so we created a dedicated solution and made it available as open source.
Using in Docker
In Issue #4 (https://github.com/WolfgangFahl/pdfindexer/issues/4), peebles (https://github.com/peebles) asked how to run the example in a Docker container.
Start an OpenJDK8 container mounting the pdfindexer directory
# get a fresh version of the PDF Indexer
git clone https://github.com/WolfgangFahl/pdfindexer
# change to the directory
cd pdfindexer
# run a Docker container with OpenJDK Java 8
# (quoting "$(pwd)" keeps the mount working if the path contains spaces)
docker run --rm -it -v "$(pwd)":/deploy -w /deploy openjdk:8 bash
Trying things out
root@d6113050b3c6:/deploy# mkdir test/html
root@d6113050b3c6:/deploy# java -jar pdfindex.jar --sourceFileList test/pdffiles.lst --idxfile test/index2 --outputfile test/html/pdfindex.html --searchKeyWordList test/searchwords.txt --root test/
adding test/pdfsource1/LoremIpsum.pdf to index
Aug 22, 2018 2:43:03 PM org.apache.pdfbox.pdfparser.NonSequentialPDFParser checkXrefOffsets
SEVERE: Can't find the object 8 0 (origin offset 0)
creating output test/html/pdfindex.html
root@d6113050b3c6:/deploy# wc -l test/html/pdfindex.html
473 test/html/pdfindex.html
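The interactive session above can also be collapsed into a single non-interactive call. The following is a minimal sketch, assuming the same image, mount point, and test files as in the session; the function name `run_pdfindexer_example` is hypothetical and not part of the project.

```shell
# Hypothetical helper (not part of pdfindexer): runs the example in one shot.
# Assumes it is called from the root of the cloned pdfindexer directory.
run_pdfindexer_example() {
  # create the output directory on the host; it is visible in the
  # container through the /deploy mount, and -p tolerates reruns
  mkdir -p test/html
  # same image, mount, and arguments as in the interactive session
  docker run --rm -v "$(pwd)":/deploy -w /deploy openjdk:8 \
    java -jar pdfindex.jar \
      --sourceFileList test/pdffiles.lst \
      --idxfile test/index2 \
      --outputfile test/html/pdfindex.html \
      --searchKeyWordList test/searchwords.txt \
      --root test/
}
```

Calling `run_pdfindexer_example` should then leave the generated page in test/html/pdfindex.html on the host, since the working directory is mounted into the container.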
Character set handling
https://github.com/WolfgangFahl/pdfindexer/issues/7 reports that UTF-8 handling might not work from the command line in certain cases. I assume this is because Java chooses its default character set based on the environment (e.g. macOS/Windows). It is possible to override this setting; see https://stackoverflow.com/a/10890594/1497139.
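
One common way to pin the JVM's default character set is the `file.encoding` system property, set either per call or through the `JAVA_TOOL_OPTIONS` environment variable. The following is a minimal sketch; that this resolves the behaviour described in issue #7 is an assumption and has not been verified here.

```shell
# Assumption: the problem stems from the JVM's platform-dependent default charset.
# JAVA_TOOL_OPTIONS is picked up by the JVM at startup, so every later
# "java" call from this shell starts with UTF-8 as its default charset.
export JAVA_TOOL_OPTIONS="-Dfile.encoding=UTF-8"

# Per-call alternative, without touching the environment:
# java -Dfile.encoding=UTF-8 -jar pdfindex.jar <same arguments as in the example>
```

When running inside the Docker container, the variable has to be set in the container's shell (or passed with `docker run -e`), since the host environment is not inherited.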