Latest revision as of 14:39, 2 July 2020

OsProject
id	pdfindexer
state
owner	WolfgangFahl
title	Java Library and Tool to Index and search PDF files using Apache Lucene and PDF Box
url	https://github.com/WolfgangFahl/pdfindexer
version	0.0.11
description
date	2018/08/22
since
until

Motivation

In one of our projects we were asked to check a few dozen PDF documents for consistency, so we needed a way to cross-reference the documents and find keywords. At the time there was no SimpleGraph project yet, so we created a special-purpose solution and made it available as open source.

Using in Docker

In Issue #4 (https://github.com/WolfgangFahl/pdfindexer/issues/4), peebles asked how the example could be run in a Docker container.

Start an OpenJDK8 container mounting the pdfindexer directory

# get a fresh version of the PDF Indexer
git clone https://github.com/WolfgangFahl/pdfindexer
# change to the directory
cd pdfindexer
# run a Docker container with OpenJDK Java 8
docker run --rm -it -v $(pwd):/deploy -w /deploy openjdk:8 bash

Trying things out

root@d6113050b3c6:/deploy# mkdir test/html
root@d6113050b3c6:/deploy# java -jar pdfindex.jar --sourceFileList test/pdffiles.lst --idxfile test/index2 --outputfile test/html/pdfindex.html --searchKeyWordList test/searchwords.txt --root test/
adding test/pdfsource1/LoremIpsum.pdf to index
Aug 22, 2018 2:43:03 PM org.apache.pdfbox.pdfparser.NonSequentialPDFParser checkXrefOffsets
SEVERE: Can't find the object 8 0 (origin offset 0)
creating output test/html/pdfindex.html
root@d6113050b3c6:/deploy# wc -l test/html/pdfindex.html
473 test/html/pdfindex.html

Character set handling

Issue #7 (https://github.com/WolfgangFahl/pdfindexer/issues/7) reports that UTF-8 handling might not work from the command line in certain cases. I assume this is because Java's default character set depends on the environment (e.g. macOS/Windows). It is possible to override this setting; see https://stackoverflow.com/a/10890594/1497139.
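To check which character set a given environment actually hands to Java, a small diagnostic along the following lines may help. This is only a sketch and not part of pdfindexer: the class name `CharsetCheck` and the helper `decodeUtf8` are made up for illustration, and whether a charset override resolves the behaviour reported in issue #7 has not been verified here.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration only; it is not part of pdfindexer.
public class CharsetCheck {

  /** Decodes bytes as UTF-8 explicitly, independent of the platform default. */
  public static String decodeUtf8(byte[] bytes) {
    return new String(bytes, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    // On Java 8 the default charset is derived from the environment
    // (locale settings, operating system) when the JVM starts.
    System.out.println("file.encoding   = " + System.getProperty("file.encoding"));
    System.out.println("default charset = " + Charset.defaultCharset());

    // The two bytes 0xC3 0xA4 are the UTF-8 encoding of U+00E4 (a with diaeresis).
    byte[] utf8Bytes = {(byte) 0xC3, (byte) 0xA4};

    // Decoding with the platform default may or may not yield U+00E4 ...
    String viaDefault = new String(utf8Bytes, Charset.defaultCharset());
    // ... while decoding with an explicit charset always does.
    String viaUtf8 = decodeUtf8(utf8Bytes);

    System.out.println("default decoding matches: " + viaDefault.equals("\u00e4"));
    System.out.println("UTF-8 decoding matches:   " + viaUtf8.equals("\u00e4"));
  }
}
```

Running this once as `java CharsetCheck` and once as `java -Dfile.encoding=UTF-8 CharsetCheck` inside the container shows whether the environment default differs from UTF-8. Passing `-Dfile.encoding=UTF-8` on the `java` command line is a commonly used way to override the default, although on Java 8 it is a convention rather than a documented guarantee.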


Difference between revisions of "PdfIndexer"
