Genwiki2024
OsProject
OsProject | |
---|---|
id | Genwiki2024 |
state | active |
owner | WolfgangFahl |
title | Genwiki |
url | https://github.com/WolfgangFahl/genwiki2024 |
version | 0.0.4 |
description | Genealogy Semantification |
date | 2025-03-04 |
since | 2024-08-15 |
until |
tickets
Installation
pip install genwiki2024
# alternatively if your pip is not a python3 pip
pip3 install genwiki2024
# local install from source directory of genwiki2024
pip install .
upgrade
pip install genwiki2024 -U
# alternatively if your pip is not a python3 pip
pip3 install genwiki2024 -U
Introduction
DjVu is a computer file format created by Léon Bottou and others with the goal of storing scanned documents efficiently. At the time of its creation the priority was to save disk space and computing resources, since these were precious and limited.
DjVu has elaborate mechanisms that provide this efficiency, refined to the point of academic perfection. The tradeoff is that DjVu files can only be processed with DjVu tools, and there is essentially only one implementation available: DjVuLibre. As of 2025 the DjVuLibre git repository has not been touched for 8 years and has only 23 stars. The original SourceForge repository https://djvu.sourceforge.net/ saw its latest release in 2020.
While DjVu is a great technology and a very useful special purpose solution, it is not optimal for modern AI pipelines and other applications where other aspects of processing play a more important role.
djvuconv therefore repackages DjVu files as tarballs that contain a lossless PNG version of each page, a "table of contents" in YAML format and, optionally, prepackaged thumbnails in JPEG format.
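The archive layout described above can be illustrated with a minimal Python sketch that uses only the standard library. It is not djvuconv's actual implementation: the member names (`toc.yaml`, `page_0001.png`), the YAML keys and the helper functions `build_archive` and `list_archive` are hypothetical, and the page data is a placeholder rather than a real image.

```python
import io
import tarfile

# Placeholder page content: only the 8-byte PNG signature, not a valid image.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def build_archive(pages, title):
    """Pack page images and a small YAML table of contents into a tar archive.

    `pages` maps hypothetical member names to PNG bytes; the YAML keys used
    here are assumptions, not the format djvuconv actually writes.
    """
    names = sorted(pages)
    toc_lines = [f"title: {title}", f"page_count: {len(names)}", "pages:"]
    toc_lines += [f"  - {name}" for name in names]
    toc = ("\n".join(toc_lines) + "\n").encode("utf-8")

    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        # Write the table of contents first, then the pages in sorted order.
        for name, data in [("toc.yaml", toc)] + [(n, pages[n]) for n in names]:
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


def list_archive(archive):
    """Return the member names of an in-memory tar archive."""
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r") as tar:
        return tar.getnames()


archive = build_archive(
    {"page_0001.png": PNG_SIGNATURE, "page_0002.png": PNG_SIGNATURE},
    title="Minden-AB-1939",
)
print(list_archive(archive))
```

Because the result is an ordinary tar archive, any standard tool can list or extract it without DjVu-specific software, which is the point of the repackaging.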
djvuconv is written in Python but uses the original DjVuLibre library through the python-djvulibre library by Friedrich Fröbel. That library is a fork of the original python-djvulibre (Copyright © 2010-2021 Jakub Wilk <jwilk@jwilk.net>, GNU General Public License version 2), which Jakub archived in 2022.
Motivation
DjVu support in MediaWiki has been degrading over the past few years, see:
- Wikimedia Phabricator page
- DjVu thumbnails are not being generated at all
- Stack Overflow question: DjVuImage::getMultiPageInfo: multi-page DJVU file contained no pages
Goals
- create a simple cross-platform format for scanned documents
- use PNG, JPEG, PDF and YAML formats which have plenty of tools for viewing, processing and handling
- add native OCR support
- add native AI/LLM support
djvu catalog and remote viewer
Demos
- https://genwiki2024.bitplan.com - only a single file available Minden-AB-1939
- https://genwiki2024.bitplan.com/djvu/Minden-AB-1939.djvu
- https://genwiki2024.bitplan.com/djvu/Minden-AB-1939.djvu/page/0.1/15.jpg
API
AI experiments
see https://cr.bitplan.com/index.php/Genealogy2025-02-28 and https://cr.bitplan.com/index.php/Genealogy2024-09-19
djvuconv
djvuconv converts a collection of DjVu files to tarballs. It optionally collects the metadata, which may be stored in a database for further analysis and processing.
Command Line Usage
djvuconv -h
usage: djvuconv [-h] [--base-path BASE_PATH] [--batch-size BATCH_SIZE]
                --command {catalog,convert,thumbnails,dbupdate} [-d]
                [--db-path DB_PATH] [-f] [--limit LIMIT]
                [--max-errors MAX_ERRORS] [--max-workers MAX_WORKERS]
                [--output-path OUTPUT_PATH] [--serial] [--sort {asc,desc}]
                [-v] [--url URL]

Process DjVu files

options:
  -h, --help            show this help message and exit
  --base-path BASE_PATH
                        Base path for DjVu files
  --batch-size BATCH_SIZE
                        Number of pages to process in each batch (default:
                        100)
  --command {catalog,convert,thumbnails,dbupdate}
                        Command to execute
  -d, --debug           Enable debugging
  --db-path DB_PATH     Path to the database
  -f, --force           Force recreation
  --limit LIMIT         Maximum number of pages to process
  --max-errors MAX_ERRORS
                        Maximum allowed error percentage before skipping
                        database update
  --max-workers MAX_WORKERS
                        Maximum number of worker threads (default: CPU count *
                        4)
  --output-path OUTPUT_PATH
                        Path for PNG files
  --serial              Use serial processing - parallel is default
  --sort {asc,desc}     Sort by page count (asc=smallest first)
  -v, --verbose         Enable debugging
  --url URL             Process a single DjVu file (only valid in convert
                        mode)
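
As an illustration of how the options above combine, the following Python sketch assembles a djvuconv command line. The helper `djvuconv_args` and the paths `/data/djvu` and `/data/out` are hypothetical; only the flag names, the command choices and the rule that `--url` applies to convert mode are taken from the help text. The sketch builds and prints the command without running it.

```python
import shlex

# Command choices as listed in the help output above.
VALID_COMMANDS = ("catalog", "convert", "thumbnails", "dbupdate")


def djvuconv_args(command, base_path=None, output_path=None, db_path=None,
                  limit=None, serial=False, force=False, url=None):
    """Build an argument list for djvuconv (hypothetical helper)."""
    if command not in VALID_COMMANDS:
        raise ValueError(f"unknown command: {command}")
    if url is not None and command != "convert":
        # The help text states that --url is only valid in convert mode.
        raise ValueError("--url is only valid in convert mode")

    args = ["djvuconv", "--command", command]
    # Options that take a value are added only when a value was given.
    for flag, value in (("--base-path", base_path),
                        ("--output-path", output_path),
                        ("--db-path", db_path),
                        ("--limit", limit),
                        ("--url", url)):
        if value is not None:
            args += [flag, str(value)]
    # Boolean switches.
    if serial:
        args.append("--serial")
    if force:
        args.append("--force")
    return args


# Example: convert at most 50 pages from a hypothetical base path.
args = djvuconv_args("convert", base_path="/data/djvu",
                     output_path="/data/out", limit=50)
print(shlex.join(args))
```

To actually run the conversion, the list can be passed to `subprocess.run(args, check=True)` on a machine where djvuconv is installed.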