Genwiki2024

From BITPlan Wiki
Revision as of 08:03, 5 March 2025 by Wf (talk | contribs) (→‎Demos)

OsProject
id  Genwiki2024
state  active
owner  WolfgangFahl
title  Genwiki
url  https://github.com/WolfgangFahl/genwiki2024
version  0.0.4
description  Genealogy Semantification
date  2025-03-04
since  2024-08-15
until  

tickets

Installation

pip install genwiki2024
# alternatively if your pip is not a python3 pip
pip3 install genwiki2024 
# local install from source directory of genwiki2024 
pip install .

upgrade

pip install genwiki2024 -U
# alternatively if your pip is not a python3 pip
pip3 install genwiki2024 -U


Introduction

DjVu is a computer file format created by Léon Bottou and others with the goal of storing scanned documents efficiently. At the time of its creation the priority was to save disk space and computer resources, since these were precious and limited.

DjVu has elaborate mechanisms to achieve this efficiency, refined to an almost academic degree. The tradeoff is that DjVu files can only be processed with DjVu tools, and there is basically only a single implementation available: DjVuLibre. As of 2025 the DjVuLibre git repository has not been touched for 8 years and has only 23 stars. The original SourceForge project at https://djvu.sourceforge.net/ saw its latest release in 2020.

While DjVu is a great technology and a very useful special purpose solution, it is not optimal for modern AI pipelines and other applications where other aspects of processing play a more important role.

djvuconv therefore repackages DjVu files as tarballs that contain a lossless PNG version of each page, a "table of contents" in YAML format, and optionally prepackaged thumbnails in JPEG format.
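
As a rough illustration of such a package, the following Python sketch builds a tarball from already rendered PNG pages and lists its members again. The member names (`toc.yaml`, `page_0001.png`), the YAML keys and the functions `build_package` and `list_package` are assumptions made for this example and are not taken from djvuconv itself. The YAML is written by hand to keep the sketch free of third-party dependencies.

```python
import io
import json
import tarfile


def build_package(title: str, pages: list[bytes]) -> bytes:
    """Pack PNG page images plus a small YAML table of contents into a tarball.

    Hypothetical layout for illustration only, not djvuconv's actual format.
    """
    names = [f"page_{i:04d}.png" for i in range(1, len(pages) + 1)]
    # A JSON string is also a valid YAML double-quoted scalar,
    # so json.dumps takes care of quoting the title.
    toc_lines = [f"title: {json.dumps(title)}", f"page_count: {len(pages)}", "pages:"]
    toc_lines += [f"  - {name}" for name in names]
    toc = ("\n".join(toc_lines) + "\n").encode("utf-8")

    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        # Table of contents first, then the pages in reading order
        for name, data in [("toc.yaml", toc)] + list(zip(names, pages)):
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


def list_package(package: bytes) -> list[str]:
    """Return the member names of a tarball held in memory."""
    with tarfile.open(fileobj=io.BytesIO(package), mode="r") as tar:
        return tar.getnames()
```

Because the result is an ordinary tarball with PNG and YAML members, any standard tool can inspect it without DjVu-specific software, which is the point of the repackaging.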

djvuconv is written in Python but makes use of the original DjVuLibre library via the python-djvulibre library by Friedrich Fröbel. This is a fork of the original python-djvulibre library (Copyright © 2010-2021 Jakub Wilk <jwilk@jwilk.net>, GNU General Public License version 2), which Jakub archived in 2022.

Motivation

DjVu support in MediaWiki has been degrading over the past few years see

Goals

  • create a simple cross-platform format for scanned documents
  • use PNG, JPEG, PDF and YAML formats which have plenty of tools for viewing, processing and handling
  • add native OCR support
  • add native AI/LLM support

djvu catalog and remote viewer

Demos

API

AI experiments

see https://cr.bitplan.com/index.php/Genealogy2025-02-28 and https://cr.bitplan.com/index.php/Genealogy2024-09-19

djvuconv

djvuconv converts collections of DjVu files to tarballs. It optionally collects the metadata, which may be stored in a database for further analysis and processing.

Command Line Usage

djvuconv -h
usage: djvuconv [-h] [--base-path BASE_PATH] [--batch-size BATCH_SIZE]
                --command {catalog,convert,thumbnails,dbupdate} [-d]
                [--db-path DB_PATH] [-f] [--limit LIMIT]
                [--max-errors MAX_ERRORS] [--max-workers MAX_WORKERS]
                [--output-path OUTPUT_PATH] [--serial] [--sort {asc,desc}]
                [-v] [--url URL]

Process DjVu files

options:
  -h, --help            show this help message and exit
  --base-path BASE_PATH
                        Base path for DjVu files
  --batch-size BATCH_SIZE
                        Number of pages to process in each batch (default:
                        100)
  --command {catalog,convert,thumbnails,dbupdate}
                        Command to execute
  -d, --debug           Enable debugging
  --db-path DB_PATH     Path to the database
  -f, --force           Force recreation
  --limit LIMIT         Maximum number of pages to process
  --max-errors MAX_ERRORS
                        Maximum allowed error percentage before skipping
                        database update
  --max-workers MAX_WORKERS
                        Maximum number of worker threads (default: CPU count *
                        4)
  --output-path OUTPUT_PATH
                        Path for PNG files
  --serial              Use serial processing - parallel is default
  --sort {asc,desc}     Sort by page count (asc=smallest first)
  -v, --verbose         Enable debugging
  --url URL             Process a single DjVu file (only valid in convert
                        mode)
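
When djvuconv is driven from a script, the options above can be assembled programmatically. The following Python sketch only uses flags that appear in the help text. The helper name `djvuconv_args` and the example paths are hypothetical and chosen for illustration.

```python
import shlex
from typing import Optional


def djvuconv_args(
    command: str,
    base_path: Optional[str] = None,
    output_path: Optional[str] = None,
    limit: Optional[int] = None,
    serial: bool = False,
) -> list[str]:
    """Assemble a djvuconv command line from options listed in the help text."""
    allowed = {"catalog", "convert", "thumbnails", "dbupdate"}
    if command not in allowed:
        raise ValueError(f"unknown command: {command}")
    # --command is the only required option according to the usage line
    args = ["djvuconv", "--command", command]
    if base_path is not None:
        args += ["--base-path", base_path]
    if output_path is not None:
        args += ["--output-path", output_path]
    if limit is not None:
        args += ["--limit", str(limit)]
    if serial:
        args.append("--serial")
    return args


# Placeholder paths, adjust to the local setup
args = djvuconv_args("convert", base_path="/data/djvu", output_path="/data/png", limit=50)
print(shlex.join(args))
# prints: djvuconv --command convert --base-path /data/djvu --output-path /data/png --limit 50
```

With djvuconv installed, the resulting list can be passed to `subprocess.run(args, check=True)`. Keeping the arguments as a list avoids shell quoting issues for paths that contain spaces.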