Difference between revisions of "Wikidata Import 2025-05-02"

From BITPlan Wiki
Revision as of 16:42, 4 June 2025

Import

state  
url  https://wiki.bitplan.com/index.php/Wikidata_Import_2025-05-02
target  blazegraph
start  2025-05-02
end  2025-05-03
days  0.6
os  Ubuntu 22.04.3 LTS
cpu  Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz (16 cores)
ram  512
triples  
comment  


This "import" does not use the dump-and-index approach; instead it directly copies a Blazegraph journal file.

Steps

Copy journal file

Source: the https://scatter.red/ Wikidata installation. Using aria2c with 16 connections, the copy initially took some 5 hours but was interrupted. Because aria2c was used in preallocation mode and the script's final message was "download finished", the file looked complete although it was not.

git clone the priv-wd-query

git clone https://github.com/scatter-llc/private-wikidata-query
mkdir -p private-wikidata-query/data
mv data.jnl private-wikidata-query/data
cd private-wikidata-query/data
# use the uid and gid expected by the container
chown 666:66 data.jnl
jh@wikidata:/hd/delta/blazegraph/private-wikidata-query/data$ ls -l
total 346081076
-rw-rw-r-- 1 666 66 1328514809856 May  2 22:07 data.jnl

start docker

docker compose up -d
WARN[0000] /hd/delta/blazegraph/private-wikidata-query/docker-compose.yml: the attribute `version` is obsolete, it will be ignored, please remove it to avoid potential confusion 
[+] Running 3/3
 ✔ Container private-wikidata-query-wdqs-1           Started               0.4s 
 ✔ Container private-wikidata-query-wdqs-proxy-1     Started               0.7s 
 ✔ Container private-wikidata-query-wdqs-frontend-1  Started               1.1s
docker ps | grep wdqs
36dad88ebfdc   wikibase/wdqs-frontend:wmde.11      "/entrypoint.sh ngin…"   About an hour ago   Up 3 minutes                    0.0.0.0:8099->80/tcp, [::]:8099->80/tcp                           private-wikidata-query-wdqs-frontend-1
f0d273cca376   caddy                               "caddy run --config …"   About an hour ago   Up 3 minutes                    80/tcp, 443/tcp, 2019/tcp, 443/udp                                private-wikidata-query-wdqs-proxy-1
d86124984e0f   wikibase/wdqs:0.3.97-wmde.8         "/entrypoint.sh /run…"   About an hour ago   Up 3 minutes                    0.0.0.0:9999->9999/tcp, [::]:9999->9999/tcp                       private-wikidata-query-wdqs-1
6011f5c1cc03   caddy                               "caddy run --config …"   12 months ago       Up 3 days                       80/tcp, 443/tcp, 2019/tcp, 443/udp                                wdqs-wdqs-proxy-1

Incompatible RWStore header version

see https://github.com/blazegraph/database/blob/master/bigdata-core/bigdata/src/java/com/bigdata/rwstore/RWStore.java

docker logs private-wikidata-query-wdqs-1 2>&1 | grep -m 1 "Incompatible RWStore header version"
java.lang.RuntimeException: java.lang.IllegalStateException: Incompatible RWStore header version: storeVersion=0, cVersion=1024, demispace: true
docker exec -it private-wikidata-query-wdqs-1 /bin/bash
diff RWStore.properties RWStore.properties.bak-20250503 
--- RWStore.properties
+++ RWStore.properties.bak-20250503
@@ -56,6 +56,3 @@
    {"valueType":"DOUBLE","multiplier":"1000000000","serviceMapping":"LATITUDE"},\
    {"valueType":"LONG","multiplier":"1","minValue":"0","serviceMapping":"COORD_SYSTEM"}\
   ]}}
-
-# Added to fix Incompatible RWStore header version error
-com.bigdata.rwstore.RWStore.readBlobsAsync=false
docker compose restart wdqs
WARN[0000] /hd/delta/blazegraph/private-wikidata-query/docker-compose.yml: the attribute `version` is obsolete, it will be ignored, please remove it to avoid potential confusion 
[+] Restarting 1/1
 ✔ Container private-wikidata-query-wdqs-1  Started                      11.1s

Download multiple copies in parallel

The script below was named `get` and was used to download two copies in parallel, one to /hd/delta and one to /hd/beta, totalling 32 connections (16 each).

#!/bin/bash
# Corrected aria2 download script (no prealloc, true progress)

ISO_DATE=$(date +%Y%m%d)
DISK=beta
SESSION="bg_${DISK}_${ISO_DATE}"
FILENAME="data.jnl"
DOWNLOAD_DIR="/hd/$DISK/blazegraph"
URL="https://datasets.orbopengraph.com/blazegraph/data.jnl"
CONNECTIONS=16

# Kill stale session
if screen -list | grep -q "$SESSION"; then
  echo "⚠️ Killing existing session '$SESSION'..."
  screen -S "$SESSION" -X quit
fi

# Launch corrected aria2 download
screen -dmS "$SESSION" -L bash -c "
  echo '[INFO] Starting corrected download...';
  cd \"$DOWNLOAD_DIR\";
  aria2c -c -x $CONNECTIONS -s $CONNECTIONS \
    --file-allocation=none \
    --auto-file-renaming=false \
    --dir=\"$DOWNLOAD_DIR\" \
    --out=\"$FILENAME\" \
    \"$URL\";
  EXIT_CODE=\$?;
  if [ \$EXIT_CODE -eq 0 ]; then
    echo '[INFO] Download finished.';
    md5sum \"$FILENAME\" > \"$FILENAME.md5\";
    echo '[INFO] MD5 saved to $FILENAME.md5';
  else
    echo \"[ERROR] Download failed with code \$EXIT_CODE\";
  fi
"

echo "✅ Fixed download started in screen '$SESSION'."
echo "Monitor with: screen -r $SESSION"

md5 check

on source:

 md5sum data.jnl
 e891800af42b979159191487910bd9ae  data.jnl

check script

at 95% progress

tail -f screenlog.0
FILE: /hd/delta/blazegraph/data.jnl
-------------------------------------------------------------------------------

 *** Download Progress Summary as of Mon May  5 07:36:28 2025 ***              
===============================================================================
[#152de7 1,241GiB/1,296GiB(95%) CN:16 DL:16MiB ETA:55m32s]
FILE: /hd/delta/blazegraph/data.jnl
-------------------------------------------------------------------------------

[#152de7 1,242GiB/1,296GiB(95%) CN:16 DL:21MiB ETA:43m39s]
./check 
Comparing:
  file1 → /hd/beta/blazegraph/data.jnl
  file2 → /hd/delta/blazegraph/data.jnl
  blocksize=1MB start=0MB

== log-5 ==
[  0]       1 MB  ✅  MD5 match
[  1]       5 MB  ✅  MD5 match
[  2]      25 MB  ✅  MD5 match
[  3]     125 MB  ✅  MD5 match
[  4]     625 MB  ✅  MD5 match
[  5]   3,125 MB  ✅  MD5 match
[  6]  15,625 MB  ✅  MD5 match
[  7]  78,125 MB  ✅  MD5 match
[  8] 390,625 MB  ✅  MD5 match

Summary: {'✅': 9}

== log-2 ==
100%|██████████████████████████████████████████| 21/21 [00:00<00:00, 212.25it/s]

Summary: {'✅': 21}

== linear-2000 ==
100%|████████████████████████████████████████| 663/663 [00:05<00:00, 116.57it/s]

Summary: {'✅': 607, '⚠️': 56}

== linear-500 ==
100%|██████████████████████████████████████| 2651/2651 [00:21<00:00, 122.85it/s]

Summary: {'✅': 2435, '⚠️': 216}
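The sampling runs above compare MD5 sums of equally sized blocks taken at selected offsets in both copies. The `check` script itself is not shown on this page, so the following is only a minimal sketch of the idea, assuming GNU `dd` and `md5sum` are available. The function names `block_md5` and `compare_blocks` are hypothetical, and the 1 MiB block size in the example is an assumption about what "1MB" means in the output above.

```shell
#!/bin/bash
# Sketch only: compare two files block by block at chosen block indices.
# block_md5 and compare_blocks are hypothetical helper names.

# Print the MD5 of one block: file, block index, block size in bytes.
block_md5() {
  local file=$1 index=$2 blocksize=$3
  dd if="$file" bs="$blocksize" skip="$index" count=1 2>/dev/null | md5sum | cut -d' ' -f1
}

# Compare the given block indices of two files and print one line per block.
# Returns 0 if all sampled blocks match, 1 otherwise.
# Caveat: an index beyond the end of both files hashes empty input twice
# and is therefore reported as a match.
compare_blocks() {
  local file1=$1 file2=$2 blocksize=$3
  shift 3
  local index status=0
  for index in "$@"; do
    if [ "$(block_md5 "$file1" "$index" "$blocksize")" = "$(block_md5 "$file2" "$index" "$blocksize")" ]; then
      echo "[$index] match"
    else
      echo "[$index] MISMATCH"
      status=1
    fi
  done
  return $status
}

# Example: sample 1 MiB blocks at powers of 5, similar to the log-5 run.
# compare_blocks /hd/beta/blazegraph/data.jnl /hd/delta/blazegraph/data.jnl 1048576 1 5 25 125 625
```

Sampling only a few blocks is fast but proves little about the rest of the file; a full check still needs every block or a whole-file hash.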

at 100% progress

 tail  -15 screenlog.0 
[#152de7 1,296GiB/1,296GiB(99%) CN:7 DL:12MiB ETA:6s]
FILE: /hd/delta/blazegraph/data.jnl
-------------------------------------------------------------------------------

[#152de7 1,296GiB/1,296GiB(99%) CN:1 DL:1.2MiB ETA:1s]                         
05/05 08:20:15 [NOTICE] Download complete: /hd/delta/blazegraph/data.jnl

Download Results:
gid   |stat|avg speed  |path/URI
======+====+===========+=======================================================
152de7|OK  |    19MiB/s|/hd/delta/blazegraph/data.jnl

Status Legend:
(OK):download completed.
[INFO] Download finished.
tail  -15 screenlog.0 
[#50fb05 1,296GiB/1,296GiB(99%) CN:13 DL:19MiB ETA:6s]
FILE: /hd/beta/blazegraph/data.jnl
-------------------------------------------------------------------------------

[#50fb05 1,296GiB/1,296GiB(99%) CN:1 DL:1.5MiB]                                
05/05 08:15:19 [NOTICE] Download complete: /hd/beta/blazegraph/data.jnl

Download Results:
gid   |stat|avg speed  |path/URI
======+====+===========+=======================================================
50fb05|OK  |    18MiB/s|/hd/beta/blazegraph/data.jnl

Status Legend:
(OK):download completed.
[INFO] Download finished.

fix aria issues using blockdownload

see https://github.com/WolfgangFahl/blockdownload

check the aria downloaded journal file hashes - heads only

dcheck \
  --head-only \
  --url https://datasets.orbopengraph.com/blazegraph/data.jnl \
  /hd/beta/blazegraph/data.jnl \
  /hd/delta/blazegraph/data.jnl
Processing /hd/beta/blazegraph/data.jnl... (file size: 1296.92 GB)
Block 2656/2656 1296.88-1296.92 GB: : 1.39TB [00:00, 4.88TB/s]                                   
/hd/beta/blazegraph/data.jnl.yaml created with 2657 blocks (1328045.19 MB processed)
Processing /hd/delta/blazegraph/data.jnl... (file size: 1296.92 GB)
Block 2656/2656 1296.88-1296.92 GB: : 1.39TB [00:00, 4.77TB/s]                                   
/hd/delta/blazegraph/data.jnl.yaml created with 2657 blocks (1328045.19 MB processed)
265700⚠️: : 1.39TB [00:00, 6.80TB/s]                                                       ]

Final: 265700⚠️

check with full md5 500 MB block checks

dcheck \
  --url https://datasets.orbopengraph.com/blazegraph/data.jnl \
  /hd/beta/blazegraph/data.jnl \
  /hd/delta/blazegraph/data.jnl
Processing /hd/beta/blazegraph/data.jnl... (file size: 1296.92 GB)
Processing data.jnl:  62%|██████████████████████▏             | 857G/1.39T [26:34<16:24, 544MB/s]
grep block blazegraph.yaml  | wc -l
2652

check parts

md5sum blazegraph-2656.part 
493923964d5840438d9d06e560eaf15b  blazegraph-2656.part
wf@wikidata:/hd/eneco/blazegraph/blazegraph-2025-05-06$ grep 493923964d5840438d9d06e560eaf15b blazegraph.yaml -B2
  path: blazegraph-2656.part
  offset: 1392508928000
  md5: 493923964d5840438d9d06e560eaf15b

check missing blocks

grep block blazegraph.yaml | cut -f2 -d: | sort -un | awk 'NR==1{prev=$1; next} {for(i=prev+1;i<$1;i++) print i; prev=$1}'
2
4
7
30
131
209
642
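The offset recorded for `blazegraph-2656.part` is consistent with fixed 500 MiB blocks: 2656 × 524288000 = 1392508928000. Under that assumption (inferred from the numbers on this page, not taken from the blockdownload documentation), the byte range of each missing block can be computed as sketched below. `block_range` is a hypothetical helper, and the file size is the one shown by `ls -l` for the reassembled journal further down this page.

```shell
#!/bin/bash
# Sketch only: byte range of block N for fixed-size blocks.
# Assumption: 500 MiB blocks, inferred from the offset shown for block 2656.
BLOCKSIZE=$((500 * 1024 * 1024))

# Print "first-last" byte positions (inclusive) of a block,
# clipped to the file size so that the last block may be shorter.
block_range() {
  local index=$1 filesize=$2
  local first=$((index * BLOCKSIZE))
  local last=$((first + BLOCKSIZE - 1))
  if [ "$last" -ge "$filesize" ]; then
    last=$((filesize - 1))
  fi
  echo "${first}-${last}"
}

# File size in bytes as listed by ls -l for data.jnl.
FILESIZE=1392556310528
# The missing block indices listed above, plus the last block.
for index in 2 4 7 30 131 209 642 2656; do
  echo "block $index: bytes=$(block_range "$index" "$FILESIZE")"
done
```

A range in this form could, for example, be passed to `curl --range` to fetch a single block again, provided the server supports range requests.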

reassemble

first try

blockdownload --yaml blazegraph.yaml --output data.jnl --progress --name wikidata https://datasets.orbopengraph.com/blazegraph/data.jnl .
Creating target: 100%|███████████████████▉| 1.39T/1.39T [1:34:08<00:11, 310MB/s]created data.jnl - 1324545.19 MB
md5: 2106c6ae22ff4425b35834ad7bb65b07
File reassembled successfully: data.jnl
Creating target: 100%|███████████████████▉| 1.39T/1.39T [1:34:09<00:14, 246MB/s]

The md5sum is wrong, as expected with blocks missing.

attempt 2025-05-23

cat bdy
# reassemble 
DATE="2025-05-23"

blockdownload https://datasets.orbopengraph.com/blazegraph/data.jnl \
  blazegraph-$DATE \
  --name blazegraph \
  --progress \
  --yaml blazegraph-$DATE/data.jnl.yaml \
  --output /hd/mantax/blazegraph/data.jnl
./bdy
Creating target: 100%|██████████| 1.39T/1.39T [1:00:24<00:00, 384MB/s]
created /hd/mantax/blazegraph/data.jnl - 1328045.19 MB
md5: 1f6a822d9015aa3356714d304eeb8471
File reassembled successfully: /hd/mantax/blazegraph/data.jnl
wf@wikidata:/hd/eneco/blazegraph/blazegraph-2025-05-23$ head -1 md5sums.txt 
1f6a822d9015aa3356714d304eeb8471  data.jnl
ls -l /hd/mantax/blazegraph/data.jnl 
-rw-rw-r-- 1 wf wf 1392556310528 May 24 13:39 data.jnl
md5sum /hd/mantax/blazegraph/data.jnl 
1f6a822d9015aa3356714d304eeb8471 data.jnl
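The comparison above is done by eye against the first line of md5sums.txt. A small helper can make the same check scriptable; this is a sketch only, and `verify_md5` is a hypothetical name rather than part of blockdownload.

```shell
#!/bin/bash
# Sketch only: compare a file's MD5 with an expected value.
# verify_md5 is a hypothetical helper name.
verify_md5() {
  local file=$1 expected=$2
  local actual
  actual=$(md5sum "$file" | cut -d' ' -f1)
  if [ "$actual" = "$expected" ]; then
    echo "OK $file"
  else
    echo "FAILED $file: expected $expected, got $actual" >&2
    return 1
  fi
}

# Example with the values from this page:
# verify_md5 /hd/mantax/blazegraph/data.jnl 1f6a822d9015aa3356714d304eeb8471
```

The non-zero return code on mismatch lets a calling script stop before the journal is handed to Blazegraph.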

follow-up 2025-05-31

All the following was done in tmux session 0 under the `jh` user.

With the MD5 checksum verified, start the stack with Docker Compose and open a bash shell in the `wdqs` container (the one running Blazegraph).

docker compose up --build -d

docker compose exec wdqs bash

Inside the container, run this command to catch the database up to the present day:

while true; do /wdqs/runUpdate.sh; sleep 10; done

This creates an infinite loop that restarts the update script whenever it exits, for example after a crash. In most cases a crash can be overcome simply by restarting the script. The `sleep 10` keeps restart attempts from flooding the terminal.

As of 05:05, 31 May 2025 (CEST) the data.jnl on /hd/delta is updated through the end of 2025-05-15. Wait a few days for it to catch up to present.

Next steps:

  1. Verify frontend can connect to query service
  2. Modify the retry loop one-liner to include logging of error events
  3. Formalize the one-liner into a script or Docker Compose configuration
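Steps 2 and 3 could start from something like the following sketch, which wraps the one-liner in a function and appends a timestamped line with the exit code after every run. The function name `run_with_restart`, its parameters and the log path in the example are hypothetical; only `/wdqs/runUpdate.sh` comes from the setup described above.

```shell
#!/bin/bash
# Sketch only: restart loop with timestamped logging of each exit.
# run_with_restart and its parameters are hypothetical names.

# Usage: run_with_restart LOGFILE DELAY MAX_RUNS COMMAND [ARGS...]
# MAX_RUNS=0 means loop without limit, like the original one-liner.
run_with_restart() {
  local logfile=$1 delay=$2 max_runs=$3
  shift 3
  local runs=0 code=0
  while true; do
    "$@"
    code=$?
    runs=$((runs + 1))
    # One log line per run: UTC timestamp, run counter, exit code, command.
    echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) run=$runs exit=$code cmd=$*" >> "$logfile"
    if [ "$max_runs" -gt 0 ] && [ "$runs" -ge "$max_runs" ]; then
      return "$code"
    fi
    sleep "$delay"
  done
}

# Inside the wdqs container this would replace the one-liner:
# run_with_restart /tmp/runUpdate-restart.log 10 0 /wdqs/runUpdate.sh
```

The optional run limit exists mainly so the loop can be tried out with a harmless command before it is pointed at the updater.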

followup 2025-06-03

  • created apache config wdqs.conf with proxy to 8099
  • started frontend
root@wikidata:/hd/delta/blazegraph/private-wikidata-query# docker compose ps
WARN[0000] /hd/delta/blazegraph/private-wikidata-query/docker-compose.yml: the attribute `version` is obsolete, it will be ignored, please remove it to avoid potential confusion 
NAME                                  IMAGE     COMMAND                  SERVICE      CREATED       STATUS        PORTS
private-wikidata-query-wdqs-proxy-1   caddy     "caddy run --config …"   wdqs-proxy   4 weeks ago   Up 11 hours   80/tcp, 443/tcp, 2019/tcp, 443/udp
root@wikidata:/hd/delta/blazegraph/private-wikidata-query# docker compose up -d
WARN[0000] /hd/delta/blazegraph/private-wikidata-query/docker-compose.yml: the attribute `version` is obsolete, it will be ignored, please remove it to avoid potential confusion 
[+] Running 3/3
 ✔ Container private-wikidata-query-wdqs-1           Started                               0.3s 
 ✔ Container private-wikidata-query-wdqs-proxy-1     Running                               0.0s 
 ✔ Container private-wikidata-query-wdqs-frontend-1  Started                               0.7s 
root@wikidata:/hd/delta/blazegraph/private-wikidata-query# curl http://localhost:8099
<!DOCTYPE html><html lang="en" dir="ltr"><head><meta charset="utf-8"

followup 2025-06-04

see also https://etherpad.wikimedia.org/p/Search_Platform_Office_Hours

docker exec -it private-wikidata-query-wdqs-1 /bin/bash
while true; do /wdqs/runUpdate.sh; sleep 10; done&
15:33:59.522 [main] INFO  org.wikidata.query.rdf.tool.Updater - Polled up to 2025-06-01T21:10:22Z (next: 20250601211022|2427659504) at (10.6, 6.5, 2.9) updates per second and (3412.9, 2094.7, 941.5) milliseconds per second
15:33:59.770 [main] INFO  o.w.q.r.t.change.RecentChangesPoller - Got 81 changes, from Q134675408@2355787733@20250601211022|2427659504 to Q127768397@2355787831@20250601211035|2427659605
15:34:08.856 [main] INFO  org.wikidata.query.rdf.tool.Updater - Polled up to 2025-06-01T21:10:35Z (next: 20250601211035|2427659606) at (10.0, 6.5, 3.0) updates per second and (3095.0, 2071.5, 946.5) milliseconds per second
15:34:09.096 [main] INFO  o.w.q.r.t.change.RecentChangesPoller - Got 76 changes, from Q134675444@2355787832@20250601211035|2427659606 to Q134642638@2355787935@20250601211052|2427659706
15:34:13.024 [main] INFO  org.wikidata.query.rdf.tool.Updater - Polled up to 2025-06-01T21:10:52Z (next: 20250601211052|2427659707) at (10.3, 6.6, 3.0) updates per second and (3055.4, 2080.3, 955.7) milliseconds per second
15:34:13.268 [main] INFO  o.w.q.r.t.change.RecentChangesPoller - Got 78 changes, from Q134675474@2355787931@20250601211052|2427659707 to Q95527928@2355788033@20250601211106|2427659806

update check query

# check updated state of copy of wikidata
#
# see https://etherpad.wikimedia.org/p/Search_Platform_Office_Hours
# and https://wiki.bitplan.com/index.php/Wikidata_Import_2025-05-02#followup_2025-06-04
PREFIX schema: <http://schema.org/>

# This query returns:
# - the total number of triples in the dataset
# - the dateModified of the <http://www.wikidata.org> entity, if available
SELECT * WHERE {
  {
    # Subquery: count all triples in the dataset
    SELECT (COUNT(*) AS ?count) {
      ?s ?p ?o
    }
  }
  UNION
  {
    # Subquery: get the schema:dateModified for the Wikidata root URI
    SELECT * WHERE {
      <http://www.wikidata.org> schema:dateModified ?y
    }
  }
}
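
The query can also be sent from the command line. The sketch below assumes the Blazegraph SPARQL endpoint is reachable on port 9999 (the port published by the wdqs container) under the namespace path commonly used by WDQS; that path is not confirmed on this page and may need adjusting. `write_query` and `run_query` are hypothetical helper names.

```shell
#!/bin/bash
# Sketch only: send the update check query to a local Blazegraph endpoint.
# Assumption: the endpoint URL below; override it via the ENDPOINT variable.
ENDPOINT="${ENDPOINT:-http://localhost:9999/bigdata/namespace/wdq/sparql}"

# Write the update check query to the given file.
write_query() {
  cat > "$1" <<'EOF'
PREFIX schema: <http://schema.org/>
SELECT * WHERE {
  { SELECT (COUNT(*) AS ?count) { ?s ?p ?o } }
  UNION
  { SELECT * WHERE { <http://www.wikidata.org> schema:dateModified ?y } }
}
EOF
}

# POST the query file as the form field "query" and print the result as CSV.
run_query() {
  curl --silent --fail \
    --header 'Accept: text/csv' \
    --data-urlencode "query@$1" \
    "$ENDPOINT"
}

# Example:
# write_query /tmp/update_check.rq && run_query /tmp/update_check.rq
```

Comparing the returned `dateModified` value with the current time gives a rough measure of how far the copy lags behind Wikidata.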