Difference between revisions of "Wikidata Import 2025-05-02"
Revision as of 16:40, 4 June 2025
Import
| Import | |
|---|---|
| state | |
| url | https://wiki.bitplan.com/index.php/Wikidata_Import_2025-05-02 | 
| target | blazegraph | 
| start | 2025-05-02 | 
| end | 2025-05-03 | 
| days | 0.6 | 
| os | Ubuntu 22.04.3 LTS | 
| cpu | Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz (16 cores) | 
| ram | 512 | 
| triples | |
| comment | |
This "import" does not use the dump-and-index approach; instead it directly copies a Blazegraph journal file.
Steps
Copy journal file
Source: the https://scatter.red/ Wikidata installation. Using aria2c with 16 connections, the copy initially took some 5 hours but was interrupted. Since aria2c was used in preallocation mode and the script's final message was "download finished", the file looked complete, which it was not.
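Because preallocation creates the file at its full size before the data has arrived, neither the file size nor the script's final message proves that the copy is complete. A minimal sketch of a checksum comparison that would catch such a case is shown here; `verify_md5` is a hypothetical helper name, not part of the original scripts, and the expected value has to be obtained from the source host.

```shell
#!/bin/bash
# verify_md5 FILE EXPECTED_MD5
# Hypothetical helper: returns 0 when the MD5 of FILE equals EXPECTED_MD5,
# 1 otherwise (including when FILE cannot be read).
verify_md5() {
  local file="$1" expected="$2" actual
  actual=$(md5sum "$file" | cut -d' ' -f1)
  if [ "$actual" = "$expected" ]; then
    echo "OK: $file"
  else
    echo "MISMATCH: $file has '$actual', expected '$expected'" >&2
    return 1
  fi
}

# Usage (the expected value is the md5 reported on the source):
#   verify_md5 /hd/delta/blazegraph/data.jnl <md5-from-source>
```

On a file of this size a full md5sum run takes a long time, so it is worth running only once the download tool reports completion.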
git clone the priv-wd-query
git clone https://github.com/scatter-llc/private-wikidata-query
mkdir data
mv data.jnl private-wikidata-query/data
cd private-wikidata-query/data
# use proper uid and gid as per the containers preferences
chown 666:66 data.jnl
jh@wikidata:/hd/delta/blazegraph/private-wikidata-query/data$ ls -l
total 346081076
-rw-rw-r-- 1 666 66 1328514809856 May  2 22:07 data.jnl
start docker
docker compose up -d
WARN[0000] /hd/delta/blazegraph/private-wikidata-query/docker-compose.yml: the attribute `version` is obsolete, it will be ignored, please remove it to avoid potential confusion 
[+] Running 3/3
 ✔ Container private-wikidata-query-wdqs-1           Started               0.4s 
 ✔ Container private-wikidata-query-wdqs-proxy-1     Started               0.7s 
 ✔ Container private-wikidata-query-wdqs-frontend-1  Started               1.1s
docker ps | grep wdqs
36dad88ebfdc   wikibase/wdqs-frontend:wmde.11      "/entrypoint.sh ngin…"   About an hour ago   Up 3 minutes                    0.0.0.0:8099->80/tcp, [::]:8099->80/tcp                           private-wikidata-query-wdqs-frontend-1
f0d273cca376   caddy                               "caddy run --config …"   About an hour ago   Up 3 minutes                    80/tcp, 443/tcp, 2019/tcp, 443/udp                                private-wikidata-query-wdqs-proxy-1
d86124984e0f   wikibase/wdqs:0.3.97-wmde.8         "/entrypoint.sh /run…"   About an hour ago   Up 3 minutes                    0.0.0.0:9999->9999/tcp, [::]:9999->9999/tcp                       private-wikidata-query-wdqs-1
6011f5c1cc03   caddy                               "caddy run --config …"   12 months ago       Up 3 days                       80/tcp, 443/tcp, 2019/tcp, 443/udp                                wdqs-wdqs-proxy-1
Incompatible RWStore header version
docker logs private-wikidata-query-wdqs-1 2>&1 | grep -m 1 "Incompatible RWStore header version"
java.lang.RuntimeException: java.lang.IllegalStateException: Incompatible RWStore header version: storeVersion=0, cVersion=1024, demispace: true
docker exec -it private-wikidata-query-wdqs-1 /bin/bash
diff RWStore.properties RWStore.properties.bak-20250503 
--- RWStore.properties
+++ RWStore.properties.bak-20250503
@@ -56,6 +56,3 @@
    {"valueType":"DOUBLE","multiplier":"1000000000","serviceMapping":"LATITUDE"},\
    {"valueType":"LONG","multiplier":"1","minValue":"0","serviceMapping":"COORD_SYSTEM"}\
   ]}}
-
-# Added to fix Incompatible RWStore header version error
-com.bigdata.rwstore.RWStore.readBlobsAsync=false
docker compose restart wdqs
WARN[0000] /hd/delta/blazegraph/private-wikidata-query/docker-compose.yml: the attribute `version` is obsolete, it will be ignored, please remove it to avoid potential confusion 
[+] Restarting 1/1
 ✔ Container private-wikidata-query-wdqs-1  Started                      11.1s
Download multiple copies in parallel
The script below was named get and was used to download two copies in parallel, one to /hd/delta and one to /hd/beta, totalling 32 connections. The listing shows the DISK=beta variant.
#!/bin/bash
# Corrected aria2 download script (no prealloc, true progress)
ISO_DATE=$(date +%Y%m%d)
DISK=beta
SESSION="bg_${DISK}_${ISO_DATE}"
FILENAME="data.jnl"
DOWNLOAD_DIR="/hd/$DISK/blazegraph"
URL="https://datasets.orbopengraph.com/blazegraph/data.jnl"
CONNECTIONS=16
# Kill stale session
if screen -list | grep -q "$SESSION"; then
  echo "⚠️ Killing existing session '$SESSION'..."
  screen -S "$SESSION" -X quit
fi
# Launch corrected aria2 download
screen -dmS "$SESSION" -L bash -c "
  echo '[INFO] Starting corrected download...';
  cd \"$DOWNLOAD_DIR\";
  aria2c -c -x $CONNECTIONS -s $CONNECTIONS \
    --file-allocation=none \
    --auto-file-renaming=false \
    --dir=\"$DOWNLOAD_DIR\" \
    --out=\"$FILENAME\" \
    \"$URL\";
  EXIT_CODE=\$?;
  if [ \$EXIT_CODE -eq 0 ]; then
    echo '[INFO] Download finished.';
    md5sum \"$FILENAME\" > \"$FILENAME.md5\";
    echo '[INFO] MD5 saved to $FILENAME.md5';
  else
    echo \"[ERROR] Download failed with code \$EXIT_CODE\";
  fi
"
echo "✅ Fixed download started in screen '$SESSION'."
echo "Monitor with: screen -r $SESSION"
md5 check
on source:
md5sum data.jnl
e891800af42b979159191487910bd9ae  data.jnl
check script
at 95% progress
tail -f screenlog.0
FILE: /hd/delta/blazegraph/data.jnl
-------------------------------------------------------------------------------
 *** Download Progress Summary as of Mon May  5 07:36:28 2025 ***              
===============================================================================
[#152de7 1,241GiB/1,296GiB(95%) CN:16 DL:16MiB ETA:55m32s]
FILE: /hd/delta/blazegraph/data.jnl
-------------------------------------------------------------------------------
[#152de7 1,242GiB/1,296GiB(95%) CN:16 DL:21MiB ETA:43m39s]
./check 
Comparing:
  file1 → /hd/beta/blazegraph/data.jnl
  file2 → /hd/delta/blazegraph/data.jnl
  blocksize=1MB start=0MB
== log-5 ==
[  0]       1 MB  ✅  MD5 match
[  1]       5 MB  ✅  MD5 match
[  2]      25 MB  ✅  MD5 match
[  3]     125 MB  ✅  MD5 match
[  4]     625 MB  ✅  MD5 match
[  5]   3,125 MB  ✅  MD5 match
[  6]  15,625 MB  ✅  MD5 match
[  7]  78,125 MB  ✅  MD5 match
[  8] 390,625 MB  ✅  MD5 match
Summary: {'✅': 9}
== log-2 ==
100%|██████████████████████████████████████████| 21/21 [00:00<00:00, 212.25it/s]
Summary: {'✅': 21}
== linear-2000 ==
100%|████████████████████████████████████████| 663/663 [00:05<00:00, 116.57it/s]
Summary: {'✅': 607, '⚠️': 56}
== linear-500 ==
100%|██████████████████████████████████████| 2651/2651 [00:21<00:00, 122.85it/s]
Summary: {'✅': 2435, '⚠️': 216}
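The source of the check script is not shown here. The following is only a rough sketch of the idea behind the log-5 pass above: compare the MD5 of single 1 MiB blocks at offsets 1, 5, 25, ... MiB in both files. The function names, the use of MiB rather than MB, and the reliance on GNU `stat -c` and `dd bs=1M` are assumptions.

```shell
#!/bin/bash
# block_md5 FILE OFFSET_MIB
# MD5 of the single 1 MiB block that starts OFFSET_MIB MiB into FILE.
block_md5() {
  dd if="$1" bs=1M skip="$2" count=1 2>/dev/null | md5sum | cut -d' ' -f1
}

# compare_sampled FILE1 FILE2 [BASE]
# Compares 1 MiB blocks at offsets 1, BASE, BASE^2, ... MiB (BASE defaults
# to 5) for as long as the offset lies inside FILE1.
# Prints one line per sampled block; returns 1 if any sampled block differs.
compare_sampled() {
  local f1="$1" f2="$2" base="${3:-5}"
  local size_mib offset=1 rc=0
  size_mib=$(( $(stat -c %s "$f1") / 1048576 ))
  while [ "$offset" -lt "$size_mib" ]; do
    if [ "$(block_md5 "$f1" "$offset")" = "$(block_md5 "$f2" "$offset")" ]; then
      printf '%10d MiB  match\n' "$offset"
    else
      printf '%10d MiB  DIFF\n' "$offset"
      rc=1
    fi
    offset=$(( offset * base ))
  done
  return "$rc"
}

# Usage:
#   compare_sampled /hd/beta/blazegraph/data.jnl /hd/delta/blazegraph/data.jnl 5
```

A sampled comparison like this is fast but only probabilistic: matching samples do not prove that the files are identical, which is why the linear passes and the full block checks below are still needed.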
at 100 %
 tail  -15 screenlog.0 
[#152de7 1,296GiB/1,296GiB(99%) CN:7 DL:12MiB ETA:6s]
FILE: /hd/delta/blazegraph/data.jnl
-------------------------------------------------------------------------------
[#152de7 1,296GiB/1,296GiB(99%) CN:1 DL:1.2MiB ETA:1s]                         
05/05 08:20:15 [NOTICE] Download complete: /hd/delta/blazegraph/data.jnl
Download Results:
gid   |stat|avg speed  |path/URI
======+====+===========+=======================================================
152de7|OK  |    19MiB/s|/hd/delta/blazegraph/data.jnl
Status Legend:
(OK):download completed.
[INFO] Download finished.
tail  -15 screenlog.0 
[#50fb05 1,296GiB/1,296GiB(99%) CN:13 DL:19MiB ETA:6s]
FILE: /hd/beta/blazegraph/data.jnl
-------------------------------------------------------------------------------
[#50fb05 1,296GiB/1,296GiB(99%) CN:1 DL:1.5MiB]                                
05/05 08:15:19 [NOTICE] Download complete: /hd/beta/blazegraph/data.jnl
Download Results:
gid   |stat|avg speed  |path/URI
======+====+===========+=======================================================
50fb05|OK  |    18MiB/s|/hd/beta/blazegraph/data.jnl
Status Legend:
(OK):download completed.
[INFO] Download finished.
fix aria issues using blockdownload
see https://github.com/WolfgangFahl/blockdownload
check the aria downloaded journal file hashes - heads only
dcheck \
  --head-only \
  --url https://datasets.orbopengraph.com/blazegraph/data.jnl \
  /hd/beta/blazegraph/data.jnl \
  /hd/delta/blazegraph/data.jnl
Processing /hd/beta/blazegraph/data.jnl... (file size: 1296.92 GB)
Block 2656/2656 1296.88-1296.92 GB: : 1.39TB [00:00, 4.88TB/s]                                   
/hd/beta/blazegraph/data.jnl.yaml created with 2657 blocks (1328045.19 MB processed)
Processing /hd/delta/blazegraph/data.jnl... (file size: 1296.92 GB)
Block 2656/2656 1296.88-1296.92 GB: : 1.39TB [00:00, 4.77TB/s]                                   
/hd/delta/blazegraph/data.jnl.yaml created with 2657 blocks (1328045.19 MB processed)
2657✅ 0❌ 0⚠️: : 1.39TB [00:00, 6.80TB/s]                                                       ]
Final: 2657✅ 0❌ 0⚠️
check with full md5 500 MB block checks
dcheck \
  --url https://datasets.orbopengraph.com/blazegraph/data.jnl \
  /hd/beta/blazegraph/data.jnl \
  /hd/delta/blazegraph/data.jnl
Processing /hd/beta/blazegraph/data.jnl... (file size: 1296.92 GB)
Processing data.jnl:  62%|██████████████████████▏             | 857G/1.39T [26:34<16:24, 544MB/s]
grep block blazegraph.yaml  | wc -l
2652
check parts
md5sum blazegraph-2656.part 
493923964d5840438d9d06e560eaf15b  blazegraph-2656.part
wf@wikidata:/hd/eneco/blazegraph/blazegraph-2025-05-06$ grep 493923964d5840438d9d06e560eaf15b blazegraph.yaml -B2
  path: blazegraph-2656.part
  offset: 1392508928000
  md5: 493923964d5840438d9d06e560eaf15b
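The offset recorded for part 2656 is consistent with blocks of 500 MiB, and so is the count of 2657 blocks for the final file size of 1392556310528 bytes. A small arithmetic sketch, assuming 500 MiB blocks numbered from 0 (the function names are made up for illustration):

```shell
#!/bin/bash
# Assumed block size: 500 MiB, block indices start at 0.
BLOCK_SIZE=$((500 * 1024 * 1024))

# block_offset INDEX: byte offset at which block INDEX starts.
block_offset() {
  echo $(( $1 * BLOCK_SIZE ))
}

# block_count FILE_SIZE_BYTES: number of blocks needed to cover the file;
# the last block may be shorter than BLOCK_SIZE.
block_count() {
  echo $(( ($1 + BLOCK_SIZE - 1) / BLOCK_SIZE ))
}

block_offset 2656           # prints 1392508928000
block_count 1392556310528   # prints 2657
```

With 2657 expected blocks (0 to 2656), a yaml file listing only 2652 blocks is short by several entries, which the gap check below makes visible.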
check missing blocks
grep block blazegraph.yaml | cut -f2 -d: | sort -un | awk 'NR==1{prev=$1; next} {for(i=prev+1;i<$1;i++) print i; prev=$1}'
2
4
7
30
131
209
642
reassemble
first try
blockdownload --yaml blazegraph.yaml --output data.jnl --progress --name wikidata https://datasets.orbopengraph.com/blazegraph/data.jnl .
Creating target: 100%|███████████████████▉| 1.39T/1.39T [1:34:08<00:11, 310MB/s]created data.jnl - 1324545.19 MB
md5: 2106c6ae22ff4425b35834ad7bb65b07
File reassembled successfully: data.jnl
Creating target: 100%|███████████████████▉| 1.39T/1.39T [1:34:09<00:14, 246MB/s]
The md5sum is wrong, as expected with missing blocks.
attempt 2025-05-23
cat bdy
# reassemble 
DATE="2025-05-23"
blockdownload https://datasets.orbopengraph.com/blazegraph/data.jnl \
  blazegraph-$DATE \
  --name blazegraph \
  --progress \
  --yaml blazegraph-$DATE/data.jnl.yaml \
  --output /hd/mantax/blazegraph/data.jnl
./bdy
Creating target: 100%|██████████| 1.39T/1.39T [1:00:24<00:00, 384MB/s]
created /hd/mantax/blazegraph/data.jnl - 1328045.19 MB
md5: 1f6a822d9015aa3356714d304eeb8471
File reassembled successfully: /hd/mantax/blazegraph/data.jnl
wf@wikidata:/hd/eneco/blazegraph/blazegraph-2025-05-23$ head -1 md5sums.txt 
1f6a822d9015aa3356714d304eeb8471  data.jnl
ls -l /hd/mantax/blazegraph/data.jnl 
-rw-rw-r-- 1 wf wf 1392556310528 May 24 13:39 data.jnl
md5sum /hd/mantax/blazegraph/data.jnl 
1f6a822d9015aa3356714d304eeb8471 data.jnl
follow-up 2025-05-31
All the following was done in tmux session 0 under the `jh` user.
With the MD5 checksum verified, start the services with Docker Compose and open a bash shell in the `wdqs` container (the one running Blazegraph).
docker compose up --build -d
docker compose exec wdqs bash
Inside the container, run this command to catch the database up to the present day:
while true; do /wdqs/runUpdate.sh; sleep 10; done
This infinite loop restarts the update script whenever it exits. In most cases a crash can be overcome by simply restarting the script. The `sleep 10` keeps a persistently failing script from flooding the terminal with restart attempts.
As of 05:05, 31 May 2025 (CEST) the data.jnl on /hd/delta is updated through the end of 2025-05-15. Wait a few days for it to catch up to present.
Next steps:
- Verify frontend can connect to query service
- Modify the retry loop one-liner to include logging of error events
- Formalize the one-liner into a script or Docker Compose configuration
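One possible shape for the second and third next steps is sketched here: the retry loop as a function that appends the exit code of every run to a log file. This is a sketch under assumptions, not the script that was eventually used; the function name, the log path and the optional run limit are all hypothetical.

```shell
#!/bin/bash
# retry_loop LOGFILE DELAY MAX_RUNS CMD [ARGS...]
# Runs CMD repeatedly and appends one line per run (timestamp, run number,
# exit code) to LOGFILE. MAX_RUNS=0 means "loop until interrupted".
# Returns the exit code of the last run when MAX_RUNS is reached.
retry_loop() {
  local logfile="$1" delay="$2" max_runs="$3"
  shift 3
  local runs=0 rc
  while true; do
    "$@"
    rc=$?
    runs=$((runs + 1))
    echo "$(date '+%Y-%m-%dT%H:%M:%S%z') run=$runs exit=$rc cmd=$*" >> "$logfile"
    if [ "$max_runs" -gt 0 ] && [ "$runs" -ge "$max_runs" ]; then
      return "$rc"
    fi
    sleep "$delay"
  done
}

# Usage inside the wdqs container (the log path is an assumption):
#   retry_loop /tmp/runUpdate-retries.log 10 0 /wdqs/runUpdate.sh
```

The log then shows how often the updater was restarted and with which exit codes, without changing the behaviour of the original one-liner.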
followup 2025-06-03
- created apache config wdqs.conf with proxy to 8099
- started frontend
root@wikidata:/hd/delta/blazegraph/private-wikidata-query# docker compose ps
WARN[0000] /hd/delta/blazegraph/private-wikidata-query/docker-compose.yml: the attribute `version` is obsolete, it will be ignored, please remove it to avoid potential confusion 
NAME                                  IMAGE     COMMAND                  SERVICE      CREATED       STATUS        PORTS
private-wikidata-query-wdqs-proxy-1   caddy     "caddy run --config …"   wdqs-proxy   4 weeks ago   Up 11 hours   80/tcp, 443/tcp, 2019/tcp, 443/udp
root@wikidata:/hd/delta/blazegraph/private-wikidata-query# docker compose up -d
WARN[0000] /hd/delta/blazegraph/private-wikidata-query/docker-compose.yml: the attribute `version` is obsolete, it will be ignored, please remove it to avoid potential confusion 
[+] Running 3/3
 ✔ Container private-wikidata-query-wdqs-1           Started                               0.3s 
 ✔ Container private-wikidata-query-wdqs-proxy-1     Running                               0.0s 
 ✔ Container private-wikidata-query-wdqs-frontend-1  Started                               0.7s 
root@wikidata:/hd/delta/blazegraph/private-wikidata-query# curl http://localhost:8099
<!DOCTYPE html><html lang="en" dir="ltr"><head><meta charset="utf-8"
followup 2025-06-04
see also https://etherpad.wikimedia.org/p/Search_Platform_Office_Hours
while true; do /wdqs/runUpdate.sh; sleep 10; done
15:33:59.522 [main] INFO org.wikidata.query.rdf.tool.Updater - Polled up to 2025-06-01T21:10:22Z (next: 20250601211022|2427659504) at (10.6, 6.5, 2.9) updates per second and (3412.9, 2094.7, 941.5) milliseconds per second
15:33:59.770 [main] INFO o.w.q.r.t.change.RecentChangesPoller - Got 81 changes, from Q134675408@2355787733@20250601211022|2427659504 to Q127768397@2355787831@20250601211035|2427659605
15:34:08.856 [main] INFO org.wikidata.query.rdf.tool.Updater - Polled up to 2025-06-01T21:10:35Z (next: 20250601211035|2427659606) at (10.0, 6.5, 3.0) updates per second and (3095.0, 2071.5, 946.5) milliseconds per second
15:34:09.096 [main] INFO o.w.q.r.t.change.RecentChangesPoller - Got 76 changes, from Q134675444@2355787832@20250601211035|2427659606 to Q134642638@2355787935@20250601211052|2427659706
15:34:13.024 [main] INFO org.wikidata.query.rdf.tool.Updater - Polled up to 2025-06-01T21:10:52Z (next: 20250601211052|2427659707) at (10.3, 6.6, 3.0) updates per second and (3055.4, 2080.3, 955.7) milliseconds per second
15:34:13.268 [main] INFO o.w.q.r.t.change.RecentChangesPoller - Got 78 changes, from Q134675474@2355787931@20250601211052|2427659707 to Q95527928@2355788033@20250601211106|2427659806
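The "Polled up to" timestamps in the updater output show how far the copy still lags behind. A small sketch for extracting the most recent one and turning it into a lag in seconds follows; it assumes GNU `date` and that the updater output has been captured to a file or pipe, and the function names are hypothetical.

```shell
#!/bin/bash
# last_polled: reads updater output on stdin and prints the timestamp of
# the last "Polled up to ..." message.
last_polled() {
  grep -o 'Polled up to [0-9TZ:-]*' | tail -n 1 | cut -d' ' -f4
}

# lag_seconds TIMESTAMP [NOW]
# Seconds between TIMESTAMP and NOW (default: the current time).
# Relies on GNU date being able to parse ISO 8601 timestamps.
lag_seconds() {
  local polled now
  polled=$(date -u -d "$1" +%s)
  now=$(date -u -d "${2:-now}" +%s)
  echo $(( now - polled ))
}

# Usage (update.log is a hypothetical capture of the updater output):
#   lag_seconds "$(last_polled < update.log)"
```

In the output above the updater has reached the evening of 2025-06-01 (UTC) while running on 2025-06-04, so the copy is still a few days behind.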
update check query: https://tinyurl.com/2yelk8xf
# check updated state of copy of wikidata
#
# see https://etherpad.wikimedia.org/p/Search_Platform_Office_Hours
# and https://wiki.bitplan.com/index.php/Wikidata_Import_2025-05-02#followup_2025-06-04
PREFIX schema: <http://schema.org/>
# This query returns:
# - the total number of triples in the dataset
# - the dateModified of the <http://www.wikidata.org> entity, if available
SELECT * WHERE {
  {
    # Subquery: count all triples in the dataset
    SELECT (COUNT(*) AS ?count) {
      ?s ?p ?o
    }
  }
  UNION
  {
    # Subquery: get the schema:dateModified for the Wikidata root URI
    SELECT * WHERE {
      <http://www.wikidata.org> schema:dateModified ?y
    }
  }
}