Wikidata Import 2025-05-02
Revision as of 05:07, 31 May 2025
Import
state |
url | https://wiki.bitplan.com/index.php/Wikidata_Import_2025-05-02
target | blazegraph
start | 2025-05-02
end | 2025-05-03
days | 0.6
os | Ubuntu 22.04.3 LTS
cpu | Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz (16 cores)
ram | 512 GB
triples |
comment |
This "import" is not using a dump and indexing approach but directly copying a blazegraph journal file.
Steps
Copy journal file
Source: the https://scatter.red/ Wikidata installation. Using aria2c with 16 connections, the copy initially took some 5 hours but was interrupted. Since aria2c ran in preallocation mode and the script's final message was "download finished", the file looked complete, which it was not.
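One way to tell a preallocated-but-incomplete file from a finished download is aria2c's control file: aria2c keeps a <file>.aria2 control file next to an unfinished download and removes it only once the transfer has really completed. A minimal check, assuming default aria2c behaviour and the /hd/delta path used here:
# sketch: detect an unfinished aria2c download despite a fully preallocated file size
if [ -e /hd/delta/blazegraph/data.jnl.aria2 ]; then
  echo "data.jnl is still incomplete (aria2 control file present)"
else
  echo "no control file - aria2c considers data.jnl complete"
fi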
git clone the private-wikidata-query repo
git clone https://github.com/scatter-llc/private-wikidata-query
mkdir private-wikidata-query/data
mv data.jnl private-wikidata-query/data
cd private-wikidata-query/data
# use the uid and gid expected by the container
chown 666:66 data.jnl
jh@wikidata:/hd/delta/blazegraph/private-wikidata-query/data$ ls -l
total 346081076
-rw-rw-r-- 1 666 66 1328514809856 May 2 22:07 data.jnl
start docker
docker compose up -d
WARN[0000] /hd/delta/blazegraph/private-wikidata-query/docker-compose.yml: the attribute `version` is obsolete, it will be ignored, please remove it to avoid potential confusion
[+] Running 3/3
✔ Container private-wikidata-query-wdqs-1 Started 0.4s
✔ Container private-wikidata-query-wdqs-proxy-1 Started 0.7s
✔ Container private-wikidata-query-wdqs-frontend-1 Started 1.1s
docker ps | grep wdqs
36dad88ebfdc wikibase/wdqs-frontend:wmde.11 "/entrypoint.sh ngin…" About an hour ago Up 3 minutes 0.0.0.0:8099->80/tcp, [::]:8099->80/tcp private-wikidata-query-wdqs-frontend-1
f0d273cca376 caddy "caddy run --config …" About an hour ago Up 3 minutes 80/tcp, 443/tcp, 2019/tcp, 443/udp private-wikidata-query-wdqs-proxy-1
d86124984e0f wikibase/wdqs:0.3.97-wmde.8 "/entrypoint.sh /run…" About an hour ago Up 3 minutes 0.0.0.0:9999->9999/tcp, [::]:9999->9999/tcp private-wikidata-query-wdqs-1
6011f5c1cc03 caddy "caddy run --config …" 12 months ago Up 3 days 80/tcp, 443/tcp, 2019/tcp, 443/udp wdqs-wdqs-proxy-1
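Once the containers are up, a quick smoke test against the wdqs container shows whether Blazegraph answers SPARQL at all (a sketch, assuming the standard WDQS endpoint path on the published port 9999; as the next section shows, it initially did not):
# minimal SPARQL smoke test against the wdqs container
curl -s -G "http://localhost:9999/bigdata/namespace/wdq/sparql" \
  -H "Accept: application/sparql-results+json" \
  --data-urlencode "query=SELECT * WHERE { ?s ?p ?o } LIMIT 1"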
Incompatible RWStore header version
docker logs private-wikidata-query-wdqs-1 2>&1 | grep -m 1 "Incompatible RWStore header version"
java.lang.RuntimeException: java.lang.IllegalStateException: Incompatible RWStore header version: storeVersion=0, cVersion=1024, demispace: true
docker exec -it private-wikidata-query-wdqs-1 /bin/bash
diff RWStore.properties RWStore.properties.bak-20250503
--- RWStore.properties
+++ RWStore.properties.bak-20250503
@@ -56,6 +56,3 @@
{"valueType":"DOUBLE","multiplier":"1000000000","serviceMapping":"LATITUDE"},\
{"valueType":"LONG","multiplier":"1","minValue":"0","serviceMapping":"COORD_SYSTEM"}\
]}}
-
-# Added to fix Incompatible RWStore header version error
-com.bigdata.rwstore.RWStore.readBlobsAsync=false
The diff against the backup shows the property that was added to RWStore.properties (per its inline comment, to fix the header version error); after this change, restart the wdqs service:
docker compose restart wdqs
WARN[0000] /hd/delta/blazegraph/private-wikidata-query/docker-compose.yml: the attribute `version` is obsolete, it will be ignored, please remove it to avoid potential confusion
[+] Restarting 1/1
✔ Container private-wikidata-query-wdqs-1 Started 11.1s
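After the restart, the same grep can confirm that the error does not reappear in the fresh log output (a count of 0 means no new occurrence):
docker logs private-wikidata-query-wdqs-1 --since 5m 2>&1 | grep -c "Incompatible RWStore header version"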
Download multiple copies in parallel
The script below, named get, was used to download to /hd/delta and /hd/beta in parallel (run once per disk with DISK adjusted), for a total of 32 connections.
#!/bin/bash
# Corrected aria2 download script (no prealloc, true progress)
ISO_DATE=$(date +%Y%m%d)
DISK=beta
SESSION="bg_${DISK}_${ISO_DATE}"
FILENAME="data.jnl"
DOWNLOAD_DIR="/hd/$DISK/blazegraph"
URL="https://datasets.orbopengraph.com/blazegraph/data.jnl"
CONNECTIONS=16
# Kill stale session
if screen -list | grep -q "$SESSION"; then
echo "⚠️ Killing existing session '$SESSION'..."
screen -S "$SESSION" -X quit
fi
# Launch corrected aria2 download
screen -dmS "$SESSION" -L bash -c "
echo '[INFO] Starting corrected download...';
cd \"$DOWNLOAD_DIR\";
aria2c -c -x $CONNECTIONS -s $CONNECTIONS \
--file-allocation=none \
--auto-file-renaming=false \
--dir=\"$DOWNLOAD_DIR\" \
--out=\"$FILENAME\" \
\"$URL\";
EXIT_CODE=\$?;
if [ \$EXIT_CODE -eq 0 ]; then
echo '[INFO] Download finished.';
md5sum \"$FILENAME\" > \"$FILENAME.md5\";
echo '[INFO] MD5 saved to $FILENAME.md5';
else
echo \"[ERROR] Download failed with code \$EXIT_CODE\";
fi
"
echo "✅ Fixed download started in screen '$SESSION'."
echo "Monitor with: screen -r $SESSION"
md5 check
on source:
md5sum data.jnl
e891800af42b979159191487910bd9ae  data.jnl
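With the source hash known, a local copy can be verified against it directly (slow on a 1.3 TB file, which is why block-wise checks are used below); the expected-hash line follows md5sum's own "hash  filename" format:
echo "e891800af42b979159191487910bd9ae  /hd/delta/blazegraph/data.jnl" | md5sum -c -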
check script
at 95% progress
tail -f screenlog.0
FILE: /hd/delta/blazegraph/data.jnl
-------------------------------------------------------------------------------
*** Download Progress Summary as of Mon May 5 07:36:28 2025 ***
===============================================================================
[#152de7 1,241GiB/1,296GiB(95%) CN:16 DL:16MiB ETA:55m32s]
FILE: /hd/delta/blazegraph/data.jnl
-------------------------------------------------------------------------------
[#152de7 1,242GiB/1,296GiB(95%) CN:16 DL:21MiB ETA:43m39s]
./check
Comparing:
file1 → /hd/beta/blazegraph/data.jnl
file2 → /hd/delta/blazegraph/data.jnl
blocksize=1MB start=0MB
== log-5 ==
[ 0] 1 MB ✅ MD5 match
[ 1] 5 MB ✅ MD5 match
[ 2] 25 MB ✅ MD5 match
[ 3] 125 MB ✅ MD5 match
[ 4] 625 MB ✅ MD5 match
[ 5] 3,125 MB ✅ MD5 match
[ 6] 15,625 MB ✅ MD5 match
[ 7] 78,125 MB ✅ MD5 match
[ 8] 390,625 MB ✅ MD5 match
Summary: {'✅': 9}
== log-2 ==
100%|██████████████████████████████████████████| 21/21 [00:00<00:00, 212.25it/s]
Summary: {'✅': 21}
== linear-2000 ==
100%|████████████████████████████████████████| 663/663 [00:05<00:00, 116.57it/s]
Summary: {'✅': 607, '⚠️': 56}
== linear-500 ==
100%|██████████████████████████████████████| 2651/2651 [00:21<00:00, 122.85it/s]
Summary: {'✅': 2435, '⚠️': 216}
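The check script itself is not reproduced on this page. The idea can be sketched in a few lines of bash: hash the same fixed-size block at matching offsets in both copies and compare, here with the powers-of-5 offsets of the log-5 run above (the ⚠️ entries in the linear runs are blocks whose hashes did not match):
#!/bin/bash
# sketch: compare MD5 sums of single 1 MB blocks at matching offsets in two copies
FILE1=/hd/beta/blazegraph/data.jnl
FILE2=/hd/delta/blazegraph/data.jnl
for OFFSET_MB in 1 5 25 125 625 3125 15625 78125 390625; do
  H1=$(dd if="$FILE1" bs=1M skip="$OFFSET_MB" count=1 2>/dev/null | md5sum | cut -d' ' -f1)
  H2=$(dd if="$FILE2" bs=1M skip="$OFFSET_MB" count=1 2>/dev/null | md5sum | cut -d' ' -f1)
  [ "$H1" = "$H2" ] && echo "[$OFFSET_MB MB] MD5 match" || echo "[$OFFSET_MB MB] MD5 mismatch"
done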
at 100%
tail -15 screenlog.0
[#152de7 1,296GiB/1,296GiB(99%) CN:7 DL:12MiB ETA:6s]
FILE: /hd/delta/blazegraph/data.jnl
-------------------------------------------------------------------------------
[#152de7 1,296GiB/1,296GiB(99%) CN:1 DL:1.2MiB ETA:1s]
05/05 08:20:15 [NOTICE] Download complete: /hd/delta/blazegraph/data.jnl
Download Results:
gid |stat|avg speed |path/URI
======+====+===========+=======================================================
152de7|OK | 19MiB/s|/hd/delta/blazegraph/data.jnl
Status Legend:
(OK):download completed.
[INFO] Download finished.
tail -15 screenlog.0
[#50fb05 1,296GiB/1,296GiB(99%) CN:13 DL:19MiB ETA:6s]
FILE: /hd/beta/blazegraph/data.jnl
-------------------------------------------------------------------------------
[#50fb05 1,296GiB/1,296GiB(99%) CN:1 DL:1.5MiB]
05/05 08:15:19 [NOTICE] Download complete: /hd/beta/blazegraph/data.jnl
Download Results:
gid |stat|avg speed |path/URI
======+====+===========+=======================================================
50fb05|OK | 18MiB/s|/hd/beta/blazegraph/data.jnl
Status Legend:
(OK):download completed.
[INFO] Download finished.
fix aria2c issues using blockdownload
see https://github.com/WolfgangFahl/blockdownload
check the aria2c-downloaded journal file hashes (heads only)
dcheck \
--head-only \
--url https://datasets.orbopengraph.com/blazegraph/data.jnl \
/hd/beta/blazegraph/data.jnl \
/hd/delta/blazegraph/data.jnl
Processing /hd/beta/blazegraph/data.jnl... (file size: 1296.92 GB)
Block 2656/2656 1296.88-1296.92 GB: : 1.39TB [00:00, 4.88TB/s]
/hd/beta/blazegraph/data.jnl.yaml created with 2657 blocks (1328045.19 MB processed)
Processing /hd/delta/blazegraph/data.jnl... (file size: 1296.92 GB)
Block 2656/2656 1296.88-1296.92 GB: : 1.39TB [00:00, 4.77TB/s]
/hd/delta/blazegraph/data.jnl.yaml created with 2657 blocks (1328045.19 MB processed)
2657✅ 0❌ 0⚠️: : 1.39TB [00:00, 6.80TB/s] ]
Final: 2657✅ 0❌ 0⚠️
check with full MD5 checks of 500 MB blocks
dcheck \
--url https://datasets.orbopengraph.com/blazegraph/data.jnl \
/hd/beta/blazegraph/data.jnl \
/hd/delta/blazegraph/data.jnl
Processing /hd/beta/blazegraph/data.jnl... (file size: 1296.92 GB)
Processing data.jnl: 62%|██████████████████████▏ | 857G/1.39T [26:34<16:24, 544MB/s]
grep block blazegraph.yaml | wc -l
2652
check parts
md5sum blazegraph-2656.part
493923964d5840438d9d06e560eaf15b blazegraph-2656.part
wf@wikidata:/hd/eneco/blazegraph/blazegraph-2025-05-06$ grep 493923964d5840438d9d06e560eaf15b blazegraph.yaml -B2
path: blazegraph-2656.part
offset: 1392508928000
md5: 493923964d5840438d9d06e560eaf15b
check missing blocks
The block numbers recorded in blazegraph.yaml are extracted and sorted; the awk step prints every number missing between consecutive entries:
grep block blazegraph.yaml | cut -f2 -d: | sort -un | awk 'NR==1{prev=$1; next} {for(i=prev+1;i<$1;i++) print i; prev=$1}'
2
4
7
30
131
209
642
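The gaps were resolved by re-running blockdownload later (see the 2025-05-23 attempt below). Purely to illustrate the offsets involved, a single missing block could in principle be re-fetched with an HTTP range request; the block size of 500 * 1024 * 1024 bytes is inferred from the offset recorded for blazegraph-2656.part above, and the .part filename here is hypothetical:
# hypothetical single-block re-fetch via HTTP range request (not what was done on this page)
BLOCK=642                          # one of the missing block numbers listed above
BLOCK_SIZE=$((500 * 1024 * 1024))  # inferred: offset of block 2656 = 2656 * 524288000
START=$((BLOCK * BLOCK_SIZE))
END=$((START + BLOCK_SIZE - 1))
curl -s -r "$START-$END" -o "blazegraph-$BLOCK.part" \
  https://datasets.orbopengraph.com/blazegraph/data.jnl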
reassemble
first try
blockdownload --yaml blazegraph.yaml --output data.jnl --progress --name wikidata https://datasets.orbopengraph.com/blazegraph/data.jnl .
Creating target: 100%|███████████████████▉| 1.39T/1.39T [1:34:08<00:11, 310MB/s]created data.jnl - 1324545.19 MB
md5: 2106c6ae22ff4425b35834ad7bb65b07
File reassembled successfully: data.jnl
Creating target: 100%|███████████████████▉| 1.39T/1.39T [1:34:09<00:14, 246MB/s]
wrong md5sum, as expected with missing blocks: the reassembled 1324545.19 MB is 3500 MB (7 blocks of 500 MB) short of the full 1328045.19 MB
attempt 2025-05-23
cat bdy
# reassemble
DATE="2025-05-23"
blockdownload https://datasets.orbopengraph.com/blazegraph/data.jnl \
blazegraph-$DATE \
--name blazegraph \
--progress \
--yaml blazegraph-$DATE/data.jnl.yaml \
--output /hd/mantax/blazegraph/data.jnl
./bdy
Creating target: 100%|██████████| 1.39T/1.39T [1:00:24<00:00, 384MB/s]
created /hd/mantax/blazegraph/data.jnl - 1328045.19 MB
md5: 1f6a822d9015aa3356714d304eeb8471
File reassembled successfully: /hd/mantax/blazegraph/data.jnl
wf@wikidata:/hd/eneco/blazegraph/blazegraph-2025-05-23$ head -1 md5sums.txt
1f6a822d9015aa3356714d304eeb8471 data.jnl
ls -l /hd/mantax/blazegraph/data.jnl
-rw-rw-r-- 1 wf wf 1392556310528 May 24 13:39 data.jnl
md5sum /hd/mantax/blazegraph/data.jnl
1f6a822d9015aa3356714d304eeb8471 data.jnl
follow-up 2025-05-31
All the following was done in tmux session 0 under the `jh` user.
With the MD5 checksum verified, bring the stack up with Docker Compose and open a bash shell in the `wdqs` container (the one running Blazegraph).
docker compose up --build -d
docker compose exec wdqs bash
Inside the container, run this command to catch the database up to the present day:
while true; do /wdqs/runUpdate.sh; sleep 10; done
This creates an infinite loop that restarts the update script whenever it crashes; in most cases a crash can be overcome simply by restarting the script. The `sleep 10` prevents a tight loop of rapid restart attempts.
As of 05:05, 31 May 2025 (CEST) the data.jnl on /hd/delta is updated through the end of 2025-05-15. Wait a few days for it to catch up to present.
Next steps:
- Verify frontend can connect to query service
- Modify the retry loop one-liner to include logging of error events
- Formalize the one-liner into a script or Docker Compose configuration (a sketch follows below)
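A possible formalization of the retry loop that also logs error events (a sketch addressing the last two items above; the script name and log path are assumptions, not what currently runs in the container):
#!/bin/bash
# run-updater.sh - restart the WDQS updater whenever it exits and log each restart
LOG=/var/log/wdqs-updater-restarts.log
while true; do
  /wdqs/runUpdate.sh
  EXIT_CODE=$?
  echo "$(date --iso-8601=seconds) runUpdate.sh exited with code $EXIT_CODE, restarting in 10s" >> "$LOG"
  sleep 10
done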