Wikidata Import 2025-05-02

From BITPlan Wiki

Revision as of 06:23, 5 May 2025

Import

state:
url: https://wiki.bitplan.com/index.php/Wikidata_Import_2025-05-02
target: blazegraph
start: 2025-05-02
end: 2025-05-03
days: 0.6
os: Ubuntu 22.04.3 LTS
cpu: Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz (16 cores)
ram: 512
triples:
comment:


This "import" does not use the usual dump-and-index approach; instead it directly copies an existing blazegraph journal file (data.jnl).
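Before starting such a copy it is worth estimating how long it will take. A minimal sketch, using the data.jnl size reported by ls -l further down and an assumed aggregate throughput of 600 Mbit/s (a hypothetical figure, substitute your own measured rate):

```shell
#!/bin/bash
# Back-of-the-envelope transfer time for copying the journal file.
# SIZE_BYTES is the data.jnl size from ls -l; MBITS_PER_SEC is an
# assumption - substitute the rate you actually measure on your link.
SIZE_BYTES=1328514809856
MBITS_PER_SEC=600

# bytes -> bits, divided by bits per second, gives seconds
SECONDS_NEEDED=$(( SIZE_BYTES * 8 / (MBITS_PER_SEC * 1000000) ))
HOURS=$(( SECONDS_NEEDED / 3600 ))
echo "estimated transfer time: ~${HOURS} h (${SECONDS_NEEDED} s)"
```

At the assumed rate this comes out at roughly 4-5 hours, which is consistent with the "some 5 hours" observed below.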

Steps

Copy journal file

Source: the https://scatter.red/ wikidata installation. Using aria2c with 16 connections, the copy initially took some 5 hours but was interrupted. Since aria2c was used in preallocation mode and the script's final message was "download finished", the file looked complete, which it was not.
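A full-size file is therefore not proof of a finished download: in preallocation mode aria2c creates the file at its final size up front. While a transfer is unfinished, aria2c keeps a `<file>.aria2` control file next to the download and removes it on completion, so the absence of that control file is the reliable "done" signal. A small sketch of that check, demonstrated on a stand-in file (the real path would be something like /hd/delta/blazegraph/data.jnl):

```shell
#!/bin/bash
# aria2c keeps <file>.aria2 while a download is in progress and deletes
# it on successful completion - check for it instead of trusting the size.
check_aria2_complete() {
  if [ -e "$1.aria2" ]; then
    echo "INCOMPLETE: control file $1.aria2 still present"
    return 1
  fi
  echo "COMPLETE: no control file for $1"
}

# demo on a stand-in file, not the real journal
DEMO=$(mktemp -u /tmp/demo.XXXXXX).jnl
touch "$DEMO" "$DEMO.aria2"
FIRST=$(check_aria2_complete "$DEMO")    # control file present
rm "$DEMO.aria2"
SECOND=$(check_aria2_complete "$DEMO")   # control file gone
echo "$FIRST"
echo "$SECOND"
rm -f "$DEMO"
```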

git clone private-wikidata-query

git clone https://github.com/scatter-llc/private-wikidata-query
mkdir -p private-wikidata-query/data
mv data.jnl private-wikidata-query/data
cd private-wikidata-query/data
# use the uid and gid the container expects
chown 666:66 data.jnl
jh@wikidata:/hd/delta/blazegraph/private-wikidata-query/data$ ls -l
total 346081076
-rw-rw-r-- 1 666 66 1328514809856 May  2 22:07 data.jnl

start docker

docker compose up -d
WARN[0000] /hd/delta/blazegraph/private-wikidata-query/docker-compose.yml: the attribute `version` is obsolete, it will be ignored, please remove it to avoid potential confusion 
[+] Running 3/3
 ✔ Container private-wikidata-query-wdqs-1           Started               0.4s 
 ✔ Container private-wikidata-query-wdqs-proxy-1     Started               0.7s 
 ✔ Container private-wikidata-query-wdqs-frontend-1  Started               1.1s
docker ps | grep wdqs
36dad88ebfdc   wikibase/wdqs-frontend:wmde.11      "/entrypoint.sh ngin…"   About an hour ago   Up 3 minutes                    0.0.0.0:8099->80/tcp, [::]:8099->80/tcp                           private-wikidata-query-wdqs-frontend-1
f0d273cca376   caddy                               "caddy run --config …"   About an hour ago   Up 3 minutes                    80/tcp, 443/tcp, 2019/tcp, 443/udp                                private-wikidata-query-wdqs-proxy-1
d86124984e0f   wikibase/wdqs:0.3.97-wmde.8         "/entrypoint.sh /run…"   About an hour ago   Up 3 minutes                    0.0.0.0:9999->9999/tcp, [::]:9999->9999/tcp                       private-wikidata-query-wdqs-1
6011f5c1cc03   caddy                               "caddy run --config …"   12 months ago       Up 3 days                       80/tcp, 443/tcp, 2019/tcp, 443/udp                                wdqs-wdqs-proxy-1

Incompatible RWStore header version

see https://github.com/blazegraph/database/blob/master/bigdata-core/bigdata/src/java/com/bigdata/rwstore/RWStore.java

docker logs private-wikidata-query-wdqs-1 2>&1 | grep -m 1 "Incompatible RWStore header version"
java.lang.RuntimeException: java.lang.IllegalStateException: Incompatible RWStore header version: storeVersion=0, cVersion=1024, demispace: true
docker exec -it private-wikidata-query-wdqs-1 /bin/bash
diff RWStore.properties RWStore.properties.bak-20250503 
--- RWStore.properties
+++ RWStore.properties.bak-20250503
@@ -56,6 +56,3 @@
    {"valueType":"DOUBLE","multiplier":"1000000000","serviceMapping":"LATITUDE"},\
    {"valueType":"LONG","multiplier":"1","minValue":"0","serviceMapping":"COORD_SYSTEM"}\
   ]}}
-
-# Added to fix Incompatible RWStore header version error
-com.bigdata.rwstore.RWStore.readBlobsAsync=false
docker compose restart wdqs
WARN[0000] /hd/delta/blazegraph/private-wikidata-query/docker-compose.yml: the attribute `version` is obsolete, it will be ignored, please remove it to avoid potential confusion 
[+] Restarting 1/1
 ✔ Container private-wikidata-query-wdqs-1  Started                      11.1s
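The error above was found by grepping the container logs; the log line itself already names both version numbers. A small sketch that pulls them out so the mismatch is visible at a glance, demonstrated here on a captured log fragment (normally the input would come from `docker logs private-wikidata-query-wdqs-1 2>&1`):

```shell
#!/bin/bash
# Extract storeVersion/cVersion from a Blazegraph RWStore error line.
# LOG is a captured fragment here; in practice redirect docker logs to it.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
java.lang.RuntimeException: java.lang.IllegalStateException: Incompatible RWStore header version: storeVersion=0, cVersion=1024, demispace: true
EOF

MSG=""
if grep -q "Incompatible RWStore header version" "$LOG"; then
  # rewrite the matching line into a one-glance diagnostic
  MSG=$(sed -n 's/.*storeVersion=\([0-9]*\), cVersion=\([0-9]*\).*/journal header version \1 does not match expected \2/p' "$LOG")
fi
echo "$MSG"
rm -f "$LOG"
```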

Download multiple copies in parallel

The script below was named get and was used to copy to the two disks /hd/delta and /hd/beta in parallel, totalling 32 connections (16 per copy).

#!/bin/bash
# Corrected aria2 download script (no prealloc, true progress)

ISO_DATE=$(date +%Y%m%d)
DISK=beta
SESSION="bg_${DISK}_${ISO_DATE}"
FILENAME="data.jnl"
DOWNLOAD_DIR="/hd/$DISK/blazegraph"
URL="https://datasets.orbopengraph.com/blazegraph/data.jnl"
CONNECTIONS=16

# Kill stale session
if screen -list | grep -q "$SESSION"; then
  echo "⚠️ Killing existing session '$SESSION'..."
  screen -S "$SESSION" -X quit
fi

# Launch corrected aria2 download
screen -dmS "$SESSION" -L bash -c "
  echo '[INFO] Starting corrected download...';
  cd \"$DOWNLOAD_DIR\";
  aria2c -c -x $CONNECTIONS -s $CONNECTIONS \
    --file-allocation=none \
    --auto-file-renaming=false \
    --dir=\"$DOWNLOAD_DIR\" \
    --out=\"$FILENAME\" \
    \"$URL\";
  EXIT_CODE=\$?;
  if [ \$EXIT_CODE -eq 0 ]; then
    echo '[INFO] Download finished.';
    md5sum \"$FILENAME\" > \"$FILENAME.md5\";
    echo '[INFO] MD5 saved to $FILENAME.md5';
  else
    echo \"[ERROR] Download failed with code \$EXIT_CODE\";
  fi
"

echo "✅ Fixed download started in screen '$SESSION'."
echo "Monitor with: screen -r $SESSION"
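Once both screen sessions have finished, the two copies can be cross-checked against each other before either journal is used; if the checksums agree, a silent corruption in one copy is very unlikely. A minimal sketch, demonstrated on small stand-in files (the real paths would be /hd/delta/blazegraph/data.jnl and /hd/beta/blazegraph/data.jnl, and md5sum on 1.3 TB takes hours):

```shell
#!/bin/bash
# Compare two files by md5 checksum; returns success when they match.
same_checksum() {
  [ "$(md5sum "$1" | cut -d' ' -f1)" = "$(md5sum "$2" | cut -d' ' -f1)" ]
}

# demo with two identical small files standing in for the journals
A=$(mktemp)
B=$(mktemp)
printf 'blazegraph journal' > "$A"
printf 'blazegraph journal' > "$B"
if same_checksum "$A" "$B"; then
  RESULT="checksums match - copies are identical"
else
  RESULT="checksums differ - at least one copy is corrupt"
fi
echo "$RESULT"
rm -f "$A" "$B"
```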