WikiData Import 2020-09-11

From BITPlan Wiki
Latest revision as of 05:53, 30 September 2020

Goal

Replicate the original Blazegraph environment; see https://github.com/wikimedia/wikidata-query-rdf/blob/master/docs/getting-started.md

Git clone

git clone  --recurse-submodules https://gerrit.wikimedia.org/r/wikidata/query/rdf wikidata-query-rdf
Cloning into 'wikidata-query-rdf'...
remote: Counting objects: 195, done
remote: Finding sources: 100% (161/161)
remote: Getting sizes: 100% (146/146)
remote: Compressing objects: 100% (131311/131311)
remote: Total 17957 (delta 17), reused 17930 (delta 9)
Receiving objects: 100% (17957/17957), 2.98 MiB | 3.37 MiB/s, done.
Resolving deltas: 100% (8953/8953), done.

mvn package

# Mac OS X 
export JAVA_HOME=$(/usr/libexec/java_home -v 1.8.0_191)
java -version
java version "1.8.0_191"
Java(TM) SE Runtime Environment (build 1.8.0_191-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode)
cd wikidata-query-rdf
mvn -version
Apache Maven 3.6.0 (97c98ec64a1fdfee7767ce5ffb20918da4f719f3; 2018-10-24T20:41:47+02:00)
Maven home: /opt/local/share/java/maven3
Java version: 1.8.0_191, vendor: Oracle Corporation, runtime: /Library/Java/JavaVirtualMachines/jdk1.8.0_191.jdk/Contents/Home/jre
Default locale: de_DE, platform encoding: UTF-8
OS name: "mac os x", version: "10.13.6", arch: "x86_64", family: "mac"
mvn package

issue

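mvn package fails on this machine (macOS 10.13.6): the RocksDB native library loaded by Flink was built for Mac OS X 10.15 and references a symbol that the older system library does not provide. See https://phabricator.wikimedia.org/T263855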

2020-09-11 13:45:39 INFO  RocksDBStateBackend:899 - Attempting to load RocksDB native library and store it under '/var/folders/2t/2g54bjr10830rv00508_y13w0000gn/T/flink-io-6ec39247-613e-410a-a83f-712f841ce3a8'
dyld: lazy symbol binding failed: Symbol not found: ____chkstk_darwin
  Referenced from: /private/var/folders/2t/2g54bjr10830rv00508_y13w0000gn/T/flink-io-6ec39247-613e-410a-a83f-712f841ce3a8/rocksdb-lib-396871de50f5fa7595c1071b59c34498/librocksdbjni-osx.jnilib (which was built for Mac OS X 10.15)
  Expected in: /usr/lib/libSystem.B.dylib

dyld: Symbol not found: ____chkstk_darwin
  Referenced from: /private/var/folders/2t/2g54bjr10830rv00508_y13w0000gn/T/flink-io-6ec39247-613e-410a-a83f-712f841ce3a8/rocksdb-lib-396871de50f5fa7595c1071b59c34498/librocksdbjni-osx.jnilib (which was built for Mac OS X 10.15)
  Expected in: /usr/lib/libSystem.B.dylib
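
The dyld message names Mac OS X 10.15 as the build target of the native library, while Maven reports 10.13.6 for the build host. A minimal sketch of a check that makes this mismatch visible before running the build; `minor_of` and `version_lt` are hypothetical helper names, not part of wikidata-query-rdf, and only the major.minor fields are compared.

```shell
#!/bin/sh
# Hypothetical helper: print the second dot-separated field of a version,
# or 0 when the version has no minor part (e.g. "11").
minor_of() {
  case $1 in
    *.*) rest=${1#*.}; echo "${rest%%.*}" ;;
    *)   echo 0 ;;
  esac
}

# Hypothetical helper: succeeds (exit 0) when version $1 is older than
# version $2, comparing major and minor numbers only.
version_lt() {
  a_major=${1%%.*}; b_major=${2%%.*}
  a_minor=$(minor_of "$1"); b_minor=$(minor_of "$2")
  [ "$a_major" -lt "$b_major" ] && return 0
  [ "$a_major" -gt "$b_major" ] && return 1
  [ "$a_minor" -lt "$b_minor" ]
}

# Usage on the build host (macOS only, sw_vers prints e.g. 10.13.6):
#   if version_lt "$(sw_vers -productVersion)" 10.15; then
#     echo "host is older than the library's 10.15 build target"
#   fi
```

The check only reports the mismatch; it does not fix the build.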

Try workaround

mvn package -DskipTests

service unzip

Note the use of tar x instead of unzip: the distribution is a .tar.gz archive.

cd dist/target/
tar xvfz service-0.3.50-SNAPSHOT-dist.tar.gz
cd service-0.3.50-SNAPSHOT
ls
RWStore.properties                     loadRestAPI.sh
blazegraph-service-0.3.50-SNAPSHOT.war munge.sh
createNamespace.sh                     mw-oauth-proxy-0.3.50-SNAPSHOT.war
default.properties                     mwservices.json
docs                                   prefixes-sdc.conf
forAllCategoryWikis.sh                 prefixes.conf
jetty-runner-9.4.12.v20180830.jar      runBlazegraph.sh
ldf-config.json                        runStreamingUpdater.sh
lib                                    runUpdate.sh
loadCategoryDaily.sh                   summarizeEvents.sh
loadCategoryDump.sh                    wcqs-data-reload.sh
loadData.sh

download

The download of about 275 GB (275358642076 bytes, shown by wget as 256G) took about 16 hours in total.

first try

wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.gz
--2020-09-25 17:29:02--  https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.gz
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 2620::861:1:208:80:154:7, 208.80.154.7
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|2620::861:1:208:80:154:7|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 275358642076 (256G) [application/octet-stream]
Saving to: ‘latest-all.ttl.gz’

latest-all.ttl.gz    16%[==>                 ]  43.12G  4.88MB/s    in 2h 39m  

2020-09-25 20:23:55 (4.60 MB/s) - Read error at byte 46303657641/275358642076 (Resource temporarily unavailable, try again.). Retrying.

--2020-09-25 20:23:56--  (try: 2)  https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.gz
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|2620::861:1:208:80:154:7|:443... connected.
HTTP request sent, awaiting response... 206 Partial Content
Length: 275358642076 (256G), 229054984435 (213G) remaining [application/octet-stream]
Saving to: ‘latest-all.ttl.gz’

latest-all.ttl.gz   100%[+++================>] 256.45G  5.01MB/s    in 13h 14m 

2020-09-26 09:38:36 (4.58 MB/s) - ‘latest-all.ttl.gz’ saved [275358642076/275358642076]
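
In the first try wget recovered from the read error by itself: the second request was answered with 206 Partial Content and only the remaining bytes were fetched. If wget had been stopped instead, a similar resume could be attempted by hand with `-c`. The sketch below assumes that the file behind latest-all.ttl.gz has not changed between attempts (otherwise a resumed download would mix two dumps); `has_expected_size` and `resume_dump` are hypothetical helper names.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when the file has exactly the expected
# number of bytes (the "Length:" value that wget printed).
has_expected_size() {
  file=$1; expected=$2
  actual=$(wc -c < "$file" | tr -d ' ')
  [ "$actual" = "$expected" ]
}

# Hypothetical helper: continue a partial download instead of restarting.
# -c (--continue) makes wget request only the bytes missing on disk.
resume_dump() {
  wget -c "https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.gz"
}

# Usage:
#   has_expected_size latest-all.ttl.gz 275358642076 || resume_dump
```

A matching size only shows that the transfer is complete; it says nothing about the content of the archive.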

second try

wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.gz
--2020-09-27 07:51:51--  https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.gz
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 2620::861:1:208:80:154:7, 208.80.154.7
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|2620::861:1:208:80:154:7|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 275358642076 (256G) [application/octet-stream]
Saving to: ‘latest-all.ttl.gz’

latest-all.ttl.gz   100%[===================>] 256.45G  4.65MB/s    in 16h 4m  

2020-09-27 23:56:45 (4.54 MB/s) - ‘latest-all.ttl.gz’ saved [275358642076/275358642076]
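
The figures in the wget output can be cross-checked with a little arithmetic. The sketch assumes that wget's "G" and "MB/s" are binary units (2^30 and 2^20 bytes), which is consistent with 275358642076 bytes being displayed as 256.45G; `bytes_to_gib` and `transfer_minutes` are hypothetical helper names.

```shell
#!/bin/sh
# Hypothetical helper: byte count expressed in GiB with two decimals.
bytes_to_gib() {
  LC_ALL=C awk -v bytes="$1" 'BEGIN { printf "%.2f\n", bytes / 1073741824 }'
}

# Hypothetical helper: transfer time in whole minutes for a byte count and
# an average rate in MiB/s (assumption: wget's "MB/s" means 2^20 bytes/s).
transfer_minutes() {
  LC_ALL=C awk -v bytes="$1" -v rate="$2" \
    'BEGIN { printf "%d\n", bytes / (rate * 1048576) / 60 }'
}

# Usage:
#   bytes_to_gib 275358642076           # prints 256.45
#   transfer_minutes 275358642076 4.54  # prints 964
```

964 minutes are 16 h 4 m, which matches the duration reported for the second try at an average of 4.54 MB/s.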

munge

nohup ./munge.sh -f data/latest-all.ttl.gz -d data/split -l en,de &
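
Before starting a long munge run it can be worth checking that the archive itself is readable. A minimal sketch using `gzip -t`; `gz_is_intact` is a hypothetical helper name. The test reads the whole archive, so on the full dump it will take a long time, and it only detects a truncated or corrupted gzip stream, not syntax problems in the Turtle data inside it.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when the gzip stream can be decompressed
# cleanly up to its end; error messages from gzip are suppressed.
gz_is_intact() {
  gzip -t "$1" 2>/dev/null
}

# Usage:
#   gz_is_intact data/latest-all.ttl.gz && echo "archive is readable"
```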

Issue

Problem

tail -f nohup.out 
#logback.classic pattern: %d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n
07:44:31.400 [org.wikidata.query.rdf.tool.rdf.AsyncRDFHandler$RDFActionsReplayer] INFO  org.wikidata.query.rdf.tool.Munge - Switching to data/split/wikidump-000000001.ttl.gz
07:44:37.263 [main] ERROR org.wikidata.query.rdf.tool.Munge - Fatal error munging RDF
org.openrdf.rio.RDFParseException: Expected '.', found 's' [line 595492]
	at org.openrdf.rio.helpers.RDFParserHelper.reportFatalError(RDFParserHelper.java:440)
	at org.openrdf.rio.helpers.RDFParserBase.reportFatalError(RDFParserBase.java:685)
	at org.openrdf.rio.turtle.TurtleParser.reportFatalError(TurtleParser.java:1405)
	at org.openrdf.rio.turtle.TurtleParser.verifyCharacterOrFail(TurtleParser.java:1227)
	at org.openrdf.rio.turtle.TurtleParser.parseStatement(TurtleParser.java:261)
	at org.openrdf.rio.turtle.TurtleParser.parse(TurtleParser.java:214)
	at org.wikidata.query.rdf.tool.Munge.run(Munge.java:105)
	at org.wikidata.query.rdf.tool.Munge.main(Munge.java:59)
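
The parser reports line 595492. A generic way to look at the text around that line without unpacking the whole archive is to stream it through awk and stop after the last wanted line. This is a sketch, assuming that the reported number counts lines of the uncompressed dump; `gz_line_range` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print lines FROM..TO (1-based, inclusive) of a
# gzipped text file. awk exits right after line TO, so the rest of the
# archive is not decompressed.
gz_line_range() {
  gzip -dc "$1" | awk -v from="$2" -v to="$3" \
    'NR >= from { print } NR >= to { exit }'
}

# Usage:
#   gz_line_range data/latest-all.ttl.gz 595490 595494
```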

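Diagnosis

* https://phabricator.wikimedia.org/T178211
* https://stackoverflow.com/questions/48020506/wikidata-on-local-blazegraph-expected-an-rdf-value-here-found-line-1
* Tried downloading again: the problem persists.
* https://stackoverflow.com/questions/8151380/how-to-get-few-lines-from-a-gz-compressed-file-without-uncompressing
* awk one-liner to get a line range from a gzipped file: https://stackoverflow.com/a/64131082/1497139

Therapy

* Reported as https://phabricator.wikimedia.org/T263969; waiting for a response.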