= Goal =
Replicate the original Blazegraph environment.
See https://github.com/wikimedia/wikidata-query-rdf/blob/master/docs/getting-started.md
== Git clone ==
<source lang='bash' highlight='1'>
git clone --recurse-submodules https://gerrit.wikimedia.org/r/wikidata/query/rdf wikidata-query-rdf
Cloning into 'wikidata-query-rdf'...
remote: Counting objects: 195, done
remote: Finding sources: 100% (161/161)
remote: Getting sizes: 100% (146/146)
remote: Compressing objects: 100% (131311/131311)
remote: Total 17957 (delta 17), reused 17930 (delta 9)
Receiving objects: 100% (17957/17957), 2.98 MiB | 3.37 MiB/s, done.
Resolving deltas: 100% (8953/8953), done.
</source>
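Since the repository is cloned with --recurse-submodules, it is worth checking that all submodules really got initialized before building. A minimal sketch using standard git commands, not part of the original walkthrough:
<source lang='bash'>
cd wikidata-query-rdf
# list submodule state; a leading '-' means the submodule is not initialized yet
git submodule status
# initialize/update any missing submodules (no-op if everything is already there)
git submodule update --init --recursive
</source>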
== mvn package ==
<source lang='bash' highlight='2-3,7-8,14'>
# Mac OS X
export JAVA_HOME=$(/usr/libexec/java_home -v 1.8.0_191)
java -version
java version "1.8.0_191"
Java(TM) SE Runtime Environment (build 1.8.0_191-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode)
cd wikidata-query-rdf
mvn -version
Apache Maven 3.6.0 (97c98ec64a1fdfee7767ce5ffb20918da4f719f3; 2018-10-24T20:41:47+02:00)
Maven home: /opt/local/share/java/maven3
Java version: 1.8.0_191, vendor: Oracle Corporation, runtime: /Library/Java/JavaVirtualMachines/jdk1.8.0_191.jdk/Contents/Home/jre
Default locale: de_DE, platform encoding: UTF-8
OS name: "mac os x", version: "10.13.6", arch: "x86_64", family: "mac"
mvn package
</source>
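The JAVA_HOME export above pins the exact update 191. On macOS the same selection can be done for whatever 1.8 JDK happens to be installed; this is only a small convenience sketch, not one of the original build steps:
<source lang='bash'>
# pick any installed Java 8 JDK instead of hard-coding the update number
export JAVA_HOME=$(/usr/libexec/java_home -v 1.8)
java -version
</source>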
=== issue ===
see https://phabricator.wikimedia.org/T263855
<source lang='bash'>
2020-09-11 13:45:39 INFO RocksDBStateBackend:899 - Attempting to load RocksDB native library and store it under '/var/folders/2t/2g54bjr10830rv00508_y13w0000gn/T/flink-io-6ec39247-613e-410a-a83f-712f841ce3a8'
dyld: lazy symbol binding failed: Symbol not found: ____chkstk_darwin
  Referenced from: /private/var/folders/2t/2g54bjr10830rv00508_y13w0000gn/T/flink-io-6ec39247-613e-410a-a83f-712f841ce3a8/rocksdb-lib-396871de50f5fa7595c1071b59c34498/librocksdbjni-osx.jnilib (which was built for Mac OS X 10.15)
  Expected in: /usr/lib/libSystem.B.dylib

dyld: Symbol not found: ____chkstk_darwin
  Referenced from: /private/var/folders/2t/2g54bjr10830rv00508_y13w0000gn/T/flink-io-6ec39247-613e-410a-a83f-712f841ce3a8/rocksdb-lib-396871de50f5fa7595c1071b59c34498/librocksdbjni-osx.jnilib (which was built for Mac OS X 10.15)
  Expected in: /usr/lib/libSystem.B.dylib
</source>
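The dyld message already names the cause: the bundled librocksdbjni-osx.jnilib was built for Mac OS X 10.15, while the build machine (see the mvn -version output above) runs 10.13.6, which does not provide ____chkstk_darwin. A quick way to confirm the version mismatch on the build machine, using plain macOS commands and nothing project-specific:
<source lang='bash'>
# show the macOS version of the build machine (10.13.6 here, while the RocksDB JNI lib expects 10.15)
sw_vers -productVersion
</source>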
== Try workaround ==
<source lang='bash'>
mvn package -D skipTests
</source>
== service unzip ==
Note the use of tar x instead of unzip.
<source lang='bash' highlight='1-4'>
cd dist/target/
tar xvfz service-0.3.50-SNAPSHOT-dist.tar.gz
cd service-0.3.50-SNAPSHOT
ls
RWStore.properties loadRestAPI.sh
blazegraph-service-0.3.50-SNAPSHOT.war munge.sh
createNamespace.sh mw-oauth-proxy-0.3.50-SNAPSHOT.war
default.properties mwservices.json
docs prefixes-sdc.conf
forAllCategoryWikis.sh prefixes.conf
jetty-runner-9.4.12.v20180830.jar runBlazegraph.sh
ldf-config.json runStreamingUpdater.sh
lib runUpdate.sh
loadCategoryDaily.sh summarizeEvents.sh
loadCategoryDump.sh wcqs-data-reload.sh
loadData.sh
</source>
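The munge step further down reads the dump from data/ inside the extracted service directory. A small preparation sketch; the data/ layout simply mirrors the paths used by the munge call on this page (placing the dump there is an assumption about where it was stored, not something the scripts enforce):
<source lang='bash'>
# prepare the directory layout that the munge step below refers to
cd service-0.3.50-SNAPSHOT
mkdir -p data data/split
# the dump downloaded in the next section is expected at data/latest-all.ttl.gz
</source>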
== download ==
The download of some 275 GB (256 GiB) took about 16 hours in total.
=== first try ===
<source lang='bash' highlight='1'>
wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.gz
--2020-09-25 17:29:02-- https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.gz
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 2620::861:1:208:80:154:7, 208.80.154.7
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|2620::861:1:208:80:154:7|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 275358642076 (256G) [application/octet-stream]
Saving to: ‘latest-all.ttl.gz’

latest-all.ttl.gz 16%[==>                 ] 43.12G  4.88MB/s   in 2h 39m

2020-09-25 20:23:55 (4.60 MB/s) - Read error at byte 46303657641/275358642076 (Resource temporarily unavailable, try again.). Retrying.

--2020-09-25 20:23:56-- (try: 2) https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.gz
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|2620::861:1:208:80:154:7|:443... connected.
HTTP request sent, awaiting response... 206 Partial Content
Length: 275358642076 (256G), 229054984435 (213G) remaining [application/octet-stream]
Saving to: ‘latest-all.ttl.gz’

latest-all.ttl.gz 100%[+++================>] 256.45G  5.01MB/s  in 13h 14m

2020-09-26 09:38:36 (4.58 MB/s) - ‘latest-all.ttl.gz’ saved [275358642076/275358642076]
</source>
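wget recovered from the read error on its own (note the 206 Partial Content response on the retry). If wget had been aborted completely, the same download could be resumed manually with a standard wget option; this was not needed in the session above:
<source lang='bash'>
# -c resumes a partially downloaded latest-all.ttl.gz instead of starting over
wget -c https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.gz
</source>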
=== second try ===
<source lang='bash' highlight='1'>
wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.gz
--2020-09-27 07:51:51-- https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.gz
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 2620::861:1:208:80:154:7, 208.80.154.7
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|2620::861:1:208:80:154:7|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 275358642076 (256G) [application/octet-stream]
Saving to: ‘latest-all.ttl.gz’

latest-all.ttl.gz 100%[===================>] 256.45G  4.65MB/s  in 16h 4m

2020-09-27 23:56:45 (4.54 MB/s) - ‘latest-all.ttl.gz’ saved [275358642076/275358642076]
</source>
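Since the munge step below fails with a parse error even after re-downloading, it is worth ruling out transfer corruption. gzip can verify the compressed stream end to end; this is a generic check that was not part of the original session and only validates the gzip container, not the Turtle content:
<source lang='bash'>
# verify that the 256G gzip stream is intact; no output and exit code 0 means OK
gzip -t latest-all.ttl.gz && echo "gzip stream OK"
</source>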
+ | |||
+ | = munge = | ||
+ | <source lang='bash' highlight='1'> | ||
+ | nohup ./munge.sh -f data/latest-all.ttl.gz -d data/split -l en,de & | ||
+ | </source> | ||
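Because munge.sh runs in the background via nohup, progress has to be watched from the outside. The commands below are generic shell; the log file name comes from nohup and the split file pattern (wikidump-NNNNNNNNN.ttl.gz) from the log shown further down:
<source lang='bash'>
# follow the munge log written by nohup
tail -f nohup.out
# watch the split files appear in the output directory
ls -lh data/split
</source>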
== Issue ==
=== Problem ===
<pre>
tail -f nohup.out
#logback.classic pattern: %d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n
07:44:31.400 [org.wikidata.query.rdf.tool.rdf.AsyncRDFHandler$RDFActionsReplayer] INFO org.wikidata.query.rdf.tool.Munge - Switching to data/split/wikidump-000000001.ttl.gz
07:44:37.263 [main] ERROR org.wikidata.query.rdf.tool.Munge - Fatal error munging RDF
org.openrdf.rio.RDFParseException: Expected '.', found 's' [line 595492]
    at org.openrdf.rio.helpers.RDFParserHelper.reportFatalError(RDFParserHelper.java:440)
    at org.openrdf.rio.helpers.RDFParserBase.reportFatalError(RDFParserBase.java:685)
    at org.openrdf.rio.turtle.TurtleParser.reportFatalError(TurtleParser.java:1405)
    at org.openrdf.rio.turtle.TurtleParser.verifyCharacterOrFail(TurtleParser.java:1227)
    at org.openrdf.rio.turtle.TurtleParser.parseStatement(TurtleParser.java:261)
    at org.openrdf.rio.turtle.TurtleParser.parse(TurtleParser.java:214)
    at org.wikidata.query.rdf.tool.Munge.run(Munge.java:105)
    at org.wikidata.query.rdf.tool.Munge.main(Munge.java:59)
</pre>
=== Diagnosis ===
* https://phabricator.wikimedia.org/T178211
* https://stackoverflow.com/questions/48020506/wikidata-on-local-blazegraph-expected-an-rdf-value-here-found-line-1
* try downloading again: the problem persists
* https://stackoverflow.com/questions/8151380/how-to-get-few-lines-from-a-gz-compressed-file-without-uncompressing
* [https://stackoverflow.com/a/64131082/1497139 awk one-liner to get a line range from a gzipped file] (see the sketch below)
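To look at the dump around the line the parser complains about (595492 according to the error above) without decompressing the whole 256 GB file to disk, the stream can be cut with gunzip and awk. This is a generic sketch in the spirit of the linked one-liner; the exact answer behind the link may differ:
<source lang='bash'>
# print lines 595482-595502 of the gzipped dump, then stop reading the stream
gunzip -c latest-all.ttl.gz | awk 'NR>=595482 && NR<=595502 {print NR": "$0} NR>595502 {exit}'
</source>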
+ | |||
+ | === Therapy === | ||
+ | * report https://phabricator.wikimedia.org/T263969 and wait for response |