I am loading a recent DBpedia dump, specifically short_abstracts_en.nt, available from http://data.dws.informatik.uni-mannheim.de/dbpedia/2014/en/short_abstracts_en.nt.bz2 (warning: 409M file). tdbloader2 fails to load it with:
org.apache.jena.riot.RiotException: [line: 1263473, col: 122] Not a hexadecimal character:
I can reproduce this error with riot --validate:
$JENA_HOME/bin/riot --validate /var/data/uncompressed/short_abstracts_en.nt
20:04:36 ERROR riot :: [line: 1263473, col: 122] Not a hexadecimal character:
Line 1263473 of the file looks like this:
<http://dbpedia.org/resource/Taiwanese_kana> <http://www.w3.org/2000/01/rdf-schema#comment> "Taiwanese kana (\u30BF\u30A \u30F2\u30A1\u30CC \u30AE\u30A \u30AB\u30A \u30D3\u30A7\u30F ) is a katakana-based writing system once used to write Holo Taiwanese, when Taiwan was ruled by Japan. It functioned as a phonetic guide to hanzi, much like furigana in Japanese or Zhuyin fuhao in Chinese. There were similar systems for other languages in Taiwan as well, including Hakka and Formosan languages.The system was imposed by Japan at the time, and used in a few dictionaries, as well as textbooks."@en .
Column 122 falls inside the run of unicode escapes: (\u30BF\u30A \u30F2\u30A1\u30CC \u30AE\u30A \u30AB\u30A \u30D3\u30A7\u30F ). By my count, column 122 is at \u30F2.
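As far as I can tell, the error message is literally correct: \u30F2 is a (valid) unicode character, not a hexadecimal character.
Why does Jena think it should be hexadecimal, and what can I do about it?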
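To narrow down which escape the parser is actually rejecting, it may help to scan the offending line independently of Jena. The following is a minimal Python sketch, not anything from Jena itself: the function name `find_bad_escapes` and the shortened `sample` string are made up for illustration. It relies on the N-Triples grammar, in which `\u` must be followed by exactly four hexadecimal digits and `\U` by exactly eight, and reports every escape on a line that does not meet that rule.

```python
HEX_DIGITS = set("0123456789abcdefABCDEF")

def find_bad_escapes(line):
    """Return (1-based column of the backslash, offending text) pairs for
    \\u / \\U escapes that lack the required number of hex digits."""
    bad = []
    i = 0
    while i < len(line):
        if line[i] == "\\" and i + 1 < len(line):
            kind = line[i + 1]
            if kind in "uU":
                need = 4 if kind == "u" else 8
                digits = line[i + 2:i + 2 + need]
                if len(digits) == need and all(c in HEX_DIGITS for c in digits):
                    i += 2 + need  # well-formed escape, skip past it
                else:
                    bad.append((i + 1, line[i:i + 2 + need]))
                    i += 2  # malformed: step past "\u" and keep scanning
                continue
            i += 2  # some other escape such as \\ or \", skip both characters
            continue
        i += 1
    return bad

# Shortened copy of the literal from line 1263473 (illustrative only).
sample = r'"Taiwanese kana (\u30BF\u30A \u30F2\u30A1\u30CC \u30AE\u30A \u30AB\u30A \u30D3\u30A7\u30F ) is"'

for column, text in find_bad_escapes(sample):
    print(column, repr(text))
```

The columns printed here are relative to the shortened `sample`, not to the full line in the dump, so they will not match the column in the riot message. To check the real data, pass the complete line read from the file instead.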