
Time: 2014-10-17 00:10:36

Tags: java unicode rdf jena

I'm loading a recent DBpedia dump, specifically short_abstracts_en.nt from http://data.dws.informatik.uni-mannheim.de/dbpedia/2014/en/short_abstracts_en.nt.bz2 (warning: 409M file). tdbloader2 fails to load it with:

org.apache.jena.riot.RiotException: [line: 1263473, col: 122] Not a hexadecimal character:

I can reproduce the error with riot --validate:

$JENA_HOME/bin/riot --validate /var/data/uncompressed/short_abstracts_en.nt
20:04:36 ERROR riot                 :: [line: 1263473, col: 122] Not a hexadecimal character:


<http://dbpedia.org/resource/Taiwanese_kana> <http://www.w3.org/2000/01/rdf-schema#comment> "Taiwanese kana (\u30BF\u30A \u30F2\u30A1\u30CC \u30AE\u30A \u30AB\u30A \u30D3\u30A7\u30F ) is a katakana-based writing system once used to write Holo Taiwanese, when Taiwan was ruled by Japan. It functioned as a phonetic guide to hanzi, much like furigana in Japanese or Zhuyin fuhao in Chinese. There were similar systems for other languages in Taiwan as well, including Hakka and Formosan languages.The system was imposed by Japan at the time, and used in a few dictionaries, as well as textbooks."@en .

Column 122 falls within a run of Unicode escapes: (\u30BF\u30A \u30F2\u30A1\u30CC \u30AE\u30A \u30AB\u30A \u30D3\u30A7\u30F ), with column 122 at \u30F2.

The error message is correct in that \u30F2 is a (valid) Unicode character, not a hexadecimal character.
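
One way to narrow this down is to scan the offending line for malformed escapes. N-Triples expects exactly four hex digits after \u (and eight after \U), and in the triple as quoted above several escapes, such as the "\u30A " right before \u30F2, appear to carry only three digits before a space. The sketch below is a minimal, stand-alone Java check and not Jena's parser; the class name UcharCheck and the method badEscapeColumns are hypothetical, and the columns it reports are plain 1-based string offsets that may not line up exactly with RIOT's column counting.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: finds backslash-u / backslash-U escapes with too few hex digits.
class UcharCheck {

    // Strict ASCII hex test: 0-9, a-f, A-F.
    static boolean isHex(char c) {
        return (c >= '0' && c <= '9')
            || (c >= 'a' && c <= 'f')
            || (c >= 'A' && c <= 'F');
    }

    // Returns the 1-based column of the first non-hex character inside each
    // malformed escape. If the line ends mid-escape, the reported column is
    // one past the end of the line.
    static List<Integer> badEscapeColumns(String line) {
        List<Integer> columns = new ArrayList<>();
        int i = 0;
        while (i < line.length()) {
            if (line.charAt(i) != '\\' || i + 1 >= line.length()) {
                i++;
                continue;
            }
            char kind = line.charAt(i + 1);
            // Other escapes (quote, backslash, n, t, ...) need no hex digits.
            int digits = kind == 'u' ? 4 : kind == 'U' ? 8 : 0;
            int j = i + 2;
            int end = j + digits;
            while (j < end && j < line.length() && isHex(line.charAt(j))) {
                j++;
            }
            if (j < end) {
                columns.add(j + 1); // position of the offending character
            }
            i = j;
        }
        return columns;
    }

    public static void main(String[] args) {
        // Shortened sample: the second escape has three hex digits, then a space.
        String sample = "x \\u30BF\\u30A y";
        System.out.println(badEscapeColumns(sample)); // prints [14]
    }
}
```

Running badEscapeColumns over the reported line (for example after extracting it with sed -n '1263473p') would list every escape that is cut short, which helps tell a truncated escape in the dump apart from a parser problem.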


0 answers:
