I am using the CDH5 QuickStart VM and I have a file like this (not shown in full):
{"user_id": "kim95",
"type": "Book",
"title": "Modern Database Systems: The Object Model, Interoperability, and
Beyond.",
"year": "1995",
"publisher": "ACM Press and Addison-Wesley",
"authors": {},
"source": "DBLP"
}
{"user_id": "marshallo79",
"type": "Book",
"title": "Inequalities: Theory of Majorization and Its Application.",
"year": "1979",
"publisher": "Academic Press",
"authors": {("Albert W. Marshall"), ("Ingram Olkin")},
"source": "DBLP"
}
I used this script:
books = load 'data/book-seded.json'
using JsonLoader('t1:tuple(user_id:chararray,type:chararray,title:chararray,year:chararray,publisher:chararray,source:chararray,authors:bag{T:tuple(author:chararray)})');
STORE books INTO 'book-no-seded.tsv';
The script runs, but the generated file is empty. Any ideas?
Answer 0 (score: 1)
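In the end, only this format worked. If I add or remove whitespace so that it differs from this layout, I get an error. (I also added a "name" key to the author tuples, used null where the value is empty, and swapped the order of authors and source, but even without those changes the original would still have been wrong.)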
{"user_id": "kim95", "type": "Book","title": "Modern Database Systems: The Object Model, Interoperability, and Beyond.", "year": "1995", "publisher": "ACM Press and Addison-Wesley", "authors": [{"name":null}], "source": "DBLP"}
{"user_id": "marshallo79", "type": "Book", "title": "Inequalities: Theory of Majorization and Its Application.", "year": "1979", "publisher": "Academic Press", "authors": [{"name":"Albert W. Marshall"},{"name":"Ingram Olkin"}], "source": "DBLP"}
The working script is this one:
books = load 'data/book-seded-workings-reduced.json'
using JsonLoader('user_id:chararray,type:chararray,title:chararray,year:chararray,publisher:chararray,authors:{(name:chararray)},source:chararray');
STORE books INTO 'book-table.csv'; -- whether .tsv or .csv
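Since the loader turned out to be this picky about the input, it can help to check the file before handing it to Pig. The following is only a rough Python sketch, not part of the Pig job: it assumes one JSON object per line, as in the working sample above, and it assumes that keys have to appear in the same order as in the schema, which is what my experience above suggests but which I have not verified against the JsonLoader source. The names `EXPECTED_KEYS`, `check_line` and `check_file` are made up for this illustration.

```python
import json

# Assumption: field order copied from the schema in the working script above.
EXPECTED_KEYS = ["user_id", "type", "title", "year", "publisher", "authors", "source"]


def check_line(line):
    """Return a list of problems found in one line of input (empty if none)."""
    try:
        record = json.loads(line)
    except ValueError as err:  # json.JSONDecodeError is a subclass of ValueError
        return ["not valid JSON: %s" % err]
    if not isinstance(record, dict):
        return ["top-level value is not a JSON object"]
    problems = []
    # json.loads keeps keys in file order, so this also checks the ordering.
    if list(record) != EXPECTED_KEYS:
        problems.append("keys differ from schema order: %s" % list(record))
    authors = record.get("authors")
    if not isinstance(authors, list) or not all(
        isinstance(a, dict) and list(a) == ["name"] for a in authors
    ):
        problems.append('authors is not a list of {"name": ...} objects')
    return problems


def check_file(path):
    """Print every problem with its line number and return the problem count."""
    count = 0
    with open(path, encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # ignore blank lines
            for problem in check_line(line):
                print("line %d: %s" % (number, problem))
                count += 1
    return count
```

Calling `check_file('data/book-seded-workings-reduced.json')` on a local copy of the file should report nothing if every line matches the layout shown above. A record spread over several lines, as in the question, is reported as invalid JSON on each of its lines.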
Answer 1 (score: 0)
Try STORE books INTO 'book-no-seded.tsv' USING org.apache.pig.piggybank.storage.JsonStorage();
Answer 2 (score: 0)
You need to make sure the LOAD schema is correct. You can try DUMP books for a quick check.
When we used the Pig JsonLoader in this tutorial, http://gethue.com/hadoop-tutorials-ii-1-prepare-the-data-for-analysis/, we had to be careful with both the input data and the schema.