I want to feed Amazon data (metadata.json) into the Spark MLlib movie recommendation example.
The movie recommendation example uses the format below, but the Amazon data uses strings instead of integers.
Here is the source of the movie recommendation example.
Method
[Spark MLlib example - movie recommendation]
https://databricks-training.s3.amazonaws.com/movie-recommendation-with-mllib.html
This is the format the Amazon dataset would need to fit:
UserID::MovieID::Rating::Timestamp // ratings.dat format
MovieID::Title::Genres // movies.dat format
import java.io.File
import org.apache.spark.mllib.recommendation.Rating

val ratings = sc.textFile(new File(movieLensHomeDir, "ratings.dat").toString).map { line =>
  val fields = line.split("::")
  // format: (timestamp % 10, Rating(userId, movieId, rating))
  // the last digit of the timestamp is later used as a key to split the
  // data into training/validation/test sets
  (fields(3).toLong % 10, Rating(fields(0).toInt, fields(1).toInt, fields(2).toDouble))
}

val movies = sc.textFile(new File(movieLensHomeDir, "movies.dat").toString).map { line =>
  val fields = line.split("::")
  // format: (movieId, movieName)
  (fields(0).toInt, fields(1))
}.collect().toMap
[Amazon review dataset - metadata]
I want to parse this json file and plug it into the Spark example, but I don't know how to convert the string IDs (asin, title, ...) into unique integer IDs, or how to get a result from there.
I kept trying to parse it with the SQL parser, but it suddenly stopped working and I would like to know another way. It worked at first, and then errors started appearing, so maybe the json file format is corrupted? Roughly what I was running is sketched below.
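A minimal reconstruction of my attempt (the file name is from my local setup):

val df = sqlContext.read.json("metadata.json")
df.printSchema()
// on my file this only prints:
// root
//  |-- _corrupt_record: string (nullable = true)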
Answer 0 (score: 0)
Spark does not work with a typical json file. Spark's json reader expects each line of the file to be a complete json object, one object per line. That is why a regular multi-line/array json file fails and you get a [_corrupt_record: string] schema. So you have to change the json slightly for Spark. I modified your json file and it loaded without any problem --
{"asin": "0000031852", "title": "Girls Ballet Tutu Zebra Hot Pink", "price": 3.17,"imUrl": "http://ecx.images-amazon.com/images/I/51fAmVkTbyL._SY300_.jpg","related": { "also_bought": ["B00JHONN1S", "B002BZX8Z6"], "also_viewed":["B002BZX8Z6", "B00JHONN1S", "B008F0SU0Y", "B00D23MC6W"], "bought_together": ["B002BZX8Z6"] }, "salesRank": {"Toys & Games": 211836}, "brand": "Coxlures", "categories": [["Sports & Outdoors", "Other Sports", "Dance"]]}
{"asin": "0000031853", "title": "AmazonTitle", "price": 5.20,"imUrl": "http://ecx.images-amazon.com/images/I/51fAmVkTbyL._SY300_.jpg","related": { "also_bought": ["B00JHONN1S", "B002BZX8Z6"], "also_viewed":["B002BZX8Z6", "B00JHONN1S", "B008F0SU0Y", "B00D23MC6W"], "bought_together": ["B002BZX8Z6"] }, "salesRank": {"Toys & Games": 211836}, "brand": "Coxlures", "categories": [["Sports & Outdoors", "Other Sports", "Dance"]]}
Here is the code & output:
val rdd = sqlContext.read.json("metadata.json")
rdd: org.apache.spark.sql.DataFrame = [asin: string, brand: string, categories: array<array<string>>, imUrl: string, price: double, related: struct<also_bought:array<string>,also_viewed:array<string>,bought_together:array<string>>, salesRank: struct<Toys & Games:bigint>, title: string]
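As for the second part of the question, mapping the string IDs to the unique integer IDs that MLlib's Rating needs, here is a minimal sketch using zipWithIndex. The reviews.json file name and its reviewerID/asin/overall fields are assumptions based on the Amazon review dataset's usual schema; adjust them to your actual files.

import org.apache.spark.mllib.recommendation.Rating

// Assumes a review file in the same one-json-object-per-line format, with the
// usual Amazon review fields (reviewerID, asin, overall) - adjust to your data.
val reviews = sqlContext.read.json("reviews.json")

// Assign each distinct string ID a unique Long index, narrowed to Int for MLlib.
val userIdMap = reviews.select("reviewerID").distinct().rdd
  .map(_.getString(0)).zipWithIndex().mapValues(_.toInt).collectAsMap()
val productIdMap = reviews.select("asin").distinct().rdd
  .map(_.getString(0)).zipWithIndex().mapValues(_.toInt).collectAsMap()

// Broadcast the lookup tables and build the Rating RDD the ALS example expects.
val userB = sc.broadcast(userIdMap)
val productB = sc.broadcast(productIdMap)
val ratings = reviews.select("reviewerID", "asin", "overall").rdd.map { row =>
  Rating(userB.value(row.getString(0)), productB.value(row.getString(1)), row.getDouble(2))
}

Note that collectAsMap() pulls every distinct ID to the driver, which is fine at this dataset's scale; for a much larger ID space you would keep the mapping as an RDD and join instead of broadcasting. Keeping productIdMap around also lets you translate recommendation results back to asin values (and, via metadata.json, to titles).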