How can I use the Amazon example dataset with Spark's movie-recommendation example?

Date: 2017-05-21 09:49:45

Tags: apache-spark

I want to feed the Amazon data (metadata.json) into the Spark movie-recommendation example.

The movie-recommendation example uses the format below, but the Amazon data uses string IDs rather than integers.

Below is the source code of the movie-recommendation example.


[Spark MLlib example - movie recommendation]

https://databricks-training.s3.amazonaws.com/movie-recommendation-with-mllib.html

Here is the MovieLens dataset format the example expects:

UserID::MovieID::Rating::Timestamp // ratings.dat format
MovieID::Title::Genres // movies.dat format

import java.io.File
import org.apache.spark.mllib.recommendation.Rating

val ratings = sc.textFile(new File(movieLensHomeDir, "ratings.dat").toString).map { line =>
  val fields = line.split("::")
  // format: (timestamp % 10, Rating(userId, movieId, rating))
  (fields(3).toLong % 10, Rating(fields(0).toInt, fields(1).toInt, fields(2).toDouble))
}

val movies = sc.textFile(new File(movieLensHomeDir, "movies.dat").toString).map { line =>
  val fields = line.split("::")
  // format: (movieId, movieName)
  (fields(0).toInt, fields(1))
}.collect().toMap
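For context, the tutorial goes on to train an ALS model from these ratings; a minimal sketch (the hyperparameter values below are illustrative, not the tutorial's tuned ones):

import org.apache.spark.mllib.recommendation.ALS

// Train a collaborative-filtering model on the parsed ratings.
// `ratings` is an RDD[(Long, Rating)], so .values drops the timestamp key.
val model = ALS.train(ratings.values, 8, 10, 0.1) // rank, iterations, lambda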

[Amazon review dataset - metadata]

(sample record from metadata.json omitted; a reformatted version appears in the answer below)

  1. I want to parse this JSON file and feed it into the Spark example, but I don't know how to convert the string IDs (asin, title, ...) into unique integer IDs, or how to get results from there (see the sketch after this list).

  2. I had been parsing it with SQLContext, but it suddenly stopped working and I'd like to know another way. It worked fine at first, but then errors started appearing; could it be that the JSON file format is corrupted?

    (screenshot: SQLContext_error.jpg)
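One possible way to handle the string-to-integer mapping in question 1 (a sketch, not from the original post; it assumes the metadata has been loaded into a DataFrame named metadata) is to give each distinct asin a unique integer with zipWithUniqueId:

// Hypothetical sketch: build a lookup table from string asin to integer ID.
val metadata = sqlContext.read.json("metadata.json") // assumes line-delimited JSON
val asinToId: Map[String, Int] = metadata.select("asin").rdd
  .map(_.getString(0))   // extract the asin string from each Row
  .distinct()
  .zipWithUniqueId()     // pair each distinct asin with a unique Long
  .mapValues(_.toInt)
  .collect()
  .toMap

The inverse map (or a join on the same table) can then translate the model's integer output back into asin values.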

1 Answer:

Answer 0 (score: 0)

Spark does not work with typical JSON files. Spark expects each line of the JSON file to be a complete JSON object. That is why a regular multi-line/array JSON file fails and you end up with [_corrupt_record: string]. So you have to change the JSON slightly for Spark. I modified your JSON file and it works without any problems:

{"asin": "0000031852", "title": "Girls Ballet Tutu Zebra Hot Pink", "price": 3.17,"imUrl": "http://ecx.images-amazon.com/images/I/51fAmVkTbyL._SY300_.jpg","related": {  "also_bought": ["B00JHONN1S", "B002BZX8Z6"],    "also_viewed":["B002BZX8Z6", "B00JHONN1S", "B008F0SU0Y", "B00D23MC6W"],    "bought_together": ["B002BZX8Z6"]  },  "salesRank": {"Toys & Games": 211836},  "brand": "Coxlures",  "categories": [["Sports & Outdoors", "Other Sports", "Dance"]]}    
{"asin": "0000031853", "title": "AmazonTitle", "price": 5.20,"imUrl": "http://ecx.images-amazon.com/images/I/51fAmVkTbyL._SY300_.jpg","related": {  "also_bought": ["B00JHONN1S", "B002BZX8Z6"],    "also_viewed":["B002BZX8Z6", "B00JHONN1S", "B008F0SU0Y", "B00D23MC6W"],    "bought_together": ["B002BZX8Z6"]  },  "salesRank": {"Toys & Games": 211836},  "brand": "Coxlures",  "categories": [["Sports & Outdoors", "Other Sports", "Dance"]]}

Here are the code and the output:

val df = sqlContext.read.json("metadata.json")
df: org.apache.spark.sql.DataFrame = [asin: string, brand: string, categories: array<array<string>>, imUrl: string, price: double, related: struct<also_bought:array<string>,also_viewed:array<string>,bought_together:array<string>>, salesRank: struct<Toys & Games:bigint>, title: string]
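As a side note beyond the original answer: since Spark 2.2 the JSON reader also accepts a multiLine option that can parse a regular multi-line JSON document directly, so reformatting the file into one-object-per-line is no longer the only way. A sketch, assuming a SparkSession named spark:

// Spark 2.2+ only: read JSON records that span multiple lines.
val df = spark.read
  .option("multiLine", true)
  .json("metadata.json")
df.printSchema()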