I am doing some analysis on the MovieLens data. In u.item, the data looks like this:
movie id | movie title | release date | video release date |
IMDb URL | unknown | Action | Adventure | Animation |
Children's | Comedy | Crime | Documentary | Drama | Fantasy |
Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi |
Thriller | War | Western |
1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0
2|GoldenEye (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?GoldenEye%20(1995)|0|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
3|Four Rooms (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Four%20Rooms%20(1995)|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
As you can see, columns 5 through 23 of this data encode the genres as 0s and 1s. I am trying to convert these 0/1 genre flags into numbers, for example:
unknown - 0
Action - 1
etc
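In other words, the desired mapping assigns each genre its position in the header. A minimal sketch of that mapping in plain Python (genre names copied from the u.item header above):

```python
# Genre columns of u.item, in header order
genres = ["unknown", "Action", "Adventure", "Animation", "Children's",
          "Comedy", "Crime", "Documentary", "Drama", "Fantasy",
          "Film-Noir", "Horror", "Musical", "Mystery", "Romance",
          "Sci-Fi", "Thriller", "War", "Western"]

# Map each genre name to its numeric code (its position in the list)
genre_code = {name: idx for idx, name in enumerate(genres)}

print(genre_code["unknown"])  # 0
print(genre_code["Action"])   # 1
```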
Below is what I have done so far:
def refineMovieDF(row):
    genre = []
    movieData = row.split("|")
    for i in range(len(movieData[5,25])):
        if movieData[i] == 1:
            genre.append(i)
    return Row(MovieId=movieData[0], Genre=genre)
movieDF = spark.read.load("ml-100k/u.item",format="csv",inferSchema=True, header=False)
movieRefined = movieDF.rdd.map(refineMovieDF).toDF().collect()
I am getting an error at the split step:
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 253, in main
process()
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 248, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 379, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1352, in takeUpToNumLeft
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/util.py", line 55, in wrapper
return f(*args, **kwargs)
File "/home/cloudera/workspace/MovielensAnalysis.py", line 13, in refineMovieDF
movieData =row.split("|")
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 1561, in __getattr__
raise AttributeError(item)
AttributeError: split
Is this approach correct?
How can I fix this error?
Answer 0 (score: 0)
I am assuming you want to combine all the 0/1 columns, from the unknown column through the last column, into one array per row, and then create a new array containing only the indices of the 1s. If so, here is what I would do.
I would first use functions.array to combine all the 0s and 1s:
from pyspark.sql import functions as F

cols = movieDF.columns
movieDF = movieDF.withColumn("genre", F.array(cols[5:]))
movieDF = movieDF.select(['movie id', 'movie title', 'genre'])
movieDF.show()
Here is the output:
+--------+-----------------+--------------------+
|movie id| movie title| genre|
+--------+-----------------+--------------------+
| 1| Toy Story (1995)|[0, 0, 0, 1, 1, 1...|
| 2| GoldenEye (1995)|[0, 1, 1, 0, 0, 0...|
| 3|Four Rooms (1995)|[0, 0, 0, 0, 0, 0...|
+--------+-----------------+--------------------+
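The slice `cols[5:]` works because the first five u.item columns are metadata and the remaining nineteen are genre flags. This can be sanity-checked without Spark (column names taken from the header shown in the question):

```python
# The 24 u.item column names in order: 5 metadata columns, then 19 genre flags
cols = (["movie id", "movie title", "release date", "video release date", "IMDb URL"]
        + ["unknown", "Action", "Adventure", "Animation", "Children's", "Comedy",
           "Crime", "Documentary", "Drama", "Fantasy", "Film-Noir", "Horror",
           "Musical", "Mystery", "Romance", "Sci-Fi", "Thriller", "War", "Western"])

genre_cols = cols[5:]   # everything after the metadata columns
print(len(genre_cols))  # 19 genre flag columns
print(genre_cols[0])    # 'unknown'
```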
Then I would use a udf to get the indices of all the 1s:
from pyspark.sql.types import ArrayType, IntegerType

def get_index_of_one(g):
    return [idx for idx, k in enumerate(g) if k == 1]

myudf = F.udf(lambda g: get_index_of_one(g), ArrayType(IntegerType()))
movieDF = movieDF.withColumn('genre2', myudf('genre'))
movieDF.show()
Here is the final output:
+--------+-----------------+--------------------+----------+
|movie id| movie title| genre| genre2|
+--------+-----------------+--------------------+----------+
| 1| Toy Story (1995)|[0, 0, 0, 1, 1, 1...| [3, 4, 5]|
| 2| GoldenEye (1995)|[0, 1, 1, 0, 0, 0...|[1, 2, 16]|
| 3|Four Rooms (1995)|[0, 0, 0, 0, 0, 0...| [16]|
+--------+-----------------+--------------------+----------+
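The core of get_index_of_one is Spark-free, so it can be verified in plain Python, using Toy Story's 19 genre flags from the sample data above as input:

```python
def get_index_of_one(g):
    # Return the positions of all 1s in the flag list
    return [idx for idx, k in enumerate(g) if k == 1]

# Toy Story's genre flags from the first sample row:
# 0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0
toy_story = [0, 0, 0, 1, 1, 1] + [0] * 13

print(get_index_of_one(toy_story))  # [3, 4, 5] -> Animation, Children's, Comedy
```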
Hope this is what you wanted.