Calling a map function on a DataFrame

Time: 2019-08-04 12:53:03

Tags: python dataframe pyspark

I am doing some analysis on the MovieLens data. In u.item, the data has this form:

 movie id | movie title | release date | video release date |
 IMDb URL | unknown | Action | Adventure | Animation |
 Children's | Comedy | Crime | Documentary | Drama | Fantasy |
 Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi |
 Thriller | War | Western |

1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0
2|GoldenEye (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?GoldenEye%20(1995)|0|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
3|Four Rooms (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Four%20Rooms%20(1995)|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0

As you can see, columns 5 through 23 of this data represent the genres as 0s and 1s. So I am trying to convert these genres from 0/1 flags into numbers (a short sketch follows the list below), for example:

unknown - 0
Action - 1
etc
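To make the goal concrete, here is a minimal plain-Python sketch (no Spark) of the conversion I am after, using the Toy Story line from above:

line = "1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0"
fields = line.split("|")
# the 19 genre flags sit in fields 5..23; collect the positions of the 1s
genres = [i for i, flag in enumerate(fields[5:24]) if flag == "1"]
print(genres)  # [3, 4, 5] -> Animation, Children's, Comedy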

Below is what I have done so far:

from pyspark.sql import Row

def refineMovieDF(row):
    genre = []
    movieData = row.split("|")
    for i in range(len(movieData[5:24])):
        if movieData[i] == 1:
            genre.append(i)
    return Row(MovieId=movieData[0], Genre=genre)

movieDF = spark.read.load("ml-100k/u.item", format="csv", inferSchema=True, header=False)

movieRefined = movieDF.rdd.map(refineMovieDF).toDF().collect()

I am getting an error on the split:

Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 253, in main
    process()
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 248, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 379, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1352, in takeUpToNumLeft
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/util.py", line 55, in wrapper
    return f(*args, **kwargs)
  File "/home/cloudera/workspace/MovielensAnalysis.py", line 13, in refineMovieDF
    movieData =row.split("|")
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 1561, in __getattr__
    raise AttributeError(item)
AttributeError: split

Is this approach correct?

How can I fix this error?

1 answer:

Answer 0 (score: 0):

I assume you want to combine all the 0/1 columns, from unknown through the last column, into one array per row, and then create a new array containing only the indices of the 1s. If so, here is what I would do. (Incidentally, this also explains your error: spark.read.load already parses the file, so each element of movieDF.rdd is a Row object, not a string, and Row has no split method, hence the AttributeError.) I would first combine all the 0s and 1s using functions.array:

from pyspark.sql import functions as F

cols = movieDF.columns  # with header=False these are _c0, _c1, ...; selecting 'movie id' below assumes renamed columns
movieDF = movieDF.withColumn("genre", F.array(cols[5:]))
movieDF = movieDF.select(['movie id', 'movie title', 'genre'])
movieDF.show()

Here is the output:

+--------+-----------------+--------------------+
|movie id|      movie title|               genre|
+--------+-----------------+--------------------+
|       1| Toy Story (1995)|[0, 0, 0, 1, 1, 1...|
|       2| GoldenEye (1995)|[0, 1, 1, 0, 0, 0...|
|       3|Four Rooms (1995)|[0, 0, 0, 0, 0, 0...|
+--------+-----------------+--------------------+

Then I would use a udf to get the indices of all the 1s:

from pyspark.sql.types import ArrayType, IntegerType

def get_index_of_one(g):
    # return the positions of all 1s in the 0/1 genre array
    return [idx for idx, k in enumerate(g) if k == 1]

myudf = F.udf(lambda g: get_index_of_one(g), ArrayType(IntegerType()))
movieDF = movieDF.withColumn('genre2', myudf('genre'))
movieDF.show()

And here is the final output:

+--------+-----------------+--------------------+----------+
|movie id|      movie title|               genre|    genre2|
+--------+-----------------+--------------------+----------+
|       1| Toy Story (1995)|[0, 0, 0, 1, 1, 1...| [3, 4, 5]|
|       2| GoldenEye (1995)|[0, 1, 1, 0, 0, 0...|[1, 2, 16]|
|       3|Four Rooms (1995)|[0, 0, 0, 0, 0, 0...|      [16]|
+--------+-----------------+--------------------+----------+
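As a side note, if you would rather avoid a Python udf, here is a sketch of the same idea using only built-in functions (posexplode is available from Spark 2.1; the column names above are assumed):

from pyspark.sql import functions as F

# explode the genre array together with each element's position,
# keep only the 1s, and collect their positions back per movie
exploded = movieDF.select('movie id', 'movie title',
                          F.posexplode('genre').alias('pos', 'flag'))
result = (exploded.filter(F.col('flag') == 1)
          .groupBy('movie id', 'movie title')
          .agg(F.sort_array(F.collect_list('pos')).alias('genre2')))
result.show()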

Hope this is what you are looking for.
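And if you would rather keep your original map-based approach, reading the file as plain text gives you strings that you can split. A sketch (assuming the SparkSession is named spark, as in your code):

from pyspark.sql import Row

def refineMovie(line):
    fields = line.split("|")  # works here: textFile yields plain strings
    genre = [i for i, flag in enumerate(fields[5:24]) if flag == "1"]
    return Row(MovieId=fields[0], Genre=genre)

# note: u.item is latin-1 encoded, so non-ASCII titles may need
# sc.textFile(..., use_unicode=False) plus manual decoding
movieRefined = spark.sparkContext.textFile("ml-100k/u.item").map(refineMovie).toDF()
movieRefined.show()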