I have a PySpark DataFrame with a single column and ten rows. I dropped the other columns in the code above. It looks like this:
+--------------------+
|          movieTitle|
+--------------------+
|Across the Sea of...|
|Dog of Flanders, ...|
|      Bootmen (2000)|
|Relax... It's Jus...|
|Mating Habits of ...|
|        Belly (1998)|
|       Taffin (1988)|
|Love and Other Ca...|
|Shattered Image (...|
|Price Above Rubie...|
+--------------------+
I need to print the first 5 rows with an index, in the following format:
Movies recommended for you:
1: Silence of the Lambs, The (1991)
2: Saving Private Ryan (1998)
3: Godfather, The (1972)
4: Star Wars: Episode 6 - A New Hope (1977)
5: Shawshank Redemption, The (1994)
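It doesn't have to be those exact movies, just that format. I tried converting it to an RDD and to a pandas DataFrame and iterating over it, but both attempts raised errors. Is there a simple way to do this?
Thanks!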
Answer 0 (score: 1)
You can use collect() to build a list of the values in the movieTitle column, and then simply iterate over it:
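# collect() returns a list of Row objects; row[0] is the movieTitle value.
movies_list = df.select("movieTitle").collect()
n = 5
for i in range(n):
    # Print a 1-based index followed by the title.
    print("%s: %s" % (i + 1, movies_list[i][0]))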
Output:
1: Silence of the Lambs, The (1991)
2: Saving Private Ryan (1998)
3: Godfather, The (1972)
4: Star Wars: Episode 6 - A New Hope (1977)
5: Shawshank Redemption, The (1994)
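As a variation on the loop above, the formatting can be kept separate from Spark so that it is easy to check on its own. The following is only a sketch: `format_recommendations` is a hypothetical helper name, and the sample titles are taken from the question's table. It relies on `enumerate(..., start=1)` and slicing, so it does not fail when fewer than n titles are available.

```python
def format_recommendations(titles, n=5):
    """Return a header line followed by the first n titles, numbered from 1."""
    lines = ["Movies recommended for you:"]
    # Slicing is safe even when the list has fewer than n entries.
    for index, title in enumerate(titles[:n], start=1):
        lines.append(f"{index}: {title}")
    return lines


# Sample data copied from the question's table (a hypothetical stand-in for
# the values collected from the DataFrame).
sample_titles = ["Bootmen (2000)", "Belly (1998)", "Taffin (1988)"]

for line in format_recommendations(sample_titles, n=5):
    print(line)
```

With a real DataFrame, the titles could be obtained with `[row[0] for row in df.select("movieTitle").take(5)]`, which brings only five rows to the driver instead of collecting the whole column.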
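If you want to add an index to the PySpark DataFrame itself, you can use row_number. (I am using a window without partitioning, which should be fine for data of your size.)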
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, col, concat, lit

# Window with no partitioning, ordered by title; Spark moves all rows to one partition.
w = Window.orderBy("movieTitle")
# Number the rows, prefix each title with its number, then drop the helper column.
df = (
    df.withColumn("row_num", row_number().over(w))
    .withColumn("movieTitle", concat(col("row_num"), lit(": "), col("movieTitle")))
    .drop("row_num")
)
movies_list = df.select("movieTitle").collect()
n = 5
for i in range(n):
    print(movies_list[i][0])
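One thing to keep in mind with the window approach: because the window is ordered by movieTitle, the row numbers follow the sorted order of the titles, not the original order of the rows. The plain-Python sketch below only mimics that numbering for illustration; `number_sorted_titles` is a hypothetical name, and the titles are sample values from the question.

```python
def number_sorted_titles(titles):
    """Mimic row_number() over a window ordered by title: sort, then number from 1."""
    return [f"{i}: {title}" for i, title in enumerate(sorted(titles), start=1)]


# Sample titles from the question, in their original (unsorted) order.
sample_titles = ["Bootmen (2000)", "Belly (1998)", "Taffin (1988)"]

for line in number_sorted_titles(sample_titles):
    print(line)
```

Here "Belly (1998)" receives number 1 even though it was listed second. If the original row order matters, the window would need to be ordered by a column that encodes that order.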