I need to add a column to a PySpark dataframe based on a list of values.

a = spark.createDataFrame([("Dog", "Cat"), ("Cat", "Dog"), ("Mouse", "Cat")], ["Animal", "Enemy"])
I have a list called rating, which holds a rating for each pet:

rating = [5, 4, 1]

I need to append the dataframe with a column called Rating, like so:
+------+-----+------+
|Animal|Enemy|Rating|
+------+-----+------+
| Dog| Cat| 5|
| Cat| Dog| 4|
| Mouse| Cat| 1|
+------+-----+------+
I have done the following, but it only fills the Rating column with the first value of the list:

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

def add_labels():
    return rating.pop(0)

labels_udf = udf(add_labels, IntegerType())
new_df = a.withColumn('Rating', labels_udf()).cache()
Out:
+------+-----+------+
|Animal|Enemy|Rating|
+------+-----+------+
| Dog| Cat| 5|
| Cat| Dog| 5|
| Mouse| Cat| 5|
+------+-----+------+
Answer 0 (score: 4)
Hope this helps!
from pyspark.sql.functions import monotonically_increasing_id

#sample data
a = sqlContext.createDataFrame([("Dog", "Cat"), ("Cat", "Dog"), ("Mouse", "Cat")],
                               ["Animal", "Enemy"])
a.show()

#convert the list to a dataframe
rating = [5,4,1]
b = sqlContext.createDataFrame([(l,) for l in rating], ['Rating'])

#add a row index to both dataframes and join them to get the final result
a = a.withColumn("row_idx", monotonically_increasing_id())
b = b.withColumn("row_idx", monotonically_increasing_id())
final_df = a.join(b, a.row_idx == b.row_idx).drop("row_idx")
final_df.show()
Input:
+------+-----+
|Animal|Enemy|
+------+-----+
| Dog| Cat|
| Cat| Dog|
| Mouse| Cat|
+------+-----+
Output is:
+------+-----+------+
|Animal|Enemy|Rating|
+------+-----+------+
| Cat| Dog| 4|
| Dog| Cat| 5|
| Mouse| Cat| 1|
+------+-----+------+
Answer 1 (score: 2)
As @Tw UxTLi51Nus mentioned, if you can order the DataFrame, let's say by Animal, without changing your results, you can do the following:
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

def add_labels(indx):
    return rating[indx-1]  # since row num begins from 1

labels_udf = udf(add_labels, IntegerType())

a = spark.createDataFrame([("Dog", "Cat"), ("Cat", "Dog"), ("Mouse", "Cat")], ["Animal", "Enemy"])
a.createOrReplaceTempView('a')
a = spark.sql('select row_number() over (order by "Animal") as num, * from a')
a.show()
+---+------+-----+
|num|Animal|Enemy|
+---+------+-----+
| 1| Dog| Cat|
| 2| Cat| Dog|
| 3| Mouse| Cat|
+---+------+-----+
new_df = a.withColumn('Rating', labels_udf('num'))
new_df.show()
+---+------+-----+------+
|num|Animal|Enemy|Rating|
+---+------+-----+------+
| 1| Dog| Cat| 5|
| 2| Cat| Dog| 4|
| 3| Mouse| Cat| 1|
+---+------+-----+------+
Then drop the num column:
new_df.drop('num').show()
+------+-----+------+
|Animal|Enemy|Rating|
+------+-----+------+
| Dog| Cat| 5|
| Cat| Dog| 4|
| Mouse| Cat| 1|
+------+-----+------+
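As a side note, the UDF is not strictly necessary for this first approach. A minimal sketch, assuming Spark 2.4+ (for element_at): the driver-side list can be embedded as a literal array column and indexed by num natively, avoiding Python UDF serialization overhead.

from pyspark.sql import functions as F

# Build a literal array column from the driver-side rating list;
# element_at is 1-based, which matches the row_number() values in num.
# (Requires Spark 2.4+; earlier versions would need the UDF above.)
rating_arr = F.array(*[F.lit(r) for r in rating])
new_df = a.withColumn('Rating', F.element_at(rating_arr, F.col('num')))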
EDIT:
Another, but probably ugly and a bit inefficient, way, if you cannot sort by a column, is to go back to the rdd and do the following:
a = spark.createDataFrame([("Dog", "Cat"), ("Cat", "Dog"), ("Mouse", "Cat")],["Animal", "Enemy"])
# or create the rdd from the start:
# a = spark.sparkContext.parallelize([("Dog", "Cat"), ("Cat", "Dog"), ("Mouse", "Cat")])
a = a.rdd.zipWithIndex()
a = a.toDF()
a.show()
+-----------+---+
| _1| _2|
+-----------+---+
| [Dog,Cat]| 0|
| [Cat,Dog]| 1|
|[Mouse,Cat]| 2|
+-----------+---+
a = a.select(a._1.getItem('Animal').alias('Animal'),
             a._1.getItem('Enemy').alias('Enemy'),
             a._2.alias('num'))

def add_labels(indx):
    return rating[indx]  # indx here will start from zero

labels_udf = udf(add_labels, IntegerType())
new_df = a.withColumn('Rating', labels_udf('num'))
new_df.show()
+------+-----+---+------+
|Animal|Enemy|num|Rating|
+------+-----+---+------+
| Dog| Cat| 0| 5|
| Cat| Dog| 1| 4|
| Mouse| Cat| 2| 1|
+------+-----+---+------+
(I would not recommend this if you have a lot of data, though.)

Hope this helps, good luck!
Answer 2 (score: 1)
You can convert rating into an rdd

rating = [5,4,1]
ratingrdd = sc.parallelize(rating)

and then convert your dataframe into an rdd, zip each value of ratingrdd with the rdd of the dataframe, and convert the zipped rdd back into a dataframe:
sqlContext.createDataFrame(a.rdd.zip(ratingrdd).map(lambda x: (x[0][0], x[0][1], x[1])), ["Animal", "Enemy", "Rating"]).show()
which should give you
+------+-----+------+
|Animal|Enemy|Rating|
+------+-----+------+
| Dog| Cat| 5|
| Cat| Dog| 4|
| Mouse| Cat| 1|
+------+-----+------+
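For readability, the same one-liner can be unpacked step by step. One caveat: rdd.zip requires both RDDs to have the same number of partitions and the same number of elements per partition, so it can fail if the two were parallelized differently. A sketch under that assumption:

# Pair each Row of the dataframe with the matching rating, then flatten.
zipped = a.rdd.zip(ratingrdd)                      # ((Animal, Enemy), Rating)
flattened = zipped.map(lambda x: (x[0][0], x[0][1], x[1]))
sqlContext.createDataFrame(flattened, ["Animal", "Enemy", "Rating"]).show()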
Answer 3 (score: 0)
What you are trying to do does not work, because the rating list is in your driver's memory, whereas the a dataframe is in the executors' memory (the udf works on the executors too).

What you need to do is add keys to the ratings list, like so:
ratings = [('Dog', 5), ('Cat', 4), ('Mouse', 1)]
Then you create a ratings dataframe from the list and join both to add the new column:
ratings_df = spark.createDataFrame(ratings, ['Animal', 'Rating'])
new_df = a.join(ratings_df, 'Animal')
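To check the result, a quick show() works; since this is a join, the row order is not guaranteed (the values below are what the inner join on Animal must produce, order aside):

new_df.show()
# e.g. (rows may appear in any order):
# +------+-----+------+
# |Animal|Enemy|Rating|
# +------+-----+------+
# |   Cat|  Dog|     4|
# |   Dog|  Cat|     5|
# | Mouse|  Cat|     1|
# +------+-----+------+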
Answer 4 (score: 0)
I might be wrong, but I believe the accepted answer will not work. monotonically_increasing_id only guarantees that the IDs will be unique and increasing, not that they will be consecutive. Hence, using it on two different dataframes will likely create two very different columns, and the join will mostly return empty.
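To see why, here is a minimal sketch (assuming a spark session and more than one partition): monotonically_increasing_id encodes the partition index in the upper bits of the ID, so the values jump between partitions instead of running consecutively.

from pyspark.sql import functions as F

df = spark.range(6).repartition(3)
df.withColumn("mono_id", F.monotonically_increasing_id()).show()
# With 3 partitions the IDs can look like
# 0, 1, 8589934592, 8589934593, 17179869184, 17179869185
# (jumps of 2^33 between partitions), so two dataframes indexed
# separately will generally not share the same ID values.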
Taking inspiration from this answer to a similar question, https://stackoverflow.com/a/48211877/7225303, we can change the incorrect answer to:
from pyspark.sql.window import Window as W
from pyspark.sql import functions as F
a= sqlContext.createDataFrame([("Dog", "Cat"), ("Cat", "Dog"), ("Mouse", "Cat")],
["Animal", "Enemy"])
a.show()
+------+-----+
|Animal|Enemy|
+------+-----+
| Dog| Cat|
| Cat| Dog|
| Mouse| Cat|
+------+-----+
#convert list to a dataframe
rating = [5,4,1]
b = sqlContext.createDataFrame([(l,) for l in rating], ['Rating'])
b.show()
+------+
|Rating|
+------+
| 5|
| 4|
| 1|
+------+
a = a.withColumn("idx", F.monotonically_increasing_id())
b = b.withColumn("idx", F.monotonically_increasing_id())
windowSpec = W.orderBy("idx")
a = a.withColumn("idx", F.row_number().over(windowSpec))
b = b.withColumn("idx", F.row_number().over(windowSpec))
a.show()
+------+-----+---+
|Animal|Enemy|idx|
+------+-----+---+
| Dog| Cat| 1|
| Cat| Dog| 2|
| Mouse| Cat| 3|
+------+-----+---+
b.show()
+------+---+
|Rating|idx|
+------+---+
| 5| 1|
| 4| 2|
| 1| 3|
+------+---+
final_df = a.join(b, a.idx == b.idx).drop("idx")
final_df.show()
+------+-----+------+
|Animal|Enemy|Rating|
+------+-----+------+
| Dog| Cat| 5|
| Cat| Dog| 4|
| Mouse| Cat| 1|
+------+-----+------+