Question

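I have a pyspark data frame with two columns, A and B. I need to process the rows of B differently depending on the value in column A. In plain pandas I could do this: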

import pandas as pd

funcDict = {}
funcDict['f1'] = (lambda x: x + 1000)
funcDict['f2'] = (lambda x: x * x)
df = pd.DataFrame([['a', 1], ['b', 2], ['b', 3], ['a', 4]], columns=['A', 'B'])
df['newCol'] = df.apply(
    lambda x: funcDict['f1'](x['B']) if x['A'] == 'a' else funcDict['f2'](x['B']),
    axis=1,
)

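The simple ways I can think of doing this in (py)spark are

Use files

Read the data into a data frame
Partition by column A and write to separate files (write.partitionBy)
Read each file back and process it separately

or

Use expr

Read the data into a data frame
Write a clumsy expr (from a readability/maintenance point of view) that conditionally does something different depending on the column value
This will not look anywhere near as "clean" as the pandas code above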

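Is there any other way to handle this requirement? I expect the first approach to be cleaner but to run longer because of the partitioned write and read; the second is worse from a code standpoint and harder to extend and maintain.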

More fundamentally, would you choose to use something entirely different (e.g. message queues), relative latency differences notwithstanding?

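Edit 1

Based on my limited knowledge of pyspark, the solution proposed by user pissall (https://stackoverflow.com/users/8805315/pissall) works as long as the processing is not very complex. If it is, I do not know how to do this without resorting to UDFs, which come with their own disadvantages. Consider the simple example below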

from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

sparkSession = SparkSession.builder.getOrCreate()

# create a 2-column data frame
# where I wish to extract the city
# in column B differently based on
# the type given in column A
# This requires taking a different
# substring (prefix or suffix) from column B
df = sparkSession.createDataFrame([
  (1, "NewYork_NY"),
  (2, "FL_Miami"),
  (1, "LA_CA"),
  (1, "Chicago_IL"),
  (2, "PA_Kutztown")
], ["A", "B"])

# create UDFs to get left and right substrings
# I do not know how to avoid creating UDFs
# for this type of processing
getCityLeft = udf(lambda x: x[0:-3], StringType())
getCityRight = udf(lambda x: x[3:], StringType())

# apply UDFs
df = df.withColumn("city", F.when(F.col("A") == 1, getCityLeft(F.col("B")))
                            .otherwise(getCityRight(F.col("B"))))

Is there a simpler way to do this without resorting to UDFs? If I use expr I can do it, but as I mentioned earlier, it does not look elegant.

Answer 1

How about using when?

import pyspark.sql.functions as F

df = df.withColumn("transformed_B", F.when(F.col("A") == "a", F.col("B") + 1000).otherwise(F.col("B") * F.col("B")))
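If the per-value logic is kept in a mapping, as with funcDict in the question, the equivalent SQL CASE expression can be generated instead of written by hand. The following is only a minimal sketch: the helper names `sql_literal` and `build_case_expr` are hypothetical, it assumes the branch bodies are already valid Spark SQL snippets, and it only builds the expression string without running it against Spark.

```python
def sql_literal(value):
    """Render a Python int, float or str as a SQL literal (hypothetical helper)."""
    if isinstance(value, str):
        # Assumption: keep the sketch simple by refusing strings that would need escaping.
        if "'" in value or "\\" in value:
            raise ValueError("string keys containing quotes or backslashes are not supported")
        return "'" + value + "'"
    return str(value)


def build_case_expr(key_col, rules, default):
    """Build 'CASE WHEN key_col = k THEN expr ... ELSE default END' from a dict."""
    if not rules:
        raise ValueError("at least one rule is required")
    branches = " ".join(
        f"WHEN {key_col} = {sql_literal(key)} THEN {expr}"
        for key, expr in rules.items()
    )
    return f"CASE {branches} ELSE {default} END"


# Same logic as the pandas example: add 1000 when A is 'a', otherwise square B.
arith_expr = build_case_expr("A", {"a": "B + 1000"}, "B * B")

# Same logic as the city example, using Spark SQL's substring and length
# (substring positions are 1-based).
city_expr = build_case_expr(
    "A", {1: "substring(B, 1, length(B) - 3)"}, "substring(B, 4)"
)

print(arith_expr)
print(city_expr)
```

The resulting strings are intended to be passed to F.expr, for example df.withColumn("city", F.expr(city_expr)). Adding a new case then means adding one dictionary entry rather than editing a long expression.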

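Edit after further clarification of the question:

You can split on _ and pick the first or second part depending on your condition.

Is this the expected output?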

df.withColumn("city", F.when(F.col("A") == 1, F.split("B", "_")[0]).otherwise(F.split("B", "_")[1])).show()

+---+-----------+--------+
|  A|          B|    city|
+---+-----------+--------+
|  1| NewYork_NY| NewYork|
|  2|   FL_Miami|   Miami|
|  1|      LA_CA|      LA|
|  1| Chicago_IL| Chicago|
|  2|PA_Kutztown|Kutztown|
+---+-----------+--------+

UDF approach:

from pyspark.sql.types import StringType

def sub_string(ref_col, city_col):
    # ref_col is the reference column (A) and city_col is the string we want to sub (B)
    if ref_col == 1:
        return city_col[0:-3]
    return city_col[3:]

sub_str_udf = F.udf(sub_string, StringType())
df = df.withColumn("city", sub_str_udf(F.col("A"), F.col("B")))
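Because the UDF body is ordinary Python, it can be checked without a Spark session. The sketch below compares the slicing logic with a delimiter-based version on the sample rows from the question; `split_city` is a hypothetical plain-Python stand-in for the split approach, and the "Austin_Texas" row is made up to show where the two diverge.

```python
def sub_string(ref_col, city_col):
    # Slicing version: assumes the state part is exactly two characters plus "_".
    if ref_col == 1:
        return city_col[0:-3]
    return city_col[3:]


def split_city(ref_col, city_col):
    # Delimiter version (hypothetical helper): city is before "_" for type 1, after it otherwise.
    parts = city_col.split("_")
    return parts[0] if ref_col == 1 else parts[1]


rows = [
    (1, "NewYork_NY"),
    (2, "FL_Miami"),
    (1, "LA_CA"),
    (1, "Chicago_IL"),
    (2, "PA_Kutztown"),
]

# On the sample data both versions agree.
for a, b in rows:
    assert sub_string(a, b) == split_city(a, b)

# Hypothetical row whose state part is longer than two characters:
# slicing keeps part of the suffix, splitting does not.
print(sub_string(1, "Austin_Texas"))
print(split_city(1, "Austin_Texas"))
```

The fixed-width slicing is only safe while every state code is two characters long; splitting on the delimiter does not depend on that.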

Also, take a look at: remove last few characters in PySpark dataframe column
