Question

I want to append a new column to the DataFrame "df", computed by the function get_distance:

def get_distance(x, y):
    dfDistPerc = hiveContext.sql("select column3 as column3 \
                                  from tab \
                                  where column1 = '" + x + "' \
                                  and column2 = " + y + " \
                                  limit 1")

    result = dfDistPerc.select("column3").take(1)
    return result

df = df.withColumn(
    "distance",
    lit(get_distance(df["column1"], df["column2"]))
)

However, I get:

TypeError: 'Column' object is not callable

I think this is because x and y are Column objects and I need to convert them to String to use them in my query. Am I right? If so, how can I do that?

Answer 1

You cannot apply a Python function directly to Column objects unless it is designed to operate on Column objects/expressions. You need a udf:
```
@udf
def get_distance(x, y):
    ...
```
But you cannot use SQLContext inside a udf (or in a mapper in general).

Just join:

from pyspark.sql.functions import first

tab = hiveContext.table("tab").groupBy("column1", "column2").agg(first("column3").alias("distance"))
df = df.join(tab, ["column1", "column2"])

Answer 2

Spark needs to know that the function you are using is not an ordinary function but a UDF.

So there are two ways to use a UDF on a DataFrame.

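Method 1: using the @udf annotation

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

@udf(returnType=IntegerType())
def get_distance(x, y):
    dfDistPerc = hiveContext.sql("select column3 as column3 \
                                  from tab \
                                  where column1 = '" + x + "' \
                                  and column2 = " + y + " \
                                  limit 1")

    result = dfDistPerc.select("column3").take(1)
    # take(1) returns a list of Row objects; unwrap the value, or None if no row matched
    return result[0]["column3"] if result else None

df = df.withColumn(
    "distance",
    get_distance(df["column1"], df["column2"])
)

Method 2: registering the UDF with pyspark.sql.functions.udf

# get_distance is the same function as in Method 1, defined without the decorator
calculate_distance_udf = udf(get_distance, IntegerType())

df = df.withColumn(
    "distance",
    calculate_distance_udf(df["column1"], df["column2"])
)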

TypeError: 'Column' object is not callable using WithColumn
