Question

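I would really like to be able to run complex functions over an entire column of a Spark DataFrame, the way I do in Pandas with the apply function.

For example, in Pandas I have an apply function that takes a messy domain such as sub-subdomain.subdomain.facebook.co.nz/somequerystring and outputs just facebook.com.

How would I do this in Spark?

I have looked at UDFs, but it is not clear to me how to run one on a single column.

Say I have a simple function like the one below, which extracts the different parts of a date from a Pandas DataFrame column: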

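import datetime

# Assumption: `now` was defined elsewhere in the original code as today's date
now = datetime.date.today()

def format_date(row):
    # The slice positions depend on the layout of the Contract_Renewal string
    year = int(row['Contract_Renewal'][7:])
    month = int(row['Contract_Renewal'][4:6])
    day = int(row['Contract_Renewal'][:3])
    date = datetime.date(year, month, day)
    return date - now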

In Pandas, I would call it like this:

df['days_until'] = df.apply(format_date, axis=1)

Can I achieve the same thing in PySpark?

Answer 1

In this case, you can use regexp_extract (http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=substring#pyspark.sql.functions.regexp_extract), regexp_replace (http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=substring#pyspark.sql.functions.regexp_replace) and split to reformat the date string.

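It is not as clean as defining your own function and using apply the way you would in Pandas, but it should perform better than defining a Pandas/Spark UDF, because the built-in functions run inside Spark itself and do not have to ship each row to a Python worker.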
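For comparison, here is a minimal sketch of the UDF route applied to a single column. The helper names days_until and with_days_until are hypothetical, and the dd/MM/yyyy layout is again an assumption. The PySpark imports sit inside the wrapper only so that the plain Python function can be tried without a Spark installation.

```python
import datetime


def days_until(raw, today):
    """Parse an assumed dd/MM/yyyy string and return whole days from `today`."""
    if raw is None:
        return None  # let nulls pass through instead of raising
    day, month, year = (int(part) for part in raw.split("/"))
    return (datetime.date(year, month, day) - today).days


def with_days_until(df, today):
    """Add a days_until column by running the function above as a Spark UDF."""
    # Imported here so days_until() stays usable without PySpark installed.
    from pyspark.sql import functions as F
    from pyspark.sql.types import IntegerType

    # The UDF receives one column value per row, not the whole row.
    days_until_udf = F.udf(lambda raw: days_until(raw, today), IntegerType())
    return df.withColumn("days_until", days_until_udf(F.col("Contract_Renewal")))
```

Calling with_days_until(df, datetime.date.today()) on a DataFrame that has a Contract_Renewal string column would play the role of df.apply(format_date, axis=1) from the question, at the cost of the Python serialization overhead mentioned above.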

Good luck!
