Applying a custom function to cells of selected columns of a DataFrame in PySpark

Time: 2017-07-28 07:50:16

Tags: python apache-spark pyspark spark-dataframe

Let's say my dataframe looks like this:

+---+-----------+-----------+
| id|   address1|   address2|
+---+-----------+-----------+
|  1|address 1.1|address 1.2|
|  2|address 2.1|address 2.2|
+---+-----------+-----------+
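
For reference, a dataframe like this can be built as follows (a minimal sketch, assuming a SparkSession named spark):

df = spark.createDataFrame(
    [(1, 'address 1.1', 'address 1.2'),
     (2, 'address 2.1', 'address 2.2')],
    ['id', 'address1', 'address2'])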

I want to apply a custom function directly to the strings in the address1 and address2 columns, for example:

def example(string1, string2):
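    # Count how many whitespace-separated tokens the two strings share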
    name_1 = string1.lower().split(' ')
    name_2 = string2.lower().split(' ')
    intersection_count = len(set(name_1) & set(name_2))

    return intersection_count

I would like to store the result in a new column, so that my final dataframe looks like this:

+---+-----------+-----------+------+
| id|   address1|   address2|result|
+---+-----------+-----------+------+
|  1|address 1.1|address 1.2|     2|
|  2|address 2.1|address 2.2|     7|
+---+-----------+-----------+------+

I tried to apply it the same way I would apply a built-in function to a whole column, but I got an error:

>>> df.withColumn('result', example(df.address1, df.address2))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 2, in example
TypeError: 'Column' object is not callable

What am I doing wrong, and how can I apply a custom function to the strings in the selected columns?

1 Answer:

Answer 0 (score: 2):

You have to use a udf (user-defined function) in Spark.
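A minimal sketch of that approach, reusing the example function and df from the question (the name example_udf is just illustrative): wrap the plain Python function in udf with an explicit return type, and apply the wrapped version to the columns.

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# Wrap the plain Python function so Spark can apply it row by row
# to the column values; the return type must be declared explicitly.
example_udf = udf(example, IntegerType())

df = df.withColumn('result', example_udf(df.address1, df.address2))
df.show()

The direct call example(df.address1, df.address2) fails because the function receives Column objects rather than Python strings, so string methods like lower() are not available on them, hence the TypeError. The udf wrapper defers execution so the function runs on the actual string value in each row.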