Apply a function to a PySpark DataFrame and create a new DataFrame

Date: 2018-09-19 18:45:36

Tags: python python-3.x apache-spark dataframe pyspark

I am trying to use the following code:

import requests
from pyspark.sql import Row

def callAPI(row):
    # Call a local API for one address row and wrap the JSON response in a Row
    params = {
        'street_line_1': row.street_address,
        'city': row.city,
        'state_code': row.state,
        'postal_code': row.zip_code}
    response = requests.get('http://localhost:5000', params=params, verify=False).json()
    return Row(**response)

addresses = spark.sql('''SELECT
                              street_address,
                              city,
                              state,
                              zip_code
                       FROM table''')

results = addresses.rdd.map(callAPI).toDF()

When I run this, I get the following error:

    raise ValueError("Some of types cannot be determined by the "
ValueError: Some of types cannot be determined by the first 100 rows, please try again with sampling

I also tried passing a schema with createDataFrame:

results = spark.createDataFrame(results, schema=schema)

But that gives me:

    raise TypeError("%s can not accept object %r in type %s" % (dataType, obj, type(obj)))
TypeError: IntegerType can not accept object '0000' in type <class 'str'>
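The TypeError happens because the schema declares zip_code as IntegerType while the API returns it as the string '0000'. ZIP codes are usually best kept as strings anyway, since converting them to integers silently destroys leading zeros. A quick pure-Python illustration:

```python
# Casting a ZIP code to int parses without error,
# but the leading zeros cannot be recovered afterwards.
zip_code = "0000"
as_int = int(zip_code)     # → 0
round_trip = str(as_int)   # → "0", not "0000"
print(as_int, round_trip)
```

So even if the cast were made to succeed, the column would no longer hold valid postal codes; declaring the field as StringType avoids both problems.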

My goal is to iterate over the DataFrame, apply the function to each row, and get another DataFrame back. The API returns a dictionary. Where am I going wrong?

0 answers:

There are no answers yet.