Question

I want to do something very simple, but I can't figure out how to do it in Python / Spark (1.5) / DataFrames (all of this is new to me).

Original dataset:

code| ISO | country
1   | AFG | Afghanistan state
2   | BOL | Bolivia Plurinational State

New dataset:

code| ISO | country
1   | AFG | Afghanistan
2   | BOL | Bolivia

I would like to do something like this (in pseudo-Python?):

iso_to_country_dict = {'AFG': 'Afghanistan', 'BOL': 'Bolivia'}

def mapCountry(iso,country):
    if(iso_to_country_dict[iso] is not empty):
        return iso_to_country_dict[iso]
    return country

dfg = df.select(mapCountry(df['ISO'],df['country']))

For simplicity, mapCountry could look like this:

def mapCountry(iso,country):
    if(iso=='AFG'):
        return 'Afghanistan'
    return country

But this fails with an error: ValueError: Cannot convert column into bool:

Answer 1

Well, I found a solution, but I don't know whether it is the cleanest way to do it. Any other ideas?

from pyspark.sql.functions import udf

iso_to_country_dict = {'BOL': 'Bolivia', 'HTI': 'Cape Verde', 'COD': 'Congo', 'PRK': 'Korea', 'LAO': 'Laos'}

def mapCountry(iso,country):
    if(iso in iso_to_country_dict):
        return iso_to_country_dict[iso]
    return country

# Wrap the Python function as a UDF so it can be applied to columns
mapCountry=udf(mapCountry)

dfg = df.select(df['iso'],mapCountry(df['iso'],df['country']).alias('country'),df['C2'],df['C3'],df['C4'],df['C5'])

Note: C1, C2, ... C5 are the names of all the other columns.
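
As a possible tidy-up of the approach above, the lookup can be written with dict.get, which returns the fallback value when the key is absent. This is only a sketch, not the answer's original code: the names map_country and make_country_udf are made up for illustration, the columns are assumed to be strings, and the pyspark import is deferred into the helper so that the plain function can be checked without Spark.

```python
iso_to_country_dict = {'AFG': 'Afghanistan', 'BOL': 'Bolivia'}


def map_country(iso, country):
    """Return the short name for a known ISO code, else the original country."""
    # dict.get falls back to `country` when `iso` is missing (or None).
    return iso_to_country_dict.get(iso, country)


def make_country_udf():
    """Wrap map_country as a Spark UDF (hypothetical helper; needs pyspark)."""
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType
    return udf(map_country, StringType())


# Plain-Python check of the lookup logic, no Spark required.
print(map_country('AFG', 'Afghanistan state'))  # Afghanistan
print(map_country('FRA', 'France'))             # France
```

With a DataFrame df like the one in the question, something along the lines of df.withColumn('country', make_country_udf()(df['ISO'], df['country'])) should replace the column in place and keep the remaining columns, so they do not have to be listed one by one in select.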

Answer 2

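I would like to offer a different approach. UDFs are always an option, but they are somewhat inefficient and rather cumbersome. The when and otherwise pattern can solve this problem. First, for efficiency, represent the dictionary as a DataFrame: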

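from pyspark.sql import functions as F  # F is used in the join and select below

# Assumes an existing SparkSession named `spark` (Spark 2.0+)
df_iso = spark.createDataFrame([('bol', 'Bolivia'),
                                ('hti', 'Cape-Verde'),
                                ('fra', 'France')], ['iso', 'country'])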

Then let's consider the following data:

df_data = spark.createDataFrame(
    [(x,) for x in ['fra', 'esp', 'eng', 'usa', 'bol']], ['data'])

Then we do the ISO lookup with a join:

df_data = df_data.join(df_iso, F.col('data') == F.col('iso'),
                       'left_outer')

Finally, we add the desired column (I named it result) based on whether a match was found:

df_data = df_data.select(
    F.col('data'),
    F.when(F.col('iso').isNull(), F.col('data'))
    .otherwise(F.col('country')).alias('result'))

The result will be:

+----+-------+
|data| result|
+----+-------+
| esp|    esp|
| bol|Bolivia|
| eng|    eng|
| fra| France|
| usa|    usa|
+----+-------+
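
For a small dictionary, the lookup can also stay inside a single column expression, without a join. The sketch below is one possible variant and not part of the answer above: it builds a literal map column with F.create_map (available from Spark 2.0, as far as I know) and falls back to the original value with F.coalesce. The helper names flatten_items and mapped_or_original are made up for illustration, and pyspark is imported lazily so the pure-Python part can be run on its own; the column helper itself needs an active SparkSession.

```python
from itertools import chain


def flatten_items(mapping):
    """Turn {'k1': 'v1', 'k2': 'v2'} into ['k1', 'v1', 'k2', 'v2']."""
    return list(chain.from_iterable(mapping.items()))


def mapped_or_original(mapping, key_col, fallback_col):
    """Column: mapping[key_col] if the key is present, else fallback_col.

    Hypothetical helper; requires pyspark and an active SparkSession.
    """
    from pyspark.sql import functions as F
    # create_map expects alternating key and value columns.
    lookup = F.create_map([F.lit(x) for x in flatten_items(mapping)])
    # Assumes default (non-ANSI) settings, where a missing key yields NULL,
    # so coalesce then picks the fallback column.
    return F.coalesce(lookup[F.col(key_col)], F.col(fallback_col))


# Pure-Python check of the flattening step, no Spark required.
print(flatten_items({'bol': 'Bolivia', 'fra': 'France'}))
```

Applied to the example data, this would look something like df_data.select('data', mapped_or_original({'bol': 'Bolivia', 'fra': 'France'}, 'data', 'data').alias('result')). For large lookup tables the join shown above is probably the better fit, since the whole dictionary is otherwise embedded in the query expression.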
