Merging an object with a pandas DataFrame

Time: 2020-08-03 19:23:46

Tags: python pandas dataframe

Below you can see that I have an object named westCountries, followed by a DataFrame named countryDF.

westCountries = {'West': ['US', 'CA', 'PR']}
# countryDF

      Country 
0        [US]
1        [PR]
2        [CA]
3        [HK]

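How can I include the westCountries object in the DataFrame as a new column named Location? I have tried merging, but that did not really do anything because, oddly enough, I need the values in this column to be the name of the key in the object, as shown below. Note: this output is only an example; I understand that the data I provided and the desired output do not fully correspond.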

  Country Location
0      US     West
1      CA     West

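I was thinking of doing something like:

  • Using .isin() and then doing further transformations/calculations on that DataFrame to populate it, but this approach seems a bit vague to me.
  • Using df.loc[...] to compare the DataFrame against the values in that list, so that I can create my own column with values of my choosing.
  • Converting the object to a DataFrame, creating a new column in this temporary DataFrame, and then merging on country so that I can include the locations column in my countryDF DataFrame.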
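
A minimal sketch of how the first two ideas (.isin() combined with df.loc) could fit together, assuming the Country cells are one-element lists as shown above. The names flat, region and countries are illustrative only and do not come from the original post.

```python
import pandas as pd

westCountries = {'West': ['US', 'CA', 'PR']}
countryDF = pd.DataFrame({'Country': [['US'], ['PR'], ['CA'], ['HK']]})

# flatten the one-element lists so each cell holds a plain string
flat = countryDF.explode('Country').reset_index(drop=True)

# start with no location, then write the dict key into the rows
# whose country appears in that key's list
flat['Location'] = None
for region, countries in westCountries.items():
    flat.loc[flat['Country'].isin(countries), 'Location'] = region

print(flat)
```

Rows that match no key, such as HK, keep an empty Location and can be dropped afterwards with dropna() if they are not wanted.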

However, I feel there may be a more polished solution than any of the approaches listed above, which is why I am asking for help.

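3 answers:

Answer 0: (score: 2)

  • Use pandas.DataFrame.explode to take the values out of the lists
  • Use a list comprehension to match each value against the westCountries value lists and return the key
  • In the example, the sample DataFrame column values are created as strings and need to be converted to list type with ast.literal_eval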
import pandas as pd
from ast import literal_eval  # only for setting up the test dataframe

# setup the test dataframe
data = {'Country': ["['US']", "['PR']", "['CA']", "['HK']"]}
df = pd.DataFrame(data)
df.Country = df.Country.apply(literal_eval)  # only for the test data

westCountries = {'West': ['US', 'CA', 'PR']}

# remove the values from lists, with explode
df = df.explode('Country')

# create the Loc column using apply
df['Loc'] = df.Country.apply(lambda x: [k if x in v else None for k, v in westCountries.items()][0])

# drop rows with None
df = df.dropna()

# display(df)
  Country   Loc
0      US  West
1      PR  West
2      CA  West
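
One caveat about the lookup above: the list comprehension yields one entry per key and [0] keeps only the entry for the first key, so it behaves as intended only while the dict has a single key. A hedged sketch of a variant that returns the first key that actually matches, using next() with a default; the helper name find_region and the variable regions are hypothetical, and the East entries are borrowed from the second option.

```python
import pandas as pd

regions = {'West': ['US', 'CA', 'PR'], 'East': ['NY', 'NC']}
df = pd.DataFrame({'Country': ['US', 'NY', 'HK']})


def find_region(country):
    """Return the first key whose list contains `country`, or None if no key matches."""
    return next((k for k, v in regions.items() if country in v), None)


# still a row-by-row apply, so it is no faster than the original lookup
df['Loc'] = df['Country'].apply(find_region)
print(df)
```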

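Option 2 (better):

  • In the first option, for every row, .apply has to iterate over every key-value pair in westCountries with [k if x in v else None for k, v in westCountries.items()], which is slow.
  • It is better to reshape westCountries into a flat dict with a dict comprehension, so that each country is a key and its region is the value.
  • Use pandas.Series.map to map the dict values into the new column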
import pandas as pd
from ast import literal_eval  # only for setting up the test dataframe

# setup the test dataframe
data = {'Country': ["['US']", "['PR']", "['CA']", "['HK']"]}
df = pd.DataFrame(data)
df.Country = df.Country.apply(literal_eval)  # only for the test data

# remove the values from lists, with explode
df = df.explode('Country')

# given
westCountries = {'West': ['US', 'CA', 'PR'], 'East': ['NY', 'NC']}

# unpack westCountries where all values are keys and key are values
mapped = {x: k for k, v in westCountries.items() for x in v}

# print(mapped)
# {'US': 'West', 'CA': 'West', 'PR': 'West', 'NY': 'East', 'NC': 'East'}

# map the dict to the column
df['Loc'] = df.Country.map(mapped)

# dropna
df = df.dropna()
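
If the unmatched rows should be kept instead of dropped, a possible variation is to fill the gaps that map leaves behind. This is only a sketch: the label 'Unknown' is a placeholder chosen for illustration, and the Country column is assumed to already hold plain strings.

```python
import pandas as pd

westCountries = {'West': ['US', 'CA', 'PR'], 'East': ['NY', 'NC']}
df = pd.DataFrame({'Country': ['US', 'PR', 'CA', 'HK']})

# flat lookup: country -> region
mapped = {country: region
          for region, countries in westCountries.items()
          for country in countries}

# map leaves a missing value for countries absent from the lookup;
# 'Unknown' is a hypothetical placeholder for those rows
df['Loc'] = df['Country'].map(mapped).fillna('Unknown')
print(df)
```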

Answer 1: (score: 1)

You can use pd.melt on the dict, then explode df with df.explode and join the two with df.merge.

westCountries = {'West': ['US', 'CA', 'PR']}
west = pd.melt(pd.DataFrame(westCountries), var_name='Loc', value_name='Country')

df.explode('Country').merge(west, on='Country')
  Country   Loc
0      US  West
1      PR  West
2      CA  West

Details

pd.DataFrame(westCountries)

#  West
#0   US
#1   CA
#2   PR

# Now melt the above dataframe
pd.melt(pd.DataFrame(westCountries), var_name='Loc', value_name='Country')

#    Loc Country
#0  West      US
#1  West      CA
#2  West      PR

# Now, merge `df` after exploding with `west` on `Country`
df.explode('Country').merge(west, on='Country')  # how='inner' by default in merge, so the unmatched HK row is dropped

#  Country   Loc
#0      US  West
#1      PR  West
#2      CA  West
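
merge performs an inner join unless how is given, which is why HK disappears from the result. A small sketch comparing that with how='left', assuming the same df and west as above; the names exploded, inner and left are illustrative. The indicator=True argument adds a _merge column showing where each row came from.

```python
import pandas as pd

westCountries = {'West': ['US', 'CA', 'PR']}
west = pd.melt(pd.DataFrame(westCountries), var_name='Loc', value_name='Country')
df = pd.DataFrame({'Country': [['US'], ['PR'], ['CA'], ['HK']]})

exploded = df.explode('Country')

# default inner join: only countries present in both frames survive
inner = exploded.merge(west, on='Country')

# left join: every row of the exploded frame is kept,
# and Loc is missing where there was no match
left = exploded.merge(west, on='Country', how='left', indicator=True)

print(inner)
print(left)
```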

Edit:

If the lists in your westCountries dictionary are not all the same length, try

from itertools import zip_longest

import numpy as np
import pandas as pd

westCountries = {'West': ['US', 'CA', 'PR'], 'East': ['NY', 'NC']}

west = pd.DataFrame(zip_longest(*westCountries.values(),fillvalue = np.nan),
                    columns= westCountries.keys())
west = west.melt(var_name='Loc', value_name='Country').dropna()

df.explode('Country').merge(west, on='Country')

Example of the above:

df
  Country
0    [US]
1    [PR]
2    [CA]
3    [HK]
4    [NY] #--> added `NY` from `East`.

westCountries = {'West': ['US', 'CA', 'PR'], 'East': ['NY', 'NC']}

west = pd.DataFrame(zip_longest(*westCountries.values(),fillvalue = np.nan),
                    columns= westCountries.keys())
west = west.melt(var_name='Loc', value_name='Country').dropna()
df.explode('Country').merge(west, on='Country')

#  Country   Loc
#0      US  West
#1      PR  West
#2      CA  West
#3      NY  East
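
As a possible alternative to padding with zip_longest and then melting, the long (Loc, Country) table can be built directly from the dict, which avoids the fill values and the dropna step. This is a sketch of a different technique, not the approach used above; pairs and result are illustrative names.

```python
import pandas as pd

westCountries = {'West': ['US', 'CA', 'PR'], 'East': ['NY', 'NC']}
df = pd.DataFrame({'Country': [['US'], ['PR'], ['CA'], ['HK'], ['NY']]})

# one (Loc, Country) tuple per country, regardless of list length
pairs = [(loc, country)
         for loc, countries in westCountries.items()
         for country in countries]
west = pd.DataFrame(pairs, columns=['Loc', 'Country'])

result = df.explode('Country').merge(west, on='Country')
print(result)
```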

Answer 2: (score: 0)

This may not be the fastest approach in terms of runtime, but it works

import pandas as pd

westCountries = {'West': ['US', 'CA', 'PR']}
df = pd.DataFrame(["[US]","[PR]", "[CA]", "[HK]"], columns=["Country"])

df = df.assign(Location="")
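for index, row in df.iterrows():
    # substring check, because each cell is a string such as "[US]"
    if any(country in row['Country'] for country in westCountries.get('West')):
        # write through df.loc; assigning to `row` may only change a copy
        df.loc[index, 'Location'] = 'West'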

west_df = df[df['Location'] != ""]
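
For comparison, a loop-free sketch of the same idea, assuming each Country cell is a string of the form "[US]" as in this answer. It strips the brackets with the .str accessor and maps through a flat lookup dict; lookup is an illustrative name.

```python
import pandas as pd

westCountries = {'West': ['US', 'CA', 'PR']}
df = pd.DataFrame(["[US]", "[PR]", "[CA]", "[HK]"], columns=["Country"])

# flat lookup: country -> region
lookup = {country: region
          for region, countries in westCountries.items()
          for country in countries}

# strip the surrounding brackets, then look each country up;
# unmatched countries get a missing value
df['Location'] = df['Country'].str.strip('[]').map(lookup)

west_df = df[df['Location'].notna()]
print(west_df)
```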