Question

In the example below, I am trying to create a new column, df1['new']. I want to look up each value of df1['city'] and check whether it is a substring of any row in df2['des']. If it is, I want df1['new'] to hold the matching value of df2['des'] (in this example, a description of the city).

df1['city']:

        city
0   New York
1  Amsterdam
2     London
3    Karachi

df2['des']:

    des
0   London is the capital and ...
1   Amsterdam and New York are two...
2   Karachi is the capital of...

This is what I want:

        city                                  new
0   New York    Amsterdam and New York are two...
1  Amsterdam    Amsterdam and New York are two...
2     London        London is the capital and ...
3    Karachi         Karachi is the capital of...

So far, the closest I have come to solving this is:

df['new'] = df.loc[df.des.str.contains("London"), 'des']

Which outputs:

        city                            new
0   New York                            NaN
1  Amsterdam                            NaN
2     London  London is the capital and ...
3    Karachi                            NaN

Instead of passing only "London" in the condition, I want to pass the entire series df1['city']. If I do that, I get an error.

Answer 1

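I am assuming that when a city matches more than one description, you only want one of the matches. Otherwise, any solution becomes more complicated.

With problems like this, it is usually better to iterate over the cities and use pd.Series.str.contains than to iterate over rows. For example, you can build a dictionary: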

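# Map each city to the first description that contains it (plain substring match).
# Note: .iat[0] raises IndexError if no description contains a given city.
d = {city: df2.loc[df2['des'].str.contains(city, regex=False), 'des'].iat[0]
     for city in df1['city']}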

Then map it onto df1 with pd.Series.map:

df1['des'] = df1['city'].map(d).fillna('No match found!')
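
For reference, here is a minimal, self-contained sketch of the dictionary-plus-map approach that also tolerates cities with no match, so the fillna fallback is actually reached. The frames are rebuilt from the question; 'Paris' and the helper name first_match are hypothetical additions for illustration, not part of the original post.

```python
import pandas as pd

# Sample data rebuilt from the question. 'Paris' is a hypothetical extra
# city with no description, added to exercise the no-match path.
df1 = pd.DataFrame({'city': ['New York', 'Amsterdam', 'London', 'Karachi', 'Paris']})
df2 = pd.DataFrame({'des': ['London is the capital and ...',
                            'Amsterdam and New York are two...',
                            'Karachi is the capital of...']})


def first_match(city, descriptions):
    """Return the first description containing `city`, or None if none does.

    Hypothetical helper; regex=False makes this a literal substring test.
    """
    hits = descriptions[descriptions.str.contains(city, regex=False)]
    return hits.iat[0] if not hits.empty else None


# One pass over the cities; each pass is a vectorised scan of df2['des'].
d = {city: first_match(city, df2['des']) for city in df1['city']}

# Cities mapped to None are treated as missing and replaced by the fallback.
df1['des'] = df1['city'].map(d).fillna('No match found!')
print(df1)
```

Checking for an empty result before calling .iat[0] is one way to avoid the IndexError; wrapping the lookup in try/except would work as well.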

Answer 2

Another solution, using a list comprehension:

df1['new'] = [next((i for i in df2['des'] if x in i), 'Not found!') for x in df1['city']]
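
The one-liner above relies on next() with a default value: the generator yields descriptions that contain the city, next() takes the first one, and the default is returned when there is none. A small runnable sketch of that idea, assuming the sample data from the question plus a hypothetical unmatched city 'Paris' and a hypothetical helper named lookup:

```python
import pandas as pd

# Sample data rebuilt from the question; 'Paris' is a hypothetical
# extra row used to show the default value being returned.
df1 = pd.DataFrame({'city': ['New York', 'Amsterdam', 'London', 'Karachi', 'Paris']})
df2 = pd.DataFrame({'des': ['London is the capital and ...',
                            'Amsterdam and New York are two...',
                            'Karachi is the capital of...']})

descriptions = df2['des'].tolist()


def lookup(city):
    # The generator is lazy, so next() stops scanning at the first
    # description that contains `city`; the second argument is the
    # value returned when no description matches.
    return next((text for text in descriptions if city in text), 'Not found!')


df1['new'] = [lookup(city) for city in df1['city']]
print(df1)
```

Note that the `in` test is a case-sensitive literal substring check, so 'london' would not match 'London is the capital and ...'.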

Another one, using a regular expression and str.extractall:

matches = df2['des'].str.extractall('({})'.format('|'.join(df1['city']))).reset_index(0)
m = matches.set_index(0)['level_0'].map(df2['des'])
df1['new'] = df1['city'].map(m).fillna('No match!')
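
The extractall version can be easier to follow with the intermediate steps spelled out. The sketch below follows the same three steps on the question's sample data, with two additions that are my own assumptions rather than part of the answer: city names are passed through re.escape so that regex metacharacters in a name are matched literally, and a hypothetical unmatched city 'Paris' is included.

```python
import re

import pandas as pd

# Sample data rebuilt from the question; 'Paris' is a hypothetical
# city with no description.
df1 = pd.DataFrame({'city': ['New York', 'Amsterdam', 'London', 'Karachi', 'Paris']})
df2 = pd.DataFrame({'des': ['London is the capital and ...',
                            'Amsterdam and New York are two...',
                            'Karachi is the capital of...']})

# Step 1: one alternation pattern with a single capture group.
# re.escape is an addition here, not part of the original answer.
pattern = '({})'.format('|'.join(re.escape(city) for city in df1['city']))

# Step 2: extractall returns one row per match, indexed by
# (row label in df2, match number). reset_index(0) turns the row label
# into a column, which pandas names 'level_0' because df2's index is unnamed.
matches = df2['des'].str.extractall(pattern).reset_index(0)

# Step 3: build a city -> description lookup, then map it onto df1.
m = matches.set_index(0)['level_0'].map(df2['des'])
df1['new'] = df1['city'].map(m).fillna('No match!')
print(df1)
```

This sketch assumes each city is found in at most one description; if a city appeared in several, the lookup series m would have duplicate index entries and would need de-duplicating before the final map.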

For each value in a series, return the value from another pandas series if the series1 value is a substring in series2
