Question

I have the following pandas code, in which I am trying to replace country names with the string <country>.

df['title_type2'] = df['title_type']
countries = open(r'countries.txt').read().splitlines()    # Reads all lines into a list and removes \n.
countries = [country.replace(' ', r'\s') for country in countries]
pattern = r'\b' + '|'.join(countries) + r'\b'
df['title_type2'].str.replace(pattern, '<country>')

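However, I can't get countries whose names contain a space (e.g. South Korea) to work: they are not replaced. The problem seems to be that my \s turns into \\s. How can I avoid this, or otherwise solve the problem?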

Answer 1

There is no need to replace the spaces with \s.
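
A possible explanation for the \\s seen in the question, offered as an assumption because the question does not show where it appeared: Python's repr of a string doubles each backslash for display, while the string itself still holds a single one. A minimal sketch using only the standard library (the variable names are illustrative):

```python
import re

# r'\s' is two characters: one backslash followed by the letter s.
escaped = 'South Korea'.replace(' ', r'\s')

print(len(r'\s'))      # 2
print(repr(escaped))   # the repr doubles the backslash for display only
print(escaped)         # printing shows the single backslash

# A literal space in a pattern already matches a space in the text,
# so both forms find the same match here.
plain_match = re.search(r'\bSouth Korea\b', 'Zxx South Korea')
escaped_match = re.search(r'\b' + escaped + r'\b', 'Zxx South Korea')
```

Both searches succeed, which suggests the doubled backslash is a display detail rather than the cause of the failed replacement.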

Your pattern should include:

\b - the "starting" word boundary,
(?:...|...|...) - a non-capturing group containing the country names as alternatives,
\b - the "ending" word boundary.

Something like:

pattern = r'\b(?:China|South Korea|Taiwan)\b'
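
To illustrate why the group matters, here is a small sketch with the standard re module. The sample text is made up for the illustration. Without the group, the alternation splits the whole pattern, so each \b applies to only one alternative:

```python
import re

text = 'Chinatown, Taiwan'

# Without a group the alternatives are r'\bChina', r'South Korea' and r'Taiwan\b',
# so "China" is not required to end at a word boundary.
ungrouped = r'\bChina|South Korea|Taiwan\b'
# With a non-capturing group, both \b anchors apply to every alternative.
grouped = r'\b(?:China|South Korea|Taiwan)\b'

ungrouped_result = re.sub(ungrouped, '<country>', text)
grouped_result = re.sub(grouped, '<country>', text)
print(ungrouped_result)  # <country>town, <country>
print(grouped_result)    # Chinatown, <country>
```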

Then you can perform the replacement:

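# Returns a new Series; regex=True is required in pandas 2.0+, where the default is literal matching.
df['title_type2'].str.replace(pattern, '<country>', regex=True)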

I created the test data as follows:

import pandas as pd

df = pd.DataFrame(['Abc Taiwan', 'Xyz China', 'Zxx South Korea', 'No country name'],
    columns=['title_type'])
df['title_type2'] = df['title_type']

and got:

0      Abc <country>
1      Xyz <country>
2      Zxx <country>
3    No country name
Name: title_type2, dtype: object
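
If the names come from a file, as in the question, the same pattern can be built from the list. The following is only a sketch under some assumptions: the countries list is a hypothetical stand-in for the lines of countries.txt, the sample rows are made up, and the use of re.escape and longest-first ordering are additions that are not part of the original answer.

```python
import re
import pandas as pd

# Hypothetical stand-in for the lines read from countries.txt.
countries = ['China', 'Guinea', 'Guinea-Bissau', 'South Korea', 'Taiwan']

# Longer names first, so that 'Guinea-Bissau' is tried before its prefix 'Guinea';
# re.escape protects names that contain regex metacharacters.
ordered = sorted(countries, key=len, reverse=True)
pattern = r'\b(?:' + '|'.join(re.escape(name) for name in ordered) + r')\b'

df = pd.DataFrame({'title_type': ['Abc Taiwan', 'Zxx South Korea',
                                  'Visit Guinea-Bissau', 'No country name']})
# Assign the result back: str.replace returns a new Series.
df['title_type2'] = df['title_type'].str.replace(pattern, '<country>', regex=True)
print(df['title_type2'].tolist())
```

Names containing spaces need no special handling here; re.escape may escape the space, but the escaped form still matches a literal space.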

Pandas regex replace with multiple values and spaces in the values
