Question

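So I may have a string such as 'Bank of China', 'Embassy of China' or 'International China'.

I want to remove every instance of a country, except when it is preceded by 'of ' or 'of the '.

Clearly this can be done by iterating through a list of countries, checking whether the name contains a country, and then checking whether 'of ' or 'of the ' appears before that country.

If one of these does appear, we do not remove the country; otherwise we do. The examples would become:

'Bank of China', 'Embassy of China', 'International'

However, iterating can be slow, particularly when you have a large list of countries and a large list of texts to process.

Is there a faster, more conditional way of replacing the string, so that I can still use a simple pattern match with the Python re library?

My function is along these lines: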

def removeCountry(name):
    for country in countries:
        if country in name:
            if 'of ' + country in name:
                return name
            if 'of the ' + country in name:
                return name
            else:
                name =  re.sub(country + '$', '', name).strip()
                return name
    return name

EDIT: I did find some information here. It describes how to do an "if", but what I really want is "if not 'of ' and if not 'of the ', then replace"...

Answer 1

I think you could use the approach from "Python: how to determine if a list of words exist in a string" to find any countries mentioned, then do further processing from there.

Something like:

countries = [
    "Afghanistan",
    "Albania",
    "Algeria",
    "Andorra",
    "Angola",
    "Anguilla",
    "Antigua",
    "Arabia",
    "Argentina",
    "Armenia",
    "Aruba",
    "Australia",
    "Austria",
    "Azerbaijan",
    "Bahamas",
    "Bahrain",
    "China",
    "Russia"
    # etc
]

def find_words_from_set_in_string(set_):
    set_ = set(set_)
    def words_in_string(s):
        return set_.intersection(s.split())
    return words_in_string

get_countries = find_words_from_set_in_string(countries)

Then

get_countries("The Embassy of China in Argentina is down the street from the Consulate of Russia")

returns

set(['Argentina', 'China', 'Russia'])

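... which obviously needs more post-processing, but it tells you very quickly what you need to look for.

As pointed out in the linked article, you have to be wary of words ending in punctuation. Note that s.split(" \t\r\n,.!?;:'\"") would not help, because str.split treats its argument as one literal separator rather than a set of characters; stripping punctuation from each word, or tokenizing with a regular expression, handles this instead. You might also want to look for adjectival forms, i.e. "Russian", "Chinese", etc.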
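A minimal sketch of a punctuation-aware variant, assuming that a word is any run of \w characters. The name find_words_ignoring_punctuation and the regex tokenizer are illustrative choices, not something taken from the linked post:

```python
import re

def find_words_ignoring_punctuation(words):
    """Return a function that reports which of `words` occur in a string."""
    wanted = set(words)

    def words_in_string(s):
        # \w+ keeps runs of letters/digits/underscores and drops punctuation,
        # so "China," and "Russia." are seen as "China" and "Russia".
        return wanted.intersection(re.findall(r"\w+", s))

    return words_in_string

get_countries = find_words_ignoring_punctuation(["Argentina", "China", "Russia"])
found = get_countries("The Embassy of China, in Argentina; near Russia.")
print(sorted(found))  # ['Argentina', 'China', 'Russia']
```

Like the split-based version, this only finds single-word names; a name such as "New Zealand" would need a different tokenization.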

Answer 2

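You could compile a couple of sets of regular expressions and then pass your input through them. Something like:

import re

countries = ['foo', 'bar', 'baz']
# re.escape guards against regex metacharacters in a country name
takes = [re.compile(r'of\s+(the)?\s*%s$' % re.escape(c), re.I) for c in countries]
subs = [re.compile(r'%s$' % re.escape(c), re.I) for c in countries]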

def remove_country(s):
    for regex in takes:
        if regex.search(s):
            return s
    for regex in subs:
        s = regex.sub('', s)
    return s

print(remove_country('the bank of foo'))
print(remove_country('the bank of the baz'))
print(remove_country('the nation bar'))

''' Output:
    the bank of foo
    the bank of the baz
    the nation
'''

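It does not look like anything faster than linear time complexity is possible here. At least you can avoid recompiling the regular expressions a million times and improve the constant factor.

Edit: I had a few typos, but the basic idea is sound and it works. I've added an example.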
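One possible way to shrink the constant factor further is to fold all countries into a single alternation, so that each input is checked against two compiled patterns instead of two per country. This is a sketch under that assumption; the name remove_country_combined and the added \b word boundaries are not part of the code above:

```python
import re

countries = ['foo', 'bar', 'baz']  # placeholder names, as in the answer above

# One alternation covering every country, escaped in case a name contains
# regex metacharacters.
alternation = '|'.join(re.escape(c) for c in countries)

# Matches "of <country>" or "of the <country>" at the end of the string.
keep = re.compile(r'\bof\s+(?:the\s+)?(?:%s)$' % alternation, re.I)
# Matches a bare country name (and the whitespace before it) at the end.
strip_country = re.compile(r'\s*\b(?:%s)$' % alternation, re.I)

def remove_country_combined(s):
    if keep.search(s):
        return s
    return strip_country.sub('', s)

print(remove_country_combined('the bank of foo'))      # the bank of foo
print(remove_country_combined('the bank of the baz'))  # the bank of the baz
print(remove_country_combined('the nation bar'))       # the nation
```

The work is still linear in the length of the input; the alternation only moves the per-country loop out of Python and into the regex engine.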

Answer 3

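The re.sub function accepts a function as the replacement; it is called for each match to get the text that should be substituted for that match. So you could do this: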

import re

def make_regex(countries):
    escaped = (re.escape(country) for country in countries)
    states = '|'.join(escaped)
    return re.compile(r'\s+(of(\sthe)?\s)?(?P<state>{})'.format(states))

def remove_name(match):
    name = match.group()
    if name.lstrip().startswith('of'):
        return name
    else:
        return name.replace(match.group('state'), '').strip()

regex = make_regex(['China', 'Italy', 'America'])
regex.sub(remove_name, 'Embassy of China, International Italy').strip()
# result: 'Embassy of China, International'

The result might contain some spurious whitespace (in the example above a final strip() is needed). You can fix this by changing the regex to:

\s*(of(\sthe)?\s)?(?P<state>({}))

which also captures the whitespace before the "of" or before the country name and so avoids the bad spacing in the output.
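A sketch of how the modified pattern might be wired up, under two assumptions that go beyond the answer: \b word boundaries are added so that a name is not matched inside a longer word, and the callback tests the optional "of" group instead of calling startswith:

```python
import re

def make_regex(countries):
    escaped = '|'.join(re.escape(c) for c in countries)
    # \s* also swallows the whitespace in front of the match; the \b anchors
    # (an addition of this sketch) keep "China" from matching in "Chinatown".
    return re.compile(r'\s*\b(of(\sthe)?\s)?(?P<state>{})\b'.format(escaped))

def remove_name(match):
    if match.group(1):        # "of " or "of the " was present: keep the text
        return match.group()
    return ''                 # drop the country and the whitespace before it

regex = make_regex(['China', 'Italy', 'America'])
print(regex.sub(remove_name, 'Embassy of China, International Italy'))
# Embassy of China, International
print(regex.sub(remove_name, 'A walk through Chinatown'))
# A walk through Chinatown
```

Because the leading whitespace is part of the match, no trailing strip() is needed when a country at the end of the text is removed.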

Note that this solution can handle a whole text, not just text of the form "Something of Country" and "Something Country". For example:

In [38]: regex = make_regex(['China'])
    ...: text = '''This is more complex than just "Embassy of China" and "International China"'''

In [39]: regex.sub(remove_name, text)
Out[39]: 'This is more complex than just "Embassy of China" and "International"'

Another example usage:

In [33]: countries = [
    ...:     'China', 'India', 'Denmark', 'New York', 'Guatemala', 'Sudan',
    ...:     'France', 'Italy', 'Australia', 'New Zealand', 'Brazil', 
    ...:     'Canada', 'Japan', 'Vietnam', 'Middle-Earth', 'Russia',
    ...:     'Spain', 'Portugal', 'Argentina', 'San Marino'
    ...: ]

In [34]: template = 'Embassy of {0}, International {0}, Language of {0} is {0}, Government of {0}, {0} capital, Something {0} and something of the {0}.'

In [35]: text = 100 * '\n'.join(template.format(c) for c in countries)

In [36]: regex = make_regex(countries)
    ...: result = regex.sub(remove_name, text)

In [37]: result[:150]
Out[37]: 'Embassy of China, International, Language of China is, Government of China, capital, Something and something of the China.\nEmbassy of India, Internati'

Answer 4

Untested:

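import re

def removeCountry(name):
    for country in countries:
        # re only allows fixed-width lookbehinds, so use one per prefix
        name = re.sub('(?<!of )(?<!of the )' + re.escape(country) + '$', '', name).strip()
    return name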

This uses a negative lookbehind, so re.sub only matches and replaces when the country is not preceded by 'of ' or 'of the '.
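Python's re module only accepts fixed-width lookbehinds, so an optional group inside (?<!...) is rejected when the pattern is compiled. A sketch of one way around that, using one lookbehind per prefix and a single precompiled alternation; the names make_remover and remove_country are illustrative:

```python
import re

def make_remover(countries):
    alternation = '|'.join(re.escape(c) for c in countries)
    # Two fixed-width lookbehinds stand in for the optional "the ".
    pattern = re.compile(r'(?<!of )(?<!of the )\b(?:%s)$' % alternation)

    def remove_country(name):
        return pattern.sub('', name).strip()

    return remove_country

remove_country = make_remover(['China', 'Italy'])
print(remove_country('Bank of China'))        # Bank of China
print(remove_country('Bank of the China'))    # Bank of the China
print(remove_country('International China'))  # International

# The variable-width form is rejected by the re module:
try:
    re.compile(r'(?<!of (the )?)China$')
except re.error as exc:
    print('re.error:', exc)
```

As in the question, the pattern is anchored with $, so only a country at the end of the name is removed.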
