Question

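I am trying to locate items in sentences with a regular expression, where one item is a substring of another, but it always finds the substring. For example, there are two items ["The Duke", "The Duke of A"] and some sentences: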

The Duke

The Duke is a movie.

How is the movie The Duke?

The Duke of A

The Duke of A is a movie.

How is the movie The Duke of A?

What I want after locating them is:

The_Duke

The_Duke is a movie.

How is the movie The_Duke?

The_Duke_of_A

The_Duke_of_A is a movie.

How is the movie The_Duke_of_A?

The code I tried is:

for sent in sentences:
    for item in ["The Duke", "The Duke of A"]:
        find = re.search(r'{0}'.format(item), sent)
        if find:
           sent = sent.replace(sent[find.start():find.end()], item.replace(" ", "_"))

But I get:

The_Duke

The_Duke is a movie.

How is the movie The_Duke?

The_Duke of A

The_Duke of A is a movie.

How is the movie The_Duke of A?

Changing the order of the items in the list does not suit my case, because my list is very large (more than 10,000 items).

Answer 1

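You can use re.sub, and repl can be a function, so you only need to replace the spaces in each match. Put the longer alternative first in the pattern, because alternation tries the alternatives from left to right and stops at the first one that matches.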

import re

with open("filename.txt") as sentences:
    for line in sentences:
        print(re.sub(r"The Duke of A|The Duke",
                     lambda s: s[0].replace(' ', '_'),
                     line))

The result is:

The_Duke

The_Duke is a movie.

How is the movie The_Duke?

The_Duke_of_A

The_Duke_of_A is a movie.

How is the movie The_Duke_of_A?
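
For a large list (the question mentions more than 10,000 items), the pattern does not have to be written by hand. The following is a minimal sketch, assuming the items are plain strings to be matched literally; the helper names build_pattern and underscore_items are hypothetical and not part of the original answer. It sorts the items longest first and escapes them before joining them into one alternation:

```python
import re

def build_pattern(items):
    """Compile one alternation that prefers longer items over their prefixes."""
    # Longest first: alternation takes the first alternative that matches,
    # so "The Duke of A" must be listed before "The Duke".
    ordered = sorted(items, key=len, reverse=True)
    # re.escape keeps any regex metacharacters in the items literal.
    return re.compile("|".join(re.escape(item) for item in ordered))

def underscore_items(text, pattern):
    """Replace the spaces inside every matched item with underscores."""
    return pattern.sub(lambda m: m.group(0).replace(" ", "_"), text)

items = ["The Duke", "The Duke of A"]  # original order is left untouched
pattern = build_pattern(items)
print(underscore_items("How is the movie The Duke of A?", pattern))
# How is the movie The_Duke_of_A?
```

Sorting a copy leaves the original list as it is, and compiling the pattern once avoids rebuilding it for every sentence.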

Answer 2

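What you are doing is looking for "The Duke" first. If re finds a match, you replace it with "The_Duke". The second pass of the loop then looks for "The Duke of A", but because of the earlier change it can no longer find a match.

This should work: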

for sent in sentences:
    # Try the longer item first so the shorter one cannot rewrite part of it.
    for item in ["The Duke of A", "The Duke"]:
        find = re.search(r'{0}'.format(item), sent)
        if find:
            sent = sent.replace(sent[find.start():find.end()], item.replace(" ", "_"))
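
The failure with the original order can be reproduced on a single sentence. This is only an illustrative sketch of the explanation above, using one of the sentences from the question:

```python
import re

sent = "How is the movie The Duke of A?"

# Pass 1: the shorter item matches inside the longer phrase and is rewritten.
sent = sent.replace("The Duke", "The_Duke")
print(sent)  # How is the movie The_Duke of A?

# Pass 2: the longer item no longer occurs, because its first space is gone.
print(re.search("The Duke of A", sent))  # None
```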

Answer 3

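If you cannot change the order of the items in the list, you can try this version. In the first pass we collect all matches, and in the second pass we do the replacements: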

data = '''The Duke
The Duke is a movie.
How is the movie The Duke?
The Duke of A
The Duke of A is a movie.
How is the movie The Duke of A?'''

terms = ["The Duke", "The Duke of A"]

import re

to_change = []
for t in terms:
    for g in re.finditer(t, data):
        to_change.append((g.start(), g.end()))

for (start, end) in to_change:
    data = data[:start] + re.sub(r'\s', r'_', data[start:end]) + data[end:]

print(data)

Prints:

The_Duke
The_Duke is a movie.
How is the movie The_Duke?
The_Duke_of_A
The_Duke_of_A is a movie.
How is the movie The_Duke_of_A?
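
The two-pass version works in any list order because swapping a space for an underscore never changes the length of the text, so the offsets recorded in the first pass are still valid in the second. A minimal sketch of the same idea as a reusable function follows; the name underscore_terms is hypothetical, and re.escape is added on the assumption that the terms should be matched literally:

```python
import re

def underscore_terms(text, terms):
    """Underscore the spaces inside every occurrence of every term."""
    # Pass 1: record the span of each match, whatever order the terms are in.
    spans = [m.span() for t in terms for m in re.finditer(re.escape(t), text)]
    # Pass 2: a one-for-one character swap keeps the length fixed,
    # so the spans recorded above still point at the right characters.
    for start, end in spans:
        text = text[:start] + text[start:end].replace(" ", "_") + text[end:]
    return text

print(underscore_terms("The Duke of A is a movie.", ["The Duke", "The Duke of A"]))
# The_Duke_of_A is a movie.
```

If the replacement changed the length of a match, the recorded offsets would drift and this approach would need adjusting.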

Answer 4

Swap the positions of "The Duke of A" and "The Duke":

for item in ["The Duke", "The Duke of A"]:

becomes

for item in ["The Duke of A", "The Duke"]:

How to find a string and its substring in a sentence
