Question

I am processing hundreds of documents and writing a function that finds specific words and their values and returns a list of dictionaries.

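I am looking for one particular piece of information (a "city" and the number that refers to it). However, some documents contain a single city, while others may contain twenty or even a hundred, so I need something quite generic.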

A text example (the parentheses really are run together like this):

text = 'The territory of modern Hungary was for centuries inhabited by a succession of peoples, including Celts, Romans, Germanic tribes, Huns, West Slavs and the Avars. The foundations of the Hungarian state was established in the late ninth century AD by the Hungarian grand prince Árpád following the conquest of the Carpathian Basin. According to previous census City: Budapest (population was: 1,590,316)Debrecen (population was: 115,399)Szeged (population was: 104,867)Miskolc (population was: 109,841). However etc etc'

or

text2 = 'About medium-sized cities such as City: Eger (population was: 32,352). However etc etc'

Using a regex, I found the string I need:

p = regex.compile(r'(?<=City).(.*?)(?=However)')
m = p.findall(text)

This returns the whole matched span as a single-element list:

[' Budapest (population was: 1,590,316)Debrecen (population was: 115,399)Szeged (population was: 104,867)Miskolc (population was: 109,841). ']

This is where I am stuck, and I do not know how to proceed. Should I use regex.findall or regex.finditer?

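Since the number of cities varies from document to document, I would like to get back a list of dictionaries. If I run it on text2, I would get: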

d = [{'cities': 'Eger', 'population': '32,352'}]

If I run it on the first text:

d = [{'cities': 'Szeged', 'population': '104,867'}, {'cities': 'Miskolc', 'population': '109,841'}]

I would really appreciate any help!

Answer 1

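You can use re.finditer on the matched text with a regex that contains named capturing groups (named after your keys), and call x.groupdict() on each match to get the resulting dictionaries: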

import re
text = 'The territory of modern Hungary was for centuries inhabited by a succession of peoples, including Celts, Romans, Germanic tribes, Huns, West Slavs and the Avars. The foundations of the Hungarian state was established in the late ninth century AD by the Hungarian grand prince Árpád following the conquest of the Carpathian Basin. According to previous census City: Budapest (population was: 1,590,316)Debrecen (population was: 115,399)Szeged (population was: 104,867)Miskolc (population was: 109,841). However etc etc'
p = re.compile(r'City:\s*(.*?)However')
p2 = re.compile(r'(?P<city>\w+)\s*\([^()\d]*(?P<population>\d[\d,]*)')
m = p.search(text)
if m:
    print([x.groupdict() for x in p2.finditer(m.group(1))])

# => [{'population': '1,590,316', 'city': 'Budapest'}, {'population': '115,399', 'city': 'Debrecen'}, {'population': '104,867', 'city': 'Szeged'}, {'population': '109,841', 'city': 'Miskolc'}]

See the Python 3 demo online.
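Since the question asks for a function that can be run over many documents, the two-step approach could be wrapped roughly as follows. This is only a sketch: the function name extract_cities, the constant names, and the re.DOTALL flag (added on the assumption that a document may contain line breaks) are not part of the original answer.

```python
import re

# Hypothetical names; the "City:" and "However" delimiters come from the sample texts
SECTION = re.compile(r'City:\s*(.*?)However', re.DOTALL)
ENTRY = re.compile(r'(?P<city>\w+)\s*\([^()\d]*(?P<population>\d[\d,]*)')

def extract_cities(text):
    """Return a list of {'city': ..., 'population': ...} dicts, or [] if no section is found."""
    section = SECTION.search(text)
    if section is None:
        return []
    # Each match object turns into one dictionary keyed by the group names
    return [m.groupdict() for m in ENTRY.finditer(section.group(1))]

text2 = 'About medium-sized cities such as City: Eger (population was: 32,352). However etc etc'
print(extract_cities(text2))  # [{'city': 'Eger', 'population': '32,352'}]
```

The dictionary keys follow the group names, so renaming the group to (?P<cities>...) would give the 'cities' key used in the question.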

The second regex, p2, is

(?P<city>\w+)\s*\([^()\d]*(?P<population>\d[\d,]*)

See the regex demo.

Here,

(?P<city>\w+) - Group "city": 1+ word chars
\s*\( - 0+ whitespaces and a (
[^()\d]* - any 0+ chars other than (, ) and digits
(?P<population>\d[\d,]*) - Group "population": a digit followed by 0+ digits and/or commas
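The same breakdown can be kept next to the pattern itself by compiling it with re.VERBOSE, which ignores whitespace in the pattern and allows comments. A minimal sketch, where the names p2_verbose and sample are made up for illustration:

```python
import re

p2_verbose = re.compile(r"""
    (?P<city>\w+)               # group "city": one or more word characters
    \s*\(                       # optional whitespace, then a literal opening parenthesis
    [^()\d]*                    # any characters except parentheses and digits
    (?P<population>\d[\d,]*)    # group "population": a digit, then digits or commas
""", re.VERBOSE)

sample = 'Eger (population was: 32,352)'
match = p2_verbose.search(sample)
print(match.groupdict())  # {'city': 'Eger', 'population': '32,352'}
```

Note that \w+ does not match spaces or hyphens, so a multi-word city name would only be captured from its last word onward.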

You might try running the p2 regex on the whole original string (see demo), but it may overmatch.
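To illustrate the overmatching risk, here is a small sketch with an invented sample sentence (not taken from the question) in which an unrelated parenthesis contains a number. Running p2 on the whole string picks it up, while restricting p2 to the text captured by p does not:

```python
import re

p = re.compile(r'City:\s*(.*?)However')
p2 = re.compile(r'(?P<city>\w+)\s*\([^()\d]*(?P<population>\d[\d,]*)')

# Made-up sample: the first parenthesis is not a city entry
sample = 'The state (founded around 1000) grew. City: Eger (population was: 32,352). However etc'

# p2 applied to everything also matches "state (founded around 1000"
whole = [m.groupdict() for m in p2.finditer(sample)]
# p2 applied only to the section between "City:" and "However"
scoped = [m.groupdict() for m in p2.finditer(p.search(sample).group(1))]

print(whole)   # [{'city': 'state', 'population': '1000'}, {'city': 'Eger', 'population': '32,352'}]
print(scoped)  # [{'city': 'Eger', 'population': '32,352'}]
```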

Answer 2

A great answer from @Wiktor. Since I had already spent some time on this, I am posting my answer as well.

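import re

d = [' Budapest (population was: 1,590,316)Debrecen (population was: 115,399)Szeged (population was: 104,867)Miskolc (population was: 109,841). ']
oo = []
# Each chunk before a ")" holds one "City (population was: N" entry
for i in d[0].split(")"):
    jj = re.search("[0-9,]+", i)
    if jj:
        # The first whitespace-separated token is the city name;
        # unpacking only after a match avoids an error on empty chunks
        kk, *xx = i.split()
        oo.append({"cities": kk, "population": jj.group()})
print(oo)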

#Result--> [{'cities': 'Budapest', 'population': '1,590,316'}, {'cities': 'Debrecen', 'population': '115,399'}, {'cities': 'Szeged', 'population': '104,867'}, {'cities': 'Miskolc', 'population': '109,841'}]
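As a variation on the loop above, re.findall can also be used directly: when a pattern has more than one capturing group, findall returns a list of tuples, which can be zipped with the desired keys. This is a sketch with a shortened version of the sample list; the names pair and pairs are introduced here for illustration only.

```python
import re

# Two unnamed groups: city name, then population
pair = re.compile(r'(\w+)\s*\([^()\d]*(\d[\d,]*)')
d = [' Budapest (population was: 1,590,316)Debrecen (population was: 115,399)']

# findall yields one (city, population) tuple per entry
pairs = pair.findall(d[0])
oo = [dict(zip(('cities', 'population'), t)) for t in pairs]
print(oo)
# [{'cities': 'Budapest', 'population': '1,590,316'}, {'cities': 'Debrecen', 'population': '115,399'}]
```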

How to return a list of dictionaries from regex.findall matches?
