line1 = " The median income for a household in the city was $64,411, and the median income for a family was $78,940. The per capita income for the city was $22,466. About 4.3% of families and 5.9% of the population were below the poverty line, including 7.0% of those under age 18 and 12.3% of those age 65 or over."
line2 = " The median income for a household in the city was $31,893, and the median income for a family was $38,508. Males had a median income of $30,076 versus $20,275 for females. The per capita income for the city was $16,336. About 14.1% of families and 16.7% of the population were below the poverty line, including 21.8% of those under age 18 and 21.0% of those age 65 or over."
Expected output:
household median income: $64,411
family median income: $78,940
per capita income: $22,466
[householdIncome, familyIncome, perCapitalIncome] = re.findall(r"\d+,\d+", line1)
This works well for line1. For line2:
ValueError: too many values to unpack (expected 3)
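The main goal is to identify the first number/value that follows each keyword.
Some lines have no per capita income; I am fine with getting "" for those.
Answer 0 (score: 2)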
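Running re.findall(r"\d+,\d+", line2)
gives ['31,893', '38,508', '30,076', '20,275', '16,336'].
So the immediate problem is that the regular expression returns five results while you unpack into only three names. There is a slightly deeper problem, though. Comparing the two sentences, I see that they are structured differently. In the first, the household, family and per capita incomes do appear first, but in the second that is not the case: the male and female medians come before the per capita income. I would say you need somewhat more sophisticated sentence analysis.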
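One way to do that analysis is to search for each keyword separately and take the first dollar amount that follows it. The following is a minimal sketch, not a definitive solution: the `KEYWORDS` mapping and the helper names are hypothetical, and it assumes the amount appears in the same sentence as the keyword (before the next period).

```python
import re

# Hypothetical mapping from the output labels to the keyword searched for.
KEYWORDS = {
    "household median income": "household",
    "family median income": "family",
    "per capita income": "per capita",
}

def first_amount_after(keyword, text):
    """Return the first dollar amount after `keyword`, or "" if there is none."""
    # [^$.]* stops at the next period, so the amount must share the keyword's sentence.
    m = re.search(re.escape(keyword) + r"[^$.]*(\$\d+(?:,\d{3})*)", text)
    return m.group(1) if m else ""

def extract_incomes(text):
    return {label: first_amount_after(kw, text) for label, kw in KEYWORDS.items()}

line2 = ("The median income for a household in the city was $31,893, and the "
         "median income for a family was $38,508. Males had a median income of "
         "$30,076 versus $20,275 for females. The per capita income for the "
         "city was $16,336.")
no_per_capita = "The median income for a household in the city was $31,893."

for label, value in extract_incomes(line2).items():
    print(f"{label}: {value}")
```

A line without a per capita sentence, such as `no_per_capita` here, yields "" for the missing entries instead of raising an error.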
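Answer 1 (score: 2)
As others have pointed out, you need some additional programming logic. Consider the following example, which uses a regular expression to find the values in question and, if necessary, calculates the median (the midpoint of a "versus" pair):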
import re, locale
from locale import atoi
locale.setlocale( locale.LC_ALL, 'en_US.UTF-8' )
lines = ["The median income for a household in the city was $64,411, and the median income for a family was $78,940. The per capita income for the city was $22,466. About 4.3% of families and 5.9% of the population were below the poverty line, including 7.0% of those under age 18 and 12.3% of those age 65 or over.",
"The median income for a household in the city was $31,893, and the median income for a family was $38,508. Males had a median income of $30,076 versus $20,275 for females. The per capita income for the city was $16,336. About 14.1% of families and 16.7% of the population were below the poverty line, including 21.8% of those under age 18 and 21.0% of those age 65 or over."]
# define the regex
rx = re.compile(r'''
    (?P<type>household|family|per\ capita)  # keyword naming the income type
    \D+                                     # skip non-digits up to the amount
    \$(?P<amount>\d[\d,]*\d)                # first dollar amount after the keyword
    (?:
        \s+versus\s+
        \$(?P<amount2>\d[\d,]*\d)           # optional second amount
    )?''', re.VERBOSE)

def afterwork(match):
    # average the two amounts when a "versus" pair was matched
    if match.group('amount2'):
        amount = (atoi(match.group('amount')) + atoi(match.group('amount2'))) / 2
    else:
        amount = atoi(match.group('amount'))
    return amount

result = {}
for index, line in enumerate(lines):
    result['line' + str(index)] = [(m.group('type'), afterwork(m)) for m in rx.finditer(line)]
print(result)
# {'line0': [('household', 64411), ('family', 78940), ('per capita', 22466)], 'line1': [('household', 31893), ('family', 38508), ('per capita', 16336)]}
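Note that `locale.setlocale` raises `locale.Error` on machines where the `en_US.UTF-8` locale is not installed. If that is a concern, the following sketch avoids the locale module by stripping the commas by hand. It assumes the amounts always use a comma as the thousands separator; `to_int` and `parse_incomes` are hypothetical helper names, and the regex is a non-verbose copy of the first part of the pattern above.

```python
import re

# Keyword, then any non-digits, then the first dollar amount.
AMOUNT_RX = re.compile(r"(household|family|per capita)\D+\$(\d[\d,]*\d)")

def to_int(amount):
    """Convert a string such as '31,893' to the integer 31893."""
    return int(amount.replace(",", ""))

def parse_incomes(text):
    """Return (type, amount) pairs in order of appearance."""
    return [(kind, to_int(amount)) for kind, amount in AMOUNT_RX.findall(text)]

sample = ("The median income for a household in the city was $31,893, and the "
          "median income for a family was $38,508. Males had a median income of "
          "$30,076 versus $20,275 for females. The per capita income for the "
          "city was $16,336.")
print(parse_incomes(sample))
# [('household', 31893), ('family', 38508), ('per capita', 16336)]
```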
Answer 2 (score: 0)
On line 2, findall finds more than 3 matches, and you are trying to unpack them into only 3 variables.
Use something like:
[householdIncome, familyIncome, perCapitalIncome] = re.findall(r"\d+,\d+", line1)[:3]
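Keep in mind that the slice takes the first three matches in order of appearance, so whether the third one is the per capita income depends on how the sentence is laid out. A quick check on the second line from the question (the name `third` is only for illustration):

```python
import re

line2 = ("The median income for a household in the city was $31,893, and the "
         "median income for a family was $38,508. Males had a median income of "
         "$30,076 versus $20,275 for females. The per capita income for the "
         "city was $16,336.")

amounts = re.findall(r"\d+,\d+", line2)
household, family, third = amounts[:3]

print(amounts)  # ['31,893', '38,508', '30,076', '20,275', '16,336']
print(third)    # 30,076 (the male median, not the per capita income)
```

The slice also does not help when a line has fewer than three amounts: unpacking then fails with a "not enough values to unpack" ValueError.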