I'm trying to clean up text in the following format:
['The first chunk of text \n 123 the stats I want (25% the percentage I want) \n The Second chunk of text \n 456 the second stats I want (50% the second percentage I want) \n The third chunk of text \n 789 the third stats I want (75% the third percentage) \n The fourth chunk of text \n 101 The fourth stats (100% the fourth percentage) \n'
]
I have tried the following code:
import re

def cleanData(data):
    first_line, second_line = data[0].split("\n")[:2]
    print(first_line)
    # A run of digits not followed by another digit or '%' (the plain number)
    digit_match = re.search(r'\d+(?![\d%])', second_line)
    if digit_match:
        print(digit_match.group())
    # A run of digits followed by '%' (the percentage)
    percent_match = re.search(r'\d+%', second_line)
    if percent_match:
        print(percent_match.group())
This works for the first instance, but I can't get it to work for all of them. I keep getting ValueError: not enough values to unpack.
Any input would be greatly appreciated!
Answer 0 (score: 0)
You could try something like this:
Split data into lines:
In [14]: lines = data[0].splitlines()
Use the zip(iter, iter) recipe to group the lines into (text, stats) pairs:
In [15]: it = iter(lines)
In [16]: pairs = zip(it, it)
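In case the zip(it, it) recipe looks odd, here is a minimal sketch of why it pairs up consecutive items. The list `items` is made-up sample data, not the data from the question:

```python
# Hypothetical sample data, used only to illustrate the idiom.
items = ["text 1", "stats 1", "text 2", "stats 2", "leftover"]

it = iter(items)
# Both arguments are the *same* iterator object, so building each tuple
# consumes two consecutive items from it.
pairs = list(zip(it, it))
print(pairs)
```

Note that zip stops as soon as one of its arguments is exhausted, so a trailing unpaired item (here "leftover") is silently dropped.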
Use a nested list comprehension with re.findall to pull the numbers out of the second part of each pair:
In [17]: [(a,b,c) for (a, (b, c)) in ((A, re.findall(r"\d+", B)) for A, B in pairs)]
Out[17]:
[('The first chunk of text ', '123', '25'),
(' The Second chunk of text ', '456', '50'),
(' The third chunk of text ', '789', '75'),
(' The fourth chunk of text ', '101', '100')]
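Note: this assumes that those two numbers are the only runs of digits in the second part of each pair. If that is not the case, you can use a variant with a re.search expression in the list comprehension instead. (In Python 3, pairs is a one-shot iterator, so re-create it before running a second comprehension over it.)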
In [32]: [(A, re.search(r"(\d+).*\((\d+%).*\)", B).groups()) for A, B in pairs]
Out[32]:
[('The first chunk of text ', ('123', '25%')),
(' The Second chunk of text ', ('456', '50%')),
(' The third chunk of text ', ('789', '75%')),
(' The fourth chunk of text ', ('101', '100%'))]
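One caveat with the re.search variant: re.search returns None when a stats line does not match, and calling .groups() on None raises AttributeError. A rough sketch of a more defensive version follows; the names `parse_pairs` and `STATS_RE` are my own, and the data is a shortened copy of the sample from the question:

```python
import re

# Shortened copy of the sample data from the question (two pairs only).
data = ['The first chunk of text \n 123 the stats I want (25% the percentage I want) \n'
        ' The Second chunk of text \n 456 the second stats I want (50% the second percentage I want) \n']

# Same pattern as in the comprehension: a number, then "(<number>% ...)".
STATS_RE = re.compile(r"(\d+).*\((\d+%).*\)")

def parse_pairs(lines):
    """Yield (text, number, percent); skip pairs whose stats line does not match."""
    it = iter(lines)
    for text, stats in zip(it, it):
        match = STATS_RE.search(stats)
        if match is None:
            continue  # malformed stats line: skip it instead of crashing
        number, percent = match.groups()
        yield text.strip(), number, percent

rows = list(parse_pairs(data[0].splitlines()))
print(rows)
```

Whether skipping bad lines is the right behaviour depends on your data; raising an error with the offending line may be more useful when you want to notice malformed input.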
Of course, you can also put this in a loop, which may be more readable, and is the better fit if all you want to do is print the values:
iterator = iter(data[0].splitlines())
for text in iterator:
    stats = next(iterator)
    digit, percent = re.search(r"(\d+).*\((\d+%).*\)", stats).groups()
    print("{:<30} {:>5} {:>5}".format(text.strip(), digit, percent))
Output:
The first chunk of text          123   25%
The Second chunk of text         456   50%
The third chunk of text          789   75%
The fourth chunk of text         101  100%
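Update: regarding ValueError: not enough values to unpack: it looks like your list may contain an odd number of elements, possibly because of malformed data or an empty line at the end of the document. In that case my solution would run into similar problems, but these can be fixed, for example, by trimming the loaded data to an even number of lines.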
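One possible way to do that trimming, as a sketch only: the helper name `even_lines` and the sample string `raw` are made up for illustration, and dropping blank lines is an extra assumption on my part about what the malformed data looks like.

```python
def even_lines(raw):
    """Drop blank lines, then cut off a trailing line that has no partner."""
    lines = [line for line in raw.splitlines() if line.strip()]
    # len(lines) % 2 is 1 for an odd count, so this removes at most one line.
    return lines[: len(lines) - len(lines) % 2]

# Hypothetical input with a blank line in the middle and a dangling last line.
raw = "text A \n 1 stat (10% pct) \n\n text B \n 2 stat (20% pct) \n dangling text \n"
lines = even_lines(raw)
print(lines)
```

This only repairs the line count. If a text or stats line is missing somewhere in the middle, the pairs after it will still be misaligned, so it is worth checking that each stats line actually matches the expected pattern.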