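Regex for any number of words before a newline

Time: 2017-03-23 13:35:56

Tags: python regex

I have parsed some text out of a paragraph, and I want to split it up so I can insert it into a table.

The string looks like this: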

["Some text unsure how many numbers or if any special charectors etc. But I don't really care I just want all the text in this string \n 123 some more text (50% and some more text) \n"]

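What I want to do is split off the first string of text before the newline, whatever it contains. I started with [A-Za-z]*\s*[A-Za-z]*\s*, but quickly realized that wouldn't cut it, because the text in this string is variable.

Then I want to grab the number in the second string, which the following seems to do:

\d+

Finally, I want to get the percentage in the second string, which the following seems to work for:

\d+(%)+

I plan to use these in a function, but I'm struggling to put together the regex for the first part. I'd also like to know whether my regexes for the two parts I already have are the most efficient.

Update: hopefully this makes it a bit clearer?

Input: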

[' The first chunk of text \n 123 the stats I want (25% the percentage I want) \n The Second chunk of text \n 456 the second stats I want (50% the second percentage I want) \n The third chunk of text \n 789 the third stats I want (75% the third percentage) \n The fourth chunk of text \n 101 The fourth stats (100% the fourth percentage) \n']

Desired output: (shown as an image in the original post)

1 Answer:

Answer 0: (score: 2)

First two lines

You can use split to get the first two lines:

import re

data = ["Some text unsure how many numbers or if any special charectors etc. But I don't really care I just want all the text in this string \n 123 some more text (50% and some more text) \n"]

first_line, second_line = data[0].split("\n")[:2]
print(first_line)
# Some text unsure how many numbers or if any special charectors etc. But I don't really care I just want all the text in this string

# Raw strings keep the backslash in \d from being treated as a string escape.
digit_match = re.search(r'\d+(?![\d%])', second_line)
if digit_match:
    print(digit_match.group())
    # 123

percent_match = re.search(r'\d+%', second_line)
if percent_match:
    print(percent_match.group())
    # 50%

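Note that if the percentage were written before the other number, a plain \d+ would match the percentage (without the %). I added a negative lookahead to make sure the matched number is not followed by a digit or a %.

Every pair

If you want to keep parsing the lines in pairs: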
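
To see what the lookahead changes, here is a small check on a line where the percentage comes first. The sample string `reordered` is made up for illustration and does not come from the question.

```python
import re

# Hypothetical line where the percentage appears before the plain number.
reordered = " (50% of the total) 123 some more text "

# Without the lookahead, the first run of digits wins: the "50" of "50%".
plain = re.search(r'\d+', reordered).group()

# With the lookahead, a run of digits followed by another digit or "%"
# is rejected, so "50" and "5" both fail and the search moves on to "123".
guarded = re.search(r'\d+(?![\d%])', reordered).group()

print(plain)    # 50
print(guarded)  # 123
```

The `[\d%]` in the lookahead needs the `\d` as well as the `%`: with only `%`, the engine could backtrack and accept the "5" of "50%".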

data = [" The first chunk of text \n 123 the stats I want (25% the percentage I want) \n The Second chunk of text \n 456 the second stats I want (50% the second percentage I want) \n The third chunk of text \n 789 the third stats I want (75% the third percentage) \n The fourth chunk of text \n 101 The fourth stats (100% the fourth percentage) \n"]

import re

lines = data[0].strip().split("\n")

# TODO: Make sure there's an even number of lines
for i in range(0, len(lines), 2):
    first_line, second_line = lines[i:i + 2]

    print(first_line)

    digit_match = re.search(r'\d+(?![\d%])', second_line)
    if digit_match:
        print(digit_match.group())

    percent_match = re.search(r'\d+%', second_line)
    if percent_match:
        print(percent_match.group())

Output:

The first chunk of text 
123
25%
 The Second chunk of text 
456
50%
 The third chunk of text 
789
75%
 The fourth chunk of text 
101
100%
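
Since the goal is to insert the values into a table, the same loop could be wrapped in a function that returns one row per pair of lines instead of printing. This is only a sketch: the name `parse_pairs`, the tuple layout, and the choice to raise `ValueError` on an odd number of lines (one possible way to handle the TODO) are assumptions, not part of the original answer.

```python
import re

NUMBER = re.compile(r'\d+(?![\d%])')  # digits not followed by a digit or "%"
PERCENT = re.compile(r'\d+%')         # digits immediately followed by "%"


def parse_pairs(text):
    """Return (description, number, percent) tuples, one per pair of lines."""
    lines = text.strip().split("\n")
    if len(lines) % 2:
        raise ValueError("expected an even number of lines, got %d" % len(lines))

    rows = []
    for i in range(0, len(lines), 2):
        first_line, second_line = lines[i:i + 2]
        number = NUMBER.search(second_line)
        percent = PERCENT.search(second_line)
        rows.append((
            first_line.strip(),                   # drop the padding around the text
            number.group() if number else None,   # None when no number is found
            percent.group() if percent else None,
        ))
    return rows


sample = (" The first chunk of text \n 123 the stats I want (25% the percentage I want) \n"
          " The Second chunk of text \n 456 the second stats I want (50% the second percentage I want) \n")
for row in parse_pairs(sample):
    print(row)
```

Each tuple can then be passed to whatever builds the table, for example `csv.writer(...).writerows(rows)`.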