Extracting text between two pieces of text

Date: 2017-01-31 10:37:34

Tags: python regex python-3.x text-extraction

I am trying to use Python to extract the text between the following headers:

@HEADER1
ExtractMe
ExtractMe
ExtractMe
ExtractMe
ExtractMe
ExtractMe
ExtractMe
ExtractMe
ExtractMe
@othertext

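The exact text of @HEADER1 and @othertext may change over time, so the solution needs to be dynamic.

Also, HEADER2 is a word that starts with '@'. Could I use the startswith function for that, or a regular expression?

Something like this: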

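# Rough idea ("input.txt" is a placeholder file name):
printing = False
with open("input.txt") as file:
    for line in file:
        line = line.rstrip("\n")
        if line == "@HEADER1":
            printing = True    # start printing from the next line
        elif line == "@othertext":
            break              # stop at the closing marker
        elif printing:
            print(line)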

4 answers:

Answer 0 (score: 4)

This does the job:

import re

string = """@HEADER1
ExtractMe
ExtractMe
ExtractMe
ExtractMe
ExtractMe
ExtractMe
ExtractMe
ExtractMe
ExtractMe
@othertext"""

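# The capturing group makes re.split keep the markers in the result,
# so the text between them ends up at index 2.
print('"{}"'.format(re.split(r'(@HEADER1[\n\r]|[\n\r]@othertext)', string)[2]))

Output: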

"ExtractMe
ExtractMe
ExtractMe
ExtractMe
ExtractMe
ExtractMe
ExtractMe
ExtractMe
ExtractMe"
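
Since the question says the marker text can change, the same idea can be wrapped in a function that takes the markers as arguments. The following is only a sketch and not part of the original answer: the name `extract_between` and its parameters are made up for illustration, and it uses `re.search` with `re.escape` (instead of `re.split`) so that markers containing regex metacharacters are matched literally.

```python
import re

def extract_between(text, start_marker, end_marker):
    """Return the text between two marker lines, or None if not found.

    Hypothetical helper, not from the original answer.
    """
    pattern = (
        re.escape(start_marker) + r"\r?\n"   # opening marker and its line break
        + r"(.*?)"                           # body, matched lazily
        + r"\r?\n" + re.escape(end_marker)   # line break before the closing marker
    )
    match = re.search(pattern, text, re.DOTALL)
    return match.group(1) if match else None

sample = "@HEADER1\nExtractMe\nExtractMe\n@othertext"
print(extract_between(sample, "@HEADER1", "@othertext"))
```

Returning None when a marker is missing is a design choice of this sketch; raising an exception would work just as well.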

Answer 1 (score: 2)

Something like this?

import re

string = """@HEADER1
ExtractMe
ExtractMe
ExtractMe
ExtractMe
ExtractMe
ExtractMe
ExtractMe
ExtractMe
ExtractMe
@othertext
@HEADER2
ExtractMe2
ExtractMe2
ExtractMe2
ExtractMe2
ExtractMe2
ExtractMe2
ExtractMe2
ExtractMe2
ExtractMe2
@othertext"""

# re.DOTALL lets "." match line breaks, so (.*?) can span several lines.
for a in re.findall(r'@\w+(?:\r\n|\r|\n)(.*?)@\w+(?:\r\n|\r|\n)?', string, re.DOTALL):
    print(a)

Output:

ExtractMe
ExtractMe
ExtractMe
ExtractMe
ExtractMe
ExtractMe
ExtractMe
ExtractMe
ExtractMe

ExtractMe2
ExtractMe2
ExtractMe2
ExtractMe2
ExtractMe2
ExtractMe2
ExtractMe2
ExtractMe2
ExtractMe2
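
If it matters which header each block belongs to, a second capturing group can keep the opening marker as well. This is a sketch of a variation on the pattern above, not the answerer's code; the shortened sample text and the names `pattern` and `sections` are assumptions made for the example.

```python
import re

text = "@HEADER1\nA\nB\n@othertext\n@HEADER2\nC\nD\n@othertext"

# (@\w+) captures the opening marker, (.*?) lazily captures the body,
# and the trailing @\w+ consumes the closing marker.
pattern = re.compile(r"(@\w+)\r?\n(.*?)\r?\n@\w+", re.DOTALL)

# Map each opening marker to the text that follows it.
sections = {m.group(1): m.group(2) for m in pattern.finditer(text)}
print(sections)
```

Unlike the pattern in the answer, this one leaves the line break before the closing marker out of the captured body.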

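Answer 2 (score: 0)

Without re:

string = """@HEADER1
ExtractMe
ExtractMe
ExtractMe
ExtractMe
ExtractMe
ExtractMe
ExtractMe
ExtractMe
ExtractMe
@othertext"""

You can use str.find together with string slicing, like this:

# Start just after the first line break; stop at the line break before the next "@".
print(string[string.find("\n") + 1:string.find("\n@")])

Or you can convert the string to a list of lines, take the elements you need, and join them back together:

lines = string.split("\n")
lines = lines[1:-1]  # drop the first and last lines (the two markers)
print("\n".join(lines))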
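
The question also asks whether startswith could be used. A line-by-line version without re might look like the sketch below. It assumes every marker line begins with '@'; the function name `extract_sections` and the dictionary it returns are illustrative choices, not something from the answers.

```python
def extract_sections(text):
    """Group lines under the most recent line that starts with '@'.

    Hypothetical helper; the name and return shape are illustrative.
    """
    sections = {}
    current = None
    for line in text.splitlines():
        if line.startswith("@"):
            current = line             # a new marker line begins
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return sections

sample = "@HEADER1\nExtractMe\nExtractMe\n@othertext"
result = extract_sections(sample)
print("\n".join(result["@HEADER1"]))
```

Because the markers are discovered rather than hard-coded, this keeps working when their text changes. One caveat of the sketch: a marker that appears twice overwrites its earlier entry.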

Answer 3 (score: 0)

I use the partition() method in this case:

text_to_extract = "@HEADER1\nExtractMe\nExtractMe\nExtractMe\nExtractMe\nExtractMe\nExtractMe\nExtractMe\nExtractMe\nExtractMe\n@othertext"
extracted = text_to_extract.partition('@HEADER1')[2].partition('@othertext')[0]
print(extracted)
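
Note that str.partition returns an empty separator when the marker is not found, so a missing marker goes unnoticed in the one-liner. A slightly more defensive variant could look like this sketch; the function name `between` and the choice to return None are assumptions for illustration.

```python
def between(text, start_marker, end_marker):
    """Return the text between two markers, or None if either is missing.

    Hypothetical wrapper around str.partition, for illustration only.
    """
    _, found_start, rest = text.partition(start_marker)
    if not found_start:
        return None                      # start marker not present
    body, found_end, _ = rest.partition(end_marker)
    if not found_end:
        return None                      # end marker not present
    return body.strip("\r\n")            # drop the line breaks next to the markers

sample = "@HEADER1\nExtractMe\nExtractMe\n@othertext"
print(between(sample, "@HEADER1", "@othertext"))
```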