I have
import re
FILE = open("file.txt", "r")  # long text file
TEXT = FILE.read()
# long identification code with dots (.) and hyphens (-)
regex = r"process \d\d\d\d\d\d\d\-\d\d\.\d\d\d\d\.\d+\.\d\d\.\d\d\d\d"
SRC = re.findall(regex, TEXT, flags=re.IGNORECASE|re.MULTILINE)
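How can I get the text between the first character of one occurrence, SRC[i], and the first character of the next occurrence, SRC[i+1], and so on? I could not find a straightforward, satisfactory answer...
More information (edit):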
pattern = r'process \d{7}\-\d{2}\.\d{4}\.\d+\.\d{2}\.\d{4}'
sample_input = "Process 1234567-89.1234.12431242.12.1234 - text title and long text description with no assured pattern Process 2234567-89.1234.12431242.12.1234 : chars and more text Process 3234567-89.1234.12431242.12.1234 - more text process 3234567-89.1234.12431242.12.1234 (...)"
sample_output[0] = "Process 1234567-89.1234.12431242.12.1234 - text title and long text description with no assured pattern "
sample_output[1] = "Process 2234567-89.1234.12431242.12.1234 : chars and more text "
sample_output[2] = "Process 3234567-89.1234.12431242.12.1234 - more text "
sample_output[3] = "process 3234567-89.1234.12431242.12.1234 "
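Answer 0 (score: 1)
Suppose you have a string some_str = 'abcARelevant_SubstringAcba' and you want the string between the first A and the second A; that is, the desired output is 'Relevant_Substring'.
You can find the indices of every occurrence of A in some_str with the following line: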
inds = [a.start() for a in re.finditer('A', some_str)]
Now inds = [3, 22], so some_str[inds[0]+1:inds[1]] contains 'Relevant_Substring'.
This should extend to your problem.
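As a minimal sketch of how the index idea might carry over to the question, the code below slices the text from the start of each match to the start of the next one, so every chunk keeps its own identifier. The pattern is the one from the question; the helper name split_at_matches and the shortened sample text are assumptions made for illustration.

```python
import re

# Pattern taken from the question; IGNORECASE so "process" also matches.
PATTERN = re.compile(
    r"process \d{7}-\d{2}\.\d{4}\.\d+\.\d{2}\.\d{4}",
    flags=re.IGNORECASE,
)

def split_at_matches(text):
    """Return the chunks of text that begin at each match of PATTERN."""
    starts = [m.start() for m in PATTERN.finditer(text)]
    # Each chunk ends where the next one starts; the last runs to the end.
    ends = starts[1:] + [len(text)]
    return [text[s:e] for s, e in zip(starts, ends)]

# Shortened, hypothetical sample in the shape of the question's input.
sample_input = (
    "Process 1234567-89.1234.12431242.12.1234 - text title "
    "Process 2234567-89.1234.12431242.12.1234 : chars and more text "
    "process 3234567-89.1234.12431242.12.1234 (...)"
)
chunks = split_at_matches(sample_input)
for chunk in chunks:
    print(repr(chunk))
```

Any text before the first match is dropped by this sketch, and an input with no match yields an empty list.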
Edit: here is a concrete example.
Suppose you have a file "file.txt" containing the following text:
Stuff I don't want.
0
Stuff I do want.
1
More stuff I don't want.
You want to use every digit (0-9) as a delimiter, so both the 0 and the 1 above act as delimiters. Try the following code:
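import re
with open("file.txt", "r") as file:
    data = file.read()
patt = re.compile('[0-9]')
# Start index of every digit; each digit acts as a delimiter
inds = [a.start() for a in re.finditer(patt, data)]
# Slice between the first two delimiters and drop the surrounding newlines
print(data[inds[0]+1:inds[1]].strip())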
This should print Stuff I do want.
Answer 1 (score: 1)
You can use this regex:
(Process \d{7}\-\d{2}\.\d{4}\.\d+\.\d{2}\.\d{4}.*?)(?=Process)|(Process \d{7}\-\d{2}\.\d{4}\.\d+\.\d{2}\.\d{4}.*)
Match information
MATCH 1
1. [0-105] `Process 1234567-89.1234.12431242.12.1234 - text title and long text description with no assured pattern `
MATCH 2
1. [105-168] `Process 2234567-89.1234.12431242.12.1234 : chars and more text `
MATCH 3
1. [168-221] `Process 3234567-89.1234.12431242.12.1234 - more text `
MATCH 4
2. [221-267] `Process 3234567-89.1234.12431242.12.1234 (...)`
You can use this code:
import re
sample_input = "Process 1234567-89.1234.12431242.12.1234 - text title and long text description with no assured pattern Process 2234567-89.1234.12431242.12.1234 : chars and more text Process 3234567-89.1234.12431242.12.1234 - more text process 3234567-89.1234.12431242.12.1234 (...)"
pattern = r"(Process \d{7}\-\d{2}\.\d{4}\.\d+\.\d{2}\.\d{4}.*?)(?=Process)|(Process \d{7}\-\d{2}\.\d{4}\.\d+\.\d{2}\.\d{4}.*)"
# re.match returns only the first match, so iterate over all of them with re.finditer;
# re.IGNORECASE lets the lowercase "process" start a match of its own
for m in re.finditer(pattern, sample_input, flags=re.IGNORECASE):
    print(m.group(0))  # the whole match; group(1) or group(2) is set depending on the alternative
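An alternative sketch, assuming Python 3.7 or later (where re.split can split on an empty match): split on a zero-width lookahead for the full identifier. Because the lookahead requires the whole identifier rather than just the word "Process", a stray "process" inside a description should not start a new chunk. The name BOUNDARY and the shortened sample text are assumptions made for illustration.

```python
import re

# Zero-width lookahead: split *before* each identifier so the identifier
# stays at the start of its chunk. Needs Python 3.7+ for empty-match splits.
BOUNDARY = re.compile(
    r"(?=process \d{7}-\d{2}\.\d{4}\.\d+\.\d{2}\.\d{4})",
    flags=re.IGNORECASE,
)

# Shortened, hypothetical sample; note the stray word "process" in the text.
sample_input = (
    "Process 1234567-89.1234.12431242.12.1234 - the process is described "
    "Process 2234567-89.1234.12431242.12.1234 : second "
    "process 3234567-89.1234.12431242.12.1234 (...)"
)

# Drop the empty leading piece produced when the text starts with a match.
pieces = [p for p in BOUNDARY.split(sample_input) if p]
for piece in pieces:
    print(repr(piece))
```

Unlike the slicing approach, any text before the first identifier is kept here as its own leading piece.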
Answer 2 (score: 0)
You don't need a regular expression to find a string between two characters:
some_str = 'abcARelevant_SubstringAcba'
print(some_str.split("A", 2)[1])
Relevant_Substring