I have a text file like the following:
<Author>Marilyn1949
<Content>great way of doing things.
can you provide more info.blah blah blah..
<Date>Dec 1, 2008...
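(file content continues in similar fashion for other authors)
I am trying to extract the content part with the code below. Can you help me figure out what I am missing? My output file comes out as nothing but a series of empty lists, [].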
text_file = open("output/out.txt", "w")
for file in os.listdir("./"):
    if glob.fnmatch.fnmatch(file, '*.txt'):
        with open(file, "r") as source:
            L = source.read()
            pattern = re.compile(r'<Content>*<Date>')
            for match in L:
                result = re.findall(r'<Content>.*<Date>', match)
                text_file.write(str(result))
                text_file.write('\n')
Answer 0 (score: 0)
The dot character matches anything except a newline. Use the re.DOTALL flag to make it match newlines as well:
result = re.findall(r'<Content>.*<Date>', match, flags=re.DOTALL)
Also, you probably don't want to capture the tags themselves:
result = re.findall(r'<Content>(.*)<Date>', match, flags=re.DOTALL)
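As a rough illustration of the two points above, the sketch below runs both patterns on a small sample string modelled on the file in the question. The variable names (sample, no_flag, with_flag) are only for illustration and are not part of the original code.

```python
import re

# Sample text modelled on the record shown in the question.
sample = (
    "<Author>Marilyn1949\n"
    "<Content>great way of doing things.\n"
    "can you provide more info.\n"
    "<Date>Dec 1, 2008\n"
)

# Without DOTALL, '.' stops at the first newline, so the pattern
# can never reach <Date> on a later line.
no_flag = re.findall(r'<Content>.*<Date>', sample)

# With DOTALL, '.' also matches '\n'; the parentheses keep only
# the text between the two tags.
with_flag = re.findall(r'<Content>(.*)<Date>', sample, flags=re.DOTALL)

print(no_flag)
print(with_flag)
```

The first call finds nothing because the content spans two lines; the second returns the text between the tags, including its embedded newlines.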
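Cleaning up your example a bit (search the whole file text at once, since looping over the string returned by read() yields single characters, and use the non-greedy .*? so that each match stops at the nearest <Date>):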
with open(file, "r") as source:
    results = re.findall(r'<Content>(.*?)<Date>', source.read(), flags=re.DOTALL)
    text_file.write('\n'.join(results))
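For completeness, here is one way the whole script might look once the pieces are put together. This is a minimal sketch, not the only way to do it: the helper names extract_contents and collect are hypothetical, the use of pathlib and the strip() call are my own additions, and it assumes every record has a <Content> tag followed by a <Date> tag.

```python
import re
from pathlib import Path

# Non-greedy group plus DOTALL: capture everything between <Content>
# and the nearest following <Date>, across line breaks.
CONTENT_RE = re.compile(r'<Content>(.*?)<Date>', flags=re.DOTALL)


def extract_contents(text):
    """Return the text between each <Content> tag and the next <Date> tag."""
    # strip() is an addition here: it drops the newline left before <Date>.
    return [chunk.strip() for chunk in CONTENT_RE.findall(text)]


def collect(source_dir, out_path):
    """Extract content from every .txt file in source_dir into out_path."""
    out_path = Path(out_path)
    # Create the output directory if it does not exist yet.
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w") as out:
        # glob("*.txt") is not recursive, so a file inside a
        # subdirectory such as output/ is not read back in.
        for path in sorted(Path(source_dir).glob("*.txt")):
            for content in extract_contents(path.read_text()):
                out.write(content + "\n")
```

Calling collect("./", "output/out.txt") would mirror the paths used in the question. Keeping the regex work in extract_contents makes it easy to try the pattern on a plain string before touching any files.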