I have a large number of text files that I can read in Python. Each file has the following structure:
------------------------------------------------------------------------------
\\
Paper: some_integer
From: <some_email_address>
Date: Wed, 4 Apr 2001 12:08:13 GMT (27kb)
Date (revised v2): Tue, 8 May 2001 10:39:33 GMT (27kb)
Title: some_title
Authors: name_1, name_2
Comments: 28 pages, JHEP latex
\\
blablabla (this is a multiline abstract of the paper)
blablabla
blablabla
\\
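I would like to automatically extract and store (for example, as lists) the Title, the Authors, and the abstract (the text between the second and third \\ lines; note that it starts with an indent in each text file). Also note that the blank line between Date (revised) and Title really exists in the files (it is not a typo that I introduced).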
My attempts so far have involved the following (I am showing the steps for a single text file, say the first file in the list):
import os
import pandas as pd
filename = os.listdir(path)[0]  # path: the directory holding the text files
test = pd.read_csv(filename, header=None, delimiter="\t")
This gives me:
0
0 ----------------------------------------------...
1 \\
2 Paper: some_integer
3 From: <some_email_address>
4 Date: Wed, 4 Apr 2001 12:08:13 GMT (27kb)
5 Date (revised v2): Tue, 8 May 2001 10:39:33 G...
6 Title: some_title...
7 Authors: name_1, name_2
8 Comments: 28 pages, JHEP latex
9 Report-no: DUKE-CGTP-00-01
10 \\
11 blabla...
12 blabla...
13 blabla...
14 \\
Then I can select a given row (for example, the one containing the title):
test[test[0].str.contains("Title")].to_string()
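But the result is truncated and it is not a clean string (some attributes show up), and I find this whole pandas-based approach rather tedious. There must be an easier way to select the lines of interest directly from the text file using regular expressions. At least I hope so.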
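Answer 0 (score: 1)
How about iterating over each line in the file and splitting it on the first ": " (if it is present in the line), collecting the split results in a dictionary: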
with open("input.txt") as f:
    data = dict(line.strip().split(": ", 1) for line in f if ": " in line)
As a result, data would contain:
{
'Comments': '28 pages, JHEP latex',
'Paper': 'some_integer',
'From': '<some_email_address>',
'Date (revised v2)': 'Tue, 8 May 2001 10:39:33 GMT (27kb)',
'Title': 'some_title',
'Date': 'Wed, 4 Apr 2001 12:08:13 GMT (27kb)',
'Authors': 'name_1, name_2'
}
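The dictionary above holds the header fields but not the abstract, which has no "key: value" form. A minimal sketch of one way to add it, assuming each separator is a line consisting only of two backslashes and that the abstract is the block after the second separator; parse_record and the "Abstract" key are hypothetical names, not part of the answer:

```python
def parse_record(text):
    """Return the header fields plus the abstract of one record (hypothetical helper)."""
    # Cut the text into blocks at every line that is exactly two backslashes.
    blocks = [[]]
    for line in text.splitlines():
        if line.strip() == "\\\\":
            blocks.append([])
        else:
            blocks[-1].append(line)
    # Assumed layout: blocks[0] precedes the first separator,
    # blocks[1] is the header, blocks[2] is the abstract.
    header, abstract = blocks[1], blocks[2]
    data = dict(line.strip().split(": ", 1) for line in header if ": " in line)
    data["Abstract"] = "\n".join(line.strip() for line in abstract)
    return data


sample = r"""------------------------------------------------------------------------------
\\
Paper: some_integer
From: <some_email_address>
Date: Wed, 4 Apr 2001 12:08:13 GMT (27kb)
Date (revised v2): Tue, 8 May 2001 10:39:33 GMT (27kb)

Title: some_title
Authors: name_1, name_2
Comments: 28 pages, JHEP latex
\\
  blablabla (this is a multiline abstract of the paper)
  blablabla
  blablabla
\\
"""

record = parse_record(sample)
print(record["Title"], "|", record["Authors"])
print(record["Abstract"])
```

A record with fewer than two separators would raise an IndexError here, so real files may need a guard around the block lookup.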
Answer 1 (score: 1)
If your files always have the same structure, you could come up with:
# -*- coding: utf-8 -*-
import re

# Raw string, so each separator line stays two literal backslashes.
string = r"""
------------------------------------------------------------------------------
\\
Paper: some_integer
From: <some_email_address>
Date: Wed, 4 Apr 2001 12:08:13 GMT (27kb)
Date (revised v2): Tue, 8 May 2001 10:39:33 GMT (27kb)
Title: some_title
Authors: name_1, name_2
Comments: 28 pages, JHEP latex
\\
blablabla (this is the abstract of the paper)
\\
"""

rx = re.compile(r"""
    ^Title:\s(?P<title>.+)[\n\r]        # Title at the beginning of a line
    Authors:\s(?P<authors>.+)[\n\r]     # Authors: ...
    Comments:\s(?P<comments>.+)[\n\r]   # ... and so on ...
    .*[\n\r]                            # the separator line
    (?P<abstract>.+)                    # first line of the abstract
    """,
    re.MULTILINE | re.VERBOSE)  # MULTILINE: the caret matches at every line start
                                # VERBOSE: allows the comments inside the pattern

for match in rx.finditer(string):
    print(match.group('title'), match.group('authors'), match.group('abstract'))
# some_title name_1, name_2 blablabla (this is the abstract of the paper)
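This approach uses Title as an anchor (at the beginning of a line) and then skims through the text that follows. The named groups may not be strictly necessary, but they make the code easier to understand. The pattern [\n\r] looks for a newline character.
See a demo on regex101.com.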
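The pattern above captures only the first line of the abstract, while the files in the question have multiline abstracts. A possible variant, offered as a sketch rather than as the answerer's method: it assumes Authors follows Title directly, that the separators are lines starting with two backslashes, and it collects whole lines lazily until the closing separator. The name rx_multiline and the sample text are illustrative.

```python
import re

rx_multiline = re.compile(r"""
    ^Title:\s(?P<title>.+)\n        # title line
    Authors:\s(?P<authors>.+)\n     # authors line
    (?:.*\n)*?                      # remaining header lines, as few as possible
    \\\\[ \t]*\n                    # separator line: two backslashes
    (?P<abstract>(?:.*\n)*?)        # abstract lines, as few as possible
    \\\\                            # closing separator
    """, re.MULTILINE | re.VERBOSE)

sample_text = r"""------------------------------------------------------------------------------
\\
Paper: some_integer
From: <some_email_address>
Date: Wed, 4 Apr 2001 12:08:13 GMT (27kb)
Date (revised v2): Tue, 8 May 2001 10:39:33 GMT (27kb)

Title: some_title
Authors: name_1, name_2
Comments: 28 pages, JHEP latex
\\
  blablabla (this is a multiline abstract of the paper)
  blablabla
  blablabla
\\
"""

match = rx_multiline.search(sample_text)
# Drop the indentation and join the abstract lines into one string.
abstract = " ".join(line.strip() for line in match.group("abstract").splitlines())
print(match.group("title"))
print(match.group("authors"))
print(abstract)
```

Matching whole lines with (?:.*\n)*? avoids re.DOTALL, which would also let the title and authors groups run across line breaks.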
Answer 2 (score: 1)
You can process the file line by line.
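For the text files in the question, line-by-line processing in Python could look like the sketch below. It is an illustration under stated assumptions, not the answerer's code: it counts separator lines (exactly two backslashes), reads Title and Authors only inside the header block, and treats everything between the second and third separator as the abstract. The function name extract_fields and the result keys are hypothetical.

```python
def extract_fields(lines):
    """Collect title, authors and abstract from an iterable of lines (hypothetical helper)."""
    result = {"title": None, "authors": None, "abstract": []}
    separators_seen = 0
    for raw in lines:
        line = raw.strip()
        if line == "\\\\":
            # A separator line: move on to the next block of the record.
            separators_seen += 1
            continue
        if separators_seen == 1:
            # Header block: pick out the two fields of interest.
            if line.startswith("Title: "):
                result["title"] = line[len("Title: "):]
            elif line.startswith("Authors: "):
                result["authors"] = line[len("Authors: "):]
        elif separators_seen == 2:
            # Abstract block: keep every line, without its indentation.
            result["abstract"].append(line)
    result["abstract"] = "\n".join(result["abstract"])
    return result


sample = r"""------------------------------------------------------------------------------
\\
Paper: some_integer
From: <some_email_address>
Date: Wed, 4 Apr 2001 12:08:13 GMT (27kb)
Date (revised v2): Tue, 8 May 2001 10:39:33 GMT (27kb)

Title: some_title
Authors: name_1, name_2
Comments: 28 pages, JHEP latex
\\
  blablabla (this is a multiline abstract of the paper)
  blablabla
  blablabla
\\
"""

fields = extract_fields(sample.splitlines())
print(fields)
```

Because the function accepts any iterable of lines, an open file object can be passed in directly, for example extract_fields(f) inside a with open(...) as f block.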
Answer 3 (score: 1)
This pattern will get you started:
\\[^\\].*[^\\]+Title:\s+(\S+)\s+Authors:\s+(.*)[^\\]+\\+\s+([^\\]*)\n\\
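Assuming 'txtfile.txt' is formatted as shown above, with Python 2.7.x (the parenthesized print below also runs on Python 3):
import re

with open('txtfile.txt', 'r') as f:
    input_string = f.read()

# Note: (\S+) captures a title without whitespace, as in the sample.
p = r'\\[^\\].*[^\\]+Title:\s+(\S+)\s+Authors:\s+(.*)[^\\]+\\+\s+([^\\]*)\n\\'
print(re.findall(p, input_string))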
Output:
[('some_title', 'name_1, name_2', 'blablabla (this is a multiline abstract of the paper)\n blablabla\n blablabla')]
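Since the question is about many files, here is a hedged sketch of applying such a pattern to every file in a directory and collecting three lists. Two things are assumptions rather than part of the answer: the title group is widened from (\S+) to (.+) so that titles containing spaces also match, and the files are assumed to end in .txt. The function name collect, the file names, and the temporary directory exist only for the demonstration.

```python
import re
import tempfile
from pathlib import Path

# Variant of the pattern above with (.+) for the title (an assumption).
PATTERN = re.compile(
    r'\\[^\\].*[^\\]+Title:\s+(.+)\s+Authors:\s+(.*)[^\\]+\\+\s+([^\\]*)\n\\'
)


def collect(directory):
    """Return (titles, authors, abstracts) for all .txt files in a directory."""
    titles, authors, abstracts = [], [], []
    for path in sorted(Path(directory).glob("*.txt")):
        for title, author, abstract in PATTERN.findall(path.read_text()):
            titles.append(title)
            authors.append(author)
            abstracts.append(abstract)
    return titles, authors, abstracts


SAMPLE = r"""------------------------------------------------------------------------------
\\
Paper: some_integer
From: <some_email_address>
Date: Wed, 4 Apr 2001 12:08:13 GMT (27kb)
Date (revised v2): Tue, 8 May 2001 10:39:33 GMT (27kb)

Title: some_title
Authors: name_1, name_2
Comments: 28 pages, JHEP latex
\\
  blablabla (this is a multiline abstract of the paper)
  blablabla
  blablabla
\\
"""

# Demonstration on two throwaway files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for i in (1, 2):
        text = SAMPLE.replace("some_title", f"Title number {i}")
        Path(tmp, f"paper_{i}.txt").write_text(text)
    titles, authors, abstracts = collect(tmp)

print(titles)
print(authors)
```

The abstracts keep their internal line breaks and indentation; they can be normalized afterwards if a single-line string is preferred.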