Reading a text file and selecting specific lines with regex in Python

Time: 2016-02-23 21:25:49

Tags: python regex

I have a large number of text files to read in Python. Each file has the following structure:

------------------------------------------------------------------------------
\\
Paper: some_integer
From: <some_email_address>
Date: Wed, 4 Apr 2001 12:08:13 GMT   (27kb)
Date (revised v2): Tue, 8 May 2001 10:39:33 GMT   (27kb)

Title: some_title 
Authors: name_1, name_2
Comments: 28 pages, JHEP latex
\\
  blablabla (this is a multiline abstract of the paper)
  blablabla
  blablabla
\\

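I would like to automatically extract and store (for example, as a list) the Title, the Authors, and the abstract (the text between the second and third \\; note that it is indented from the start of each line). Also note that the blank line between Date (revised) and Title really is present in the files (it is not a typo I introduced).

My attempts so far have involved the following (I am showing the steps for a single text file, say the first file in the list):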

import os
import pandas as pd
filename = os.path.join(path, os.listdir(path)[0])  # path: directory holding the text files
test = pd.read_csv(filename, header=None, delimiter="\t")

This gives me:

                                                0
0   ----------------------------------------------...
1                                                  \\
2                                 Paper: some_integer
3                          From: <some_email_address>
4         Date: Wed, 4 Apr 2001 12:08:13 GMT   (27kb)
5    Date (revised v2): Tue, 8 May 2001 10:39:33 G...
6                                Title: some_title...
7                             Authors: name_1, name_2
8                      Comments: 28 pages, JHEP latex
9                          Report-no: DUKE-CGTP-00-01
10                                                 \\
11                                          blabla...
12                                          blabla...
13                                          blabla...
14                                                 \\

I can then select a given row (for example, the one containing the title):

test[test[0].str.contains("Title")].to_string()

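But the result is truncated and is not a clean string (some attributes are displayed along with it), and I find this whole pandas-based approach rather tedious. There must be a simpler way to select the lines of interest directly from the text file using regular expressions. At least I hope so...

4 answers:

Answer 0 (score: 1)

How about iterating over each line in the file and splitting it on the first ": " (if the line contains one), collecting the split results in a dictionary: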

with open("input.txt") as f:
    data = dict(line.strip().split(": ", 1) for line in f if ": " in line)

data would then contain:

{
    'Comments': '28 pages, JHEP latex', 
    'Paper': 'some_integer', 
    'From': '<some_email_address>', 
    'Date (revised v2)': 'Tue, 8 May 2001 10:39:33 GMT   (27kb)', 
    'Title': 'some_title', 
    'Date': 'Wed, 4 Apr 2001 12:08:13 GMT   (27kb)', 
    'Authors': 'name_1, name_2'
}
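
The dictionary above holds only the "Key: value" header lines, so the abstract is not captured. A minimal sketch of one way to extend the idea follows. It assumes each file holds a single record whose sections are separated by lines containing only \\; the names parse_record and SAMPLE are hypothetical and not part of the answer.

```python
import re

# Shortened sample record, assumed to mirror the file layout in the question.
SAMPLE = r"""------------------------------------------------------------
\\
Paper: some_integer
Date: Wed, 4 Apr 2001 12:08:13 GMT   (27kb)

Title: some_title 
Authors: name_1, name_2
Comments: 28 pages, JHEP latex
\\
  blablabla (this is a multiline abstract of the paper)
  blablabla
\\
"""


def parse_record(text):
    """Return the header fields plus the abstract of one record as a dict."""
    # Split on lines that consist only of the two-backslash delimiter.
    parts = re.split(r"^\\\\[ \t]*$", text, flags=re.MULTILINE)
    header, abstract = parts[1], parts[2]
    # Same "Key: value" split as in the answer, applied to the header only.
    fields = dict(
        line.strip().split(": ", 1)
        for line in header.splitlines()
        if ": " in line
    )
    # Join the indented abstract lines into one string.
    fields["Abstract"] = " ".join(
        line.strip() for line in abstract.strip().splitlines()
    )
    return fields


record = parse_record(SAMPLE)
print(record["Title"], "|", record["Authors"], "|", record["Abstract"])
```

A file with fewer than three delimiter lines would raise an IndexError here, and header values that continue onto a second line would be dropped, so real data may need extra checks.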

Answer 1 (score: 1)

If your files always have the same structure, you could come up with:

# -*- coding: utf-8 -*-
import re

string = r"""
------------------------------------------------------------------------------
\\
Paper: some_integer
From: <some_email_address>
Date: Wed, 4 Apr 2001 12:08:13 GMT   (27kb)
Date (revised v2): Tue, 8 May 2001 10:39:33 GMT   (27kb)

Title: some_title 
Authors: name_1, name_2
Comments: 28 pages, JHEP latex
\\
  blablabla (this is the abstract of the paper)
\\
"""

rx = re.compile(r"""
    ^Title:\s(?P<title>.+)[\n\r]        # Title at the beginning of a line
    Authors:\s(?P<authors>.+)[\n\r]     # Authors: ...
    Comments:\s(?P<comments>.+)[\n\r]   # ... and so on ...
    .*[\n\r]
    (?P<abstract>.+)""", 
    re.MULTILINE|re.VERBOSE)            # so that the caret matches any line
                                        # + verbose for this explanation

for match in rx.finditer(string):
    print(match.group('title'), match.group('authors'), match.group('abstract'))
    # some_title  name_1, name_2   blablabla (this is the abstract of the paper)

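This approach uses Title as an anchor (at the beginning of a line) and walks through the text from there. The named groups are not strictly necessary, but they make the code easier to understand. The pattern [\n\r] matches a newline character. See a demo on regex101.com.

Answer 2 (score: 1)

You can process the file line by line.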
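
Because . does not match a newline by default, the abstract group in the pattern above captures only the first line of the abstract. A possible variation that collects every line up to the closing \\ is sketched below. It assumes the header always ends with a line containing only \\; RECORD_RX, extract and SAMPLE are hypothetical names introduced for illustration.

```python
import re

# Shortened sample, assumed to mirror the file layout in the question.
SAMPLE = r"""\\
Title: some_title 
Authors: name_1, name_2
Comments: 28 pages, JHEP latex
\\
  blablabla (this is a multiline abstract of the paper)
  blablabla
  blablabla
\\
"""

RECORD_RX = re.compile(
    r"""
    ^Title:\s(?P<title>.+)\n          # title line
    Authors:\s(?P<authors>.+)\n       # authors line
    (?:.*\n)*?                        # remaining header lines, matched lazily
    \\\\[ \t]*\n                      # delimiter line closing the header
    (?P<abstract>(?:.*\n)*?)          # abstract lines, matched lazily
    \\\\                              # delimiter line closing the abstract
    """,
    re.MULTILINE | re.VERBOSE,
)


def extract(text):
    """Return a list of (title, authors, abstract) tuples found in text."""
    results = []
    for match in RECORD_RX.finditer(text):
        # Strip the indentation and join the abstract lines with spaces.
        abstract = " ".join(
            line.strip() for line in match.group("abstract").splitlines()
        )
        results.append(
            (match.group("title").strip(), match.group("authors").strip(), abstract)
        )
    return results


print(extract(SAMPLE))
```

Matching whole lines lazily with (?:.*\n)*? avoids re.DOTALL, which would also let the title and authors groups run across line breaks.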

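A line-by-line pass in Python could be sketched roughly as follows. This is an illustration under stated assumptions, not the answer's original code: it assumes the header sits between the first and second \\ lines and the abstract between the second and third, and the names parse_lines and SAMPLE are hypothetical.

```python
import io

# Shortened sample, assumed to mirror the file layout in the question.
SAMPLE = r"""------------------------------------------------------------
\\
Paper: some_integer

Title: some_title 
Authors: name_1, name_2
Comments: 28 pages, JHEP latex
\\
  blablabla (this is a multiline abstract of the paper)
  blablabla
  blablabla
\\
"""


def parse_lines(lines):
    """Walk the lines once, tracking which delimited section we are in."""
    record = {"Title": None, "Authors": None}
    abstract_lines = []
    delimiters_seen = 0
    for raw in lines:
        line = raw.strip()
        if line == "\\\\":  # a line holding only two backslashes
            delimiters_seen += 1
            continue
        if delimiters_seen == 1:
            # Header section: keep only the fields of interest.
            for key in ("Title", "Authors"):
                prefix = key + ": "
                if line.startswith(prefix):
                    record[key] = line[len(prefix):].strip()
        elif delimiters_seen == 2:
            # Abstract section: collect every line.
            abstract_lines.append(line)
    record["Abstract"] = " ".join(abstract_lines)
    return record


# io.StringIO stands in for an open file; a real file object works the same way.
record = parse_lines(io.StringIO(SAMPLE))
print(record)
```

Since the function only iterates over its argument, it can be fed an open file directly, which keeps memory use low for large collections of files.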

Answer 3 (score: 1)

This pattern will get you started:

\\[^\\].*[^\\]+Title:\s+(\S+)\s+Authors:\s+(.*)[^\\]+\\+\s+([^\\]*)\n\\

Assuming 'txtfile.txt' is formatted as shown above, and you are using Python 2.7.x:

import re
with open('txtfile.txt', 'r') as f:
    input_string = f.read()
p = r'\\[^\\].*[^\\]+Title:\s+(\S+)\s+Authors:\s+(.*)[^\\]+\\+\s+([^\\]*)\n\\'
print re.findall(p, input_string)

Output:

[('some_title', 'name_1, name_2', 'blablabla (this is a multiline abstract of the paper)\n  blablabla\n  blablabla')]
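
Since the question involves many files, the same pattern could be applied to every file in a directory and the results gathered in one list. The following is a rough Python 3 sketch, assuming one record per file and the default text encoding; the function name extract_all is hypothetical, and the pattern is copied unchanged from the snippet above.

```python
import re
from pathlib import Path

# Pattern copied unchanged from the answer above.
PATTERN = re.compile(
    r'\\[^\\].*[^\\]+Title:\s+(\S+)\s+Authors:\s+(.*)[^\\]+\\+\s+([^\\]*)\n\\'
)


def extract_all(directory):
    """Return one (title, authors, abstract) tuple per record found."""
    results = []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue  # skip subdirectories
        text = path.read_text()
        # findall returns a list of 3-tuples, one per match in this file.
        results.extend(PATTERN.findall(text))
    return results
```

Calling extract_all on a directory of such files (for example a hypothetical folder named "abstracts") would return a flat list of tuples. Note that (\S+) matches only a title without whitespace, so a multi-word title would produce no match as written; replacing that group with (.*?) is one possible adjustment for single-line titles.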