Read and select specific lines from a text file with regex in Python

Time: 2016-02-23 21:25:49

Tags: python regex


\\
Paper: some_integer
From: <some_email_address>
Date: Wed, 4 Apr 2001 12:08:13 GMT   (27kb)
Date (revised v2): Tue, 8 May 2001 10:39:33 GMT   (27kb)

Title: some_title 
Authors: name_1, name_2
Comments: 28 pages, JHEP latex
\\
  blablabla (this is a multiline abstract of the paper)
  blablabla
  blablabla
\\

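I would like to automatically extract and store (for example, as a list) the Title, the Authors, and the abstract from each text file. The abstract is the text between the second and third \\; note that it starts with an indent. Also note that the blank line between Date (revised) and Title really is there (it is not a typo that I introduced).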


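My attempt so far, shown for a single file (the first one in the directory):

import os
import pandas as pd

# join the directory with the file name so the file can be found from any working directory
filename = os.path.join(path, os.listdir(path)[0])
test = pd.read_csv(filename, header=None, delimiter="\t")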


0   ----------------------------------------------...
1                                                  \\
2                                 Paper: some_integer
3                          From: <some_email_address>
4         Date: Wed, 4 Apr 2001 12:08:13 GMT   (27kb)
5    Date (revised v2): Tue, 8 May 2001 10:39:33 G...
6                                Title: some_title...
7                             Authors: name_1, name_2
8                      Comments: 28 pages, JHEP latex
9                          Report-no: DUKE-CGTP-00-01
10                                                 \\
11                                          blabla...
12                                          blabla...
13                                          blabla...
14                                                 \\



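However, the output is truncated, it is not a clean string (some attributes are displayed), and I find this whole pandas-based approach rather tedious. There must be a simpler way to select the lines of interest directly from the text file with a regex. At least I hope so...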

4 Answers:

Answer 0: (score: 1)


with open("input.txt") as f:
    data = dict(line.strip().split(": ", 1) for line in f if ": " in line)


{
    'Comments': '28 pages, JHEP latex',
    'Paper': 'some_integer',
    'From': '<some_email_address>',
    'Date (revised v2)': 'Tue, 8 May 2001 10:39:33 GMT   (27kb)',
    'Title': 'some_title',
    'Date': 'Wed, 4 Apr 2001 12:08:13 GMT   (27kb)',
    'Authors': 'name_1, name_2'
}
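
The dictionary comprehension skips the abstract, because abstract lines contain no ": " separator. Below is a minimal sketch of one way to keep it. It assumes that sections are separated by lines consisting only of \\, as the pandas output in the question suggests. The helper name `parse_record` and the "Abstract" key are hypothetical.

```python
def parse_record(text):
    """Split one record into header fields plus the abstract.

    Assumption: sections are separated by lines that contain only two backslashes.
    """
    sections, current = [], []
    for line in text.splitlines():
        if line.strip() == "\\\\":          # delimiter line: close the current section
            sections.append(current)
            current = []
        else:
            current.append(line)
    sections.append(current)

    # sections[0] is whatever precedes the first delimiter,
    # sections[1] holds the header lines, sections[2] the abstract lines.
    header = sections[1] if len(sections) > 1 else []
    abstract = sections[2] if len(sections) > 2 else []

    fields = dict(
        line.strip().split(": ", 1) for line in header if ": " in line
    )
    # Drop the indentation in front of every abstract line.
    fields["Abstract"] = "\n".join(ln.strip() for ln in abstract)
    return fields


sample = r"""\\
Paper: some_integer
Title: some_title
Authors: name_1, name_2
Comments: 28 pages, JHEP latex
\\
  blablabla (this is a multiline abstract of the paper)
  blablabla
\\
"""

record = parse_record(sample)
print([record["Title"], record["Authors"], record["Abstract"]])
```

Splitting on the delimiter lines first keeps the header parsing identical to the one-liner above, so only the abstract needs special handling.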

Answer 1: (score: 1)


# -*- coding: utf-8 -*-
import re

string = """
\\
Paper: some_integer
From: <some_email_address>
Date: Wed, 4 Apr 2001 12:08:13 GMT   (27kb)
Date (revised v2): Tue, 8 May 2001 10:39:33 GMT   (27kb)

Title: some_title 
Authors: name_1, name_2
Comments: 28 pages, JHEP latex
\\
  blablabla (this is the abstract of the paper)
\\
"""

rx = re.compile(r"""
    ^Title:\s(?P<title>.+)[\n\r]        # Title at the beginning of a line
    Authors:\s(?P<authors>.+)[\n\r]     # Authors: ...
    Comments:\s(?P<comments>.+)[\n\r]   # ... and so on ...
    .*[\n\r]                            # skip the delimiter line
    (?P<abstract>.+)                    # first line of the abstract
    """, re.MULTILINE | re.VERBOSE)     # MULTILINE: the caret matches at every line
                                        # VERBOSE: allows these inline comments

for match in rx.finditer(string):
    print(match.group('title'), match.group('authors'), match.group('abstract'))
    # some_title  name_1, name_2   blablabla (this is the abstract of the paper)

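This approach uses Title as the anchor (at the beginning of a line) and then skims through the text that follows. The named groups may not be strictly necessary, but they make the code easier to understand. The pattern [\n\r] looks for newline characters. See a demo on regex101.com.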
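
Because `.` does not match a newline by default, a group such as `(?P<title>.+)` stops at the end of its line. Here is a minimal sketch of how the same anchoring idea could capture an abstract that spans several lines. It assumes the abstract sits between two lines consisting of \\, as in the question. The names `RECORD_RX` and `extract` are illustrative only.

```python
import re

# Hypothetical pattern: anchor on "Title:" at the start of a line, then take
# everything between the next two delimiter lines as the abstract.
RECORD_RX = re.compile(r"""
    ^Title:\s*(?P<title>.+?)\s*$        # title, surrounding blanks trimmed
    \n
    Authors:\s*(?P<authors>.+?)\s*$     # authors on the following line
    .*?                                 # remaining header lines (lazy)
    ^\\\\[ \t]*\n                       # delimiter line that opens the abstract
    (?P<abstract>.*?)                   # abstract, possibly several lines
    \n\\\\[ \t]*$                       # delimiter line that closes it
    """, re.MULTILINE | re.DOTALL | re.VERBOSE)


def extract(text):
    """Return [title, authors, abstract] for every record found in text."""
    results = []
    for m in RECORD_RX.finditer(text):
        # Remove the indentation that precedes each abstract line.
        lines = m.group("abstract").splitlines()
        abstract = "\n".join(line.strip() for line in lines)
        results.append([m.group("title"), m.group("authors"), abstract])
    return results


sample = r"""\\
Paper: some_integer
From: <some_email_address>
Date: Wed, 4 Apr 2001 12:08:13 GMT   (27kb)

Title: some_title
Authors: name_1, name_2
Comments: 28 pages, JHEP latex
\\
  blablabla (this is a multiline abstract of the paper)
  blablabla
\\
"""

print(extract(sample))
```

`re.DOTALL` lets `.` cross line boundaries, and the lazy quantifiers (`.+?`, `.*?`) keep each group from running past the next delimiter.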

Answer 2: (score: 1)


Do While Not g_RS3.EOF
    With xlSheet.Cells(xlRow, xlCol)
        .Value = g_RS3("Label")
        .Offset(1, 0).Value = "Clients"
        .Offset(1, 1).Value = "Buyers"
        With .Offset(1, 0)
            .Font.Bold = True
            .Borders.Weight = xlThin
        End With
        With .Offset(1, 1)
            .Font.Bold = True
            .Borders.Weight = xlThin
        End With
        With .Resize(1, 2)
            .Font.Bold = True
            .WrapText = True
            .VerticalAlignment = xlCenter
            .HorizontalAlignment = xlCenter
            .Borders.Weight = xlThin
        End With
    End With
    xlCol = xlCol + 2

    With xlSheet.Cells(xlRow, xlCol)
        .Value = "TOTAL"
        .Offset(1, 0).Value = "Clients"
        .Offset(1, 1).Value = "Buyers"
        With .Offset(1, 0)
            .Font.Bold = True
            .Borders.Weight = xlThin
        End With
        With .Offset(1, 1)
            .Font.Bold = True
            .Borders.Weight = xlThin
        End With
        With .Resize(1, 2)
            .Font.Bold = True
            .WrapText = True
            .VerticalAlignment = xlCenter
            .HorizontalAlignment = xlCenter
            .Borders.Weight = xlThin
        End With
    End With

Answer 3: (score: 1)



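Assuming 'txtfile.txt' is formatted as shown above, and you are using Python 2.7.x:

import re
with open('txtfile.txt', 'r') as f:
    input_string = f.read()
# groups: title (a single token), authors, and the text between the last two \\ delimiters
p = r'\\[^\\].*[^\\]+Title:\s+(\S+)\s+Authors:\s+(.*)[^\\]+\\+\s+([^\\]*)\n\\'
print(re.findall(p, input_string))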


[('some_title', 'name_1, name_2', 'blablabla (this is a multiline abstract of the paper)\n  blablabla\n  blablabla')]
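
Since the question mentions a large number of files, the same pattern could be applied to every file in a directory. This is a minimal sketch, assuming all files sit in one folder and share the layout shown in the question. The function name `extract_all` and the default `*.txt` glob are hypothetical.

```python
import re
from pathlib import Path

# Same pattern as in the answer above, compiled once and reused for every file.
PATTERN = re.compile(
    r'\\[^\\].*[^\\]+Title:\s+(\S+)\s+Authors:\s+(.*)[^\\]+\\+\s+([^\\]*)\n\\'
)


def extract_all(folder, glob="*.txt"):
    """Map each file name to a list of (title, authors, abstract) tuples."""
    results = {}
    for path in sorted(Path(folder).glob(glob)):
        text = path.read_text()          # uses the platform default encoding
        results[path.name] = PATTERN.findall(text)
    return results


# Example call (the directory name is a placeholder):
# results = extract_all("path/to/files")
```

Note that `(\S+)` only captures a title without whitespace, such as `some_title`. A title made of several words would need a different group, for example `(.*)`.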