Python: extracting a description from a large text file

Time: 2015-02-22 19:27:00

Tags: python

I have a big text file, and I need to extract the description information:

#### **Description**

20_Ways_To_Make_100_Dollars_EVERYDAY !!!  
High Quality Guide (PDF File)  
Here; I will teach you how to make 100 dollars every, or may be even more!  
Buy the guide to get this secret method. ! worth more than you pay!  
Good luck to everyone!



#### **Ships To**

Worldwide

The description starts at "Description" and ends at "#### Ships To". How can I do this with Python? I need this output:

20_Ways_To_Make_100_Dollars_EVERYDAY !!!  
High Quality Guide (PDF File)  
Here; I will teach you how to make 100 dollars every, or may be even more!  
Buy the guide to get this secret method. ! worth more than you pay!  
Good luck to everyone!

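3 answers:

Answer 0 (score: 1)

Assuming the sections after '####' come in more varieties than shown here, I suggest using a stricter format standard when parsing the file: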

import re #regular expressions module

with open('text_to_process.txt', 'r') as f:  # open the file; it is closed automatically
    text = f.readlines()  # read all lines into a list

flag = False #flag to mark start/end of description

for line in text:
    if re.match(r"#### \*\*Description\*\*", line):
        flag = True
        continue
    if flag: 
        if not re.match("####", line):
            print(line.strip()) #just printing the line, alternatively you could write it into file or variable
        else:
            flag = False
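Since the file is described as big, the same flag idea can also be written as a generator that reads one line at a time instead of calling `readlines()`. This is only a sketch: the function name `iter_description` and the inline sample data are assumptions made for illustration, and unlike the loop above it stops after the first description block.

```python
import io
import re

START = re.compile(r"#### \*\*Description\*\*")  # header that opens the description
HEADER = re.compile(r"####")                     # any later '####' header closes it


def iter_description(lines):
    """Yield stripped lines between the Description header and the next header."""
    inside = False
    for line in lines:
        if START.match(line):
            inside = True
            continue  # skip the header line itself
        if inside:
            if HEADER.match(line):
                break  # the next section has started
            yield line.strip()


# io.StringIO stands in for an open file here (hypothetical sample data).
sample = io.StringIO(
    "#### **Description**\n"
    "\n"
    "High Quality Guide (PDF File)  \n"
    "Good luck to everyone!\n"
    "\n"
    "#### **Ships To**\n"
    "\n"
    "Worldwide\n"
)
description = [line for line in iter_description(sample) if line]
print("\n".join(description))
```

With a real file, passing the object returned by `open(...)` in place of `sample` should work the same way, because a text file object also yields one line per iteration.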

Answer 1 (score: 0)

If you know exactly what the headers look like, try:

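in_description = False
part = ""
for line in file:  # assumes `file` is an already-open file object
    if not in_description:
        # look for the header that starts the description
        in_description = '**Description**' in line
        continue  # do not include the header line itself
    # stop collecting once the next header is reached
    in_description = '**Ships To**' not in line
    if in_description:
        part += line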

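I wrote this on my phone, so please excuse any typos. What this code does (assuming you have an open file) is read each line looking for the one that turns in_description true. Once it is true, the code checks that the current line is not the closing header and, if it is not, appends the line to part. I am not at a computer, so I am not 100% sure whether you need a '\n' at the end of each line (i.e. whether you need "part += line + '\n'"), but if everything comes out on one line, then you do need it. I suggest making these constants as specific as possible, including some of the #s.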
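On the '\n' question: iterating over a text file in Python yields each line with its trailing newline still attached, so plain concatenation should not need an extra one. A minimal sketch, with `io.StringIO` and its sample text standing in for the open file (both are assumptions for illustration):

```python
import io

# Stand-in for an open text file (hypothetical sample data).
fake_file = io.StringIO("first line\nsecond line\n")

part = ""
for line in fake_file:
    # Each line still ends with "\n" (the last line of a real file may not).
    part += line

print(repr(part))
```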

Answer 2 (score: 0)

  • Iterate through the file until the Description line is found, then
  • print lines until the Ships To line is found

with open('data', 'r') as f:
    # iterate through f until Description line found
    for line in f:
        if line.startswith('#### **Description**'):
            break
    # print lines until Ships To line is found
    for line in f:
        if line.startswith('#### **Ships To**'):
            break
        print(line.rstrip('\n'))  # the line already ends with a newline

break terminates the for-loop. But since f is an iterator, the next for-loop picks up where the previous one left off. So the two for-loops together make only a single pass over the file.
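The resuming behaviour can be checked on a tiny example. This is a sketch only: `io.StringIO`, the `STOP` marker, and the sample lines are assumptions standing in for the real file and headers.

```python
import io

# Hypothetical data standing in for the file; "STOP" plays the role of a header.
f = io.StringIO("a\nb\nSTOP\nc\nd\n")

first = []
for line in f:
    if line.startswith("STOP"):
        break  # leaves the iterator positioned just after the STOP line
    first.append(line.strip())

# The second loop resumes where the first one stopped; it does not start over.
second = [line.strip() for line in f]

print(first)
print(second)
```

With a list (for example the result of `readlines()`), each for-loop would start again from the first element, so this pattern relies on iterating over the file object itself.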