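Separating text blocks with Python

Time: 2018-05-30 00:52:02

Tags: python text block

I would like to know how to separate blocks of text within the same text file. An example is below. Basically I have two items: one starts with "Channel 9" and ends with the "Brief: ..." lines, and the other starts with "Southern ..." and again ends with the "Brief" lines. How can I split them into 2 text files with Python? I think the common divider is "(female 16+)". Many thanks!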

Channel 9 (1 item)

A woman selling her caravan near Bendigo has been left 
$1,100 out
hosted by Peter Hitchener
A woman selling her caravan near Bendigo has been left $1,100 out of 
pocket after an elderly couple made the purchase with counterfeit money. 
The wildlife worker tried to use the notes to pay for a house deposit, but an 
agent noticed the notes were missing the Coat of Arms on one side. 


Brief: Radio & TV
Demographics: 153,000 (male 16+) • 177,000 (female 
16+)

Southern Cross Victoria Bendigo (1 item)


Heathcote Police are warning the residents to be on the 
lookout a
hosted by Jo Hall
Heathcote Police are warning the residents to be on the lookout after a large 
dash of fake $50 note was discovered. Victim Marianne Thomas was given 
counterfeit notes from a caravan. The Heathcote resident tried to pay the 
house deposit and that's when the counterfeit notes were spotted. Thomas 
says the caravan is in town for the Spanish Festival.


Brief: Radio & TV
Demographics: 4,000 (male 16+) • 3,000 (female 16+)

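4 answers:

Answer 0 (score: 2)

Here is a modified example of something similar I did recently. It walks through your text and copies it line by line. The core logic appends each line to the file named by current_file, which is reset once the end of a section is found. The first line of the next section is then used as the next filename.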

#!/usr/bin/env python
import re

data = """
Channel 9 (1 item)

A woman selling her caravan near Bendigo has been left $1,100 out hosted by
Peter Hitchener A woman selling her caravan near Bendigo has been left $1,100
out of pocket after an elderly couple made the purchase with counterfeit money.
The wildlife worker tried to use the notes to pay for a house deposit, but an
agent noticed the notes were missing the Coat of Arms on one side.

Brief: Radio & TV Demographics: 153,000 (male 16+) • 177,000 (female 16+)

Southern Cross Victoria Bendigo (1 item)

Heathcote Police are warning the residents to be on the lookout a hosted by Jo
Hall Heathcote Police are warning the residents to be on the lookout after a
large dash of fake $50 note was discovered. Victim Marianne Thomas was given
counterfeit notes from a caravan. The Heathcote resident tried to pay the house
deposit and that's when the counterfeit notes were spotted. Thomas says the
caravan is in town for the Spanish Festival.

Brief: Radio & TV Demographics: 4,000 (male 16+) • 3,000 (female 16+)
"""



current_file = None
for line in data.split('\n'):

    # Set the filename from the first non-blank line of a section
    if current_file is None and line != '':
        current_file = line + '.txt'

    # Skip blank lines between sections (e.g. the one after Brief)
    if current_file is None:
        continue

    # Append the line to the current section's file
    with open(current_file, "a") as text_file:
        text_file.write(line + "\n")

    # Reset the filename once this section is finished,
    # which is identified by a line that:
    #    starts with Brief: - ^Brief:
    #    contains some arbitrary amount of text - .*
    #    ends with ) - \)$
    if re.match(r'^Brief:.*\)$', line) is not None:
        current_file = None

This will output the following files:

Channel 9 (1 item).txt
Southern Cross Victoria Bendigo (1 item).txt

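Answer 1 (score: 1)

Actually, I suspect you really want to break after a line that starts with Demographics:, or before a line that ends with (1 item), (2 items), or similar.

But however you want to break things up, there are two steps:

  1. Come up with a rule that you can turn into a function you call on each line.
  2. Write some code that groups things based on the result of that function.

Let's use your rule. The function could be: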

    def is_last_line(line):
        return line.strip().endswith('(female 16+)')
    

Now you can use that function to group things:

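    # infile is assumed to be an open text file (or any iterable of lines)
    i = 1
    outfile = open(f'outfile{i}.txt', 'w')
    for line in infile:
        outfile.write(line)  # keep the newline; strip() would glue lines together
        if is_last_line(line):
            # Close the finished file before starting the next one
            outfile.close()
            i += 1
            outfile = open(f'outfile{i}.txt', 'w')
    outfile.close()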
    

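You could make this more concise in various ways, for example by using itertools.groupby, itertools.takewhile, iter, or other functions. Or you could write a generator function that still does things manually but yields groups of lines, which would make creating the new files simpler (and would let us use with blocks). But being this explicit probably makes it easier for a novice to understand (and to debug, and to extend later), at the cost of some verbosity.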

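For example, from the way you asked the question, it isn't clear whether you actually want the Demographics: line to show up in the output files. If you don't, it should be obvious how to change things: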

        if not is_last_line(line):
            outfile.write(line)
        else:
            # Drop the divider line and move on to the next file
            outfile.close()
            i += 1
            outfile = open(f'outfile{i}.txt', 'w')
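
The generator variant mentioned above could look roughly like the sketch below. It reuses is_last_line from this answer; the names blocks and write_blocks, the output filename pattern, and the shortened sample text are assumptions made for illustration only.

```python
import io


def is_last_line(line):
    return line.strip().endswith('(female 16+)')


def blocks(lines):
    """Yield lists of lines, one list per block ending at is_last_line."""
    group = []
    for line in lines:
        if not group and not line.strip():
            continue  # skip blank lines before a block starts
        group.append(line)
        if is_last_line(line):
            yield group
            group = []
    if group:
        yield group  # trailing lines with no closing Demographics line


def write_blocks(lines, pattern='outfile{}.txt'):
    """Write each block to its own file (hypothetical naming scheme)."""
    for i, group in enumerate(blocks(lines), start=1):
        with open(pattern.format(i), 'w') as outfile:
            outfile.writelines(group)


# Shortened stand-in for the real input file
sample = (
    "Channel 9 (1 item)\n\nFirst story.\n\n"
    "Brief: Radio & TV\n"
    "Demographics: 153,000 (male 16+) • 177,000 (female 16+)\n\n"
    "Southern Cross Victoria Bendigo (1 item)\n\nSecond story.\n\n"
    "Brief: Radio & TV\n"
    "Demographics: 4,000 (male 16+) • 3,000 (female 16+)\n"
)
groups = list(blocks(io.StringIO(sample)))
```

Because each group is complete before it is written, the with block can own the whole lifetime of each output file, and no empty trailing file is created.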
    

Answer 2 (score: 1)

This is somewhat hard-coded, but it does the job:

s = """Channel 9 (1 item)

A woman selling her caravan near Bendigo has been left $1,100 out hosted by Peter Hitchener A woman selling her caravan near Bendigo has been left $1,100 out of pocket after an elderly couple made the purchase with counterfeit money. The wildlife worker tried to use the notes to pay for a house deposit, but an agent noticed the notes were missing the Coat of Arms on one side.

Brief: Radio & TV Demographics: 153,000 (male 16+) • 177,000 (female 16+)

Southern Cross Victoria Bendigo (1 item)

Heathcote Police are warning the residents to be on the lookout a hosted by Jo Hall Heathcote Police are warning the residents to be on the lookout after a large dash of fake $50 note was discovered. Victim Marianne Thomas was given counterfeit notes from a caravan. The Heathcote resident tried to pay the house deposit and that's when the counterfeit notes were spotted. Thomas says the caravan is in town for the Spanish Festival.

Brief: Radio & TV Demographics: 4,000 (male 16+) • 3,000 (female 16+)"""

part_1 = s[s.index("Channel 9"):s.index("Southern Cross")]

part_2 = s[s.index("Southern Cross"):]

Then save them to files.
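
The saving step is not shown in this answer; one possible way to do it is sketched below. The helper name save_parts, the part_{}.txt naming pattern, and the shortened sample string are assumptions, and the demo writes into a temporary directory so it leaves nothing behind.

```python
import tempfile
from pathlib import Path


def save_parts(parts, directory='.', pattern='part_{}.txt'):
    """Write each string in parts to its own file and return the paths."""
    paths = []
    for number, part in enumerate(parts, start=1):
        path = Path(directory) / pattern.format(number)
        # Trim surrounding blank lines and end the file with one newline
        path.write_text(part.strip() + '\n', encoding='utf-8')
        paths.append(path)
    return paths


# Shortened stand-in for the string s used above
s = ("Channel 9 (1 item)\n\nFirst story.\n\n"
     "Southern Cross Victoria Bendigo (1 item)\n\nSecond story.\n")
part_1 = s[s.index("Channel 9"):s.index("Southern Cross")]
part_2 = s[s.index("Southern Cross"):]

with tempfile.TemporaryDirectory() as tmp:
    paths = save_parts([part_1, part_2], directory=tmp)
    names = [p.name for p in paths]
    saved = [p.read_text(encoding='utf-8') for p in paths]
```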

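Answer 3 (score: 1)

It looks like the lines that start with "Demographics:" act as the true dividers. I would use regular expressions in two ways: first, to split the text at those lines; second, to extract the lines themselves. The results can then be combined to reconstruct the blocks: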

import re
# text is assumed to hold the full contents of the input file
DIVIDER = 'Demographics: .+' # Make it tunable, in case you change your mind
blocks_1 = re.split(DIVIDER, text)
blocks_2 = re.findall(DIVIDER, text)
blocks = ['\n\n'.join(pair) for pair in zip(blocks_1, blocks_2)]
blocks[0]
#Channel 9 (1 item)\n\nA woman selling her caravan near ...
#... Demographics: 153,000 (male 16+) • 177,000 (female 16+)
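
As a variation on the same idea (not the code from this answer), the split and the extraction can be done in one call: when the pattern contains a capture group, re.split also returns the matched dividers. The sketch below assumes a shortened sample text, and the function name split_blocks is made up for illustration.

```python
import re

# Capture group: re.split keeps the matched divider lines in its result
DIVIDER = r'(Demographics: .+)'


def split_blocks(text):
    """Return each block with its Demographics line re-attached."""
    parts = re.split(DIVIDER, text)
    # parts alternates: body, divider, body, divider, ..., trailing text
    bodies, dividers = parts[0::2], parts[1::2]
    return [(body + divider).strip() for body, divider in zip(bodies, dividers)]


# Shortened stand-in for the real input text
sample = (
    "Channel 9 (1 item)\n\nFirst story.\n\n"
    "Brief: Radio & TV\n"
    "Demographics: 153,000 (male 16+) • 177,000 (female 16+)\n\n"
    "Southern Cross Victoria Bendigo (1 item)\n\nSecond story.\n\n"
    "Brief: Radio & TV\n"
    "Demographics: 4,000 (male 16+) • 3,000 (female 16+)\n"
)
blocks = split_blocks(sample)
```

Note that `.+` stops at the end of a line, so this only works when each Demographics entry sits on a single line.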