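I would like to know how to separate blocks of text that sit in the same text file. An example is below. Basically I have two items: one runs from "Channel 9" through its "Brief: ..." line, and the other starts with "Southern ..." and again runs through its "Brief" line. How can I split them into 2 text files with Python? I think the common divider is "(female 16+)". Thanks a lot!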
Channel 9 (1 item)
A woman selling her caravan near Bendigo has been left
$1,100 out
hosted by Peter Hitchener
A woman selling her caravan near Bendigo has been left $1,100 out of
pocket after an elderly couple made the purchase with counterfeit money.
The wildlife worker tried to use the notes to pay for a house deposit, but an
agent noticed the notes were missing the Coat of Arms on one side.
Brief: Radio & TV
Demographics: 153,000 (male 16+) • 177,000 (female
16+)
Southern Cross Victoria Bendigo (1 item)
Heathcote Police are warning the residents to be on the
lookout a
hosted by Jo Hall
Heathcote Police are warning the residents to be on the lookout after a large
dash of fake $50 note was discovered. Victim Marianne Thomas was given
counterfeit notes from a caravan. The Heathcote resident tried to pay the
house deposit and that's when the counterfeit notes were spotted. Thomas
says the caravan is in town for the Spanish Festival.
Brief: Radio & TV
Demographics: 4,000 (male 16+) • 3,000 (female 16+)
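Answer 0 (score: 2)
This is a modified example of something similar I did recently. It basically walks through your text and copies it line by line. The core logic appends each line to the current file name, which is reset once the end of a section is found. The first line of the next section is then used as the new file name.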
#!/usr/bin/env python
import re
data = """
Channel 9 (1 item)
A woman selling her caravan near Bendigo has been left $1,100 out hosted by
Peter Hitchener A woman selling her caravan near Bendigo has been left $1,100
out of pocket after an elderly couple made the purchase with counterfeit money.
The wildlife worker tried to use the notes to pay for a house deposit, but an
agent noticed the notes were missing the Coat of Arms on one side.
Brief: Radio & TV Demographics: 153,000 (male 16+) • 177,000 (female 16+)
Southern Cross Victoria Bendigo (1 item)
Heathcote Police are warning the residents to be on the lookout a hosted by Jo
Hall Heathcote Police are warning the residents to be on the lookout after a
large dash of fake $50 note was discovered. Victim Marianne Thomas was given
counterfeit notes from a caravan. The Heathcote resident tried to pay the house
deposit and that's when the counterfeit notes were spotted. Thomas says the
caravan is in town for the Spanish Festival.
Brief: Radio & TV Demographics: 4,000 (male 16+) • 3,000 (female 16+)
"""
current_file = None
for line in data.split('\n'):
    # Set initial filename from the first non-blank line of a section
    if current_file is None and line != '':
        current_file = line + '.txt'
    # This is to handle the blank line after Brief
    if current_file is None:
        continue
    # Append the line to the file for the current section
    with open(current_file, "a") as text_file:
        text_file.write(line + "\n")
    # Reset filename if we have finished this section,
    # which is identified by a line that:
    # starts with Brief - ^Brief
    # contains some arbitrary amount of text - .*
    # ends with ) - \)$
    if re.match(r'^Brief:.*\)$', line) is not None:
        current_file = None
This will output the following files:
Channel 9 (1 item).txt
Southern Cross Victoria Bendigo (1 item).txt
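Answer 1 (score: 1)
Actually, I suspect you really want to break after lines that start with Demographics:, or before lines that end with (1 item) or (2 items) or similar.
But however you want to break things up, there are two steps: turn your rule into a function, then use that function to group the lines.
Let's use your rule. A function for it could be: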
def is_last_line(line):
    return line.strip().endswith('(female 16+)')
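Now you can use that function to group things:
# infile is assumed to be the input file, already opened for reading
i = 1
outfile = open(f'outfile{i}.txt', 'w')
for line in infile:
    # strip() removes the newline, so add it back
    outfile.write(line.strip() + '\n')
    if is_last_line(line):
        # Close the finished file before starting the next one
        outfile.close()
        i += 1
        outfile = open(f'outfile{i}.txt', 'w')
outfile.close()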
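You can make this more concise in a variety of ways, for example with itertools.groupby, itertools.takewhile, iter, or other functions. Or you could write a generator function that still does the work manually but yields groups of lines, which would make creating the new files simpler (it would let us use with blocks). Being this explicit probably makes the code easier for a beginner to understand (and to debug, and to extend later), at the cost of some verbosity.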
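The generator idea mentioned above could look roughly like the following. This is only a sketch: the names split_blocks and write_blocks, the is_end parameter, and the outfile prefix are illustrative choices, not part of the original answer.

```python
def is_last_line(line):
    return line.strip().endswith('(female 16+)')


def split_blocks(lines, is_end):
    """Yield lists of lines, closing each list at a line where is_end(line) is true."""
    block = []
    for line in lines:
        block.append(line.rstrip('\n'))
        if is_end(line):
            yield block
            block = []
    if block:  # leftover lines that were never closed by a divider
        yield block


def write_blocks(lines, prefix='outfile'):
    """Write each block to prefix1.txt, prefix2.txt, ... and return the names."""
    names = []
    for i, block in enumerate(split_blocks(lines, is_last_line), start=1):
        name = f'{prefix}{i}.txt'
        # The with block closes each file as soon as its block is written
        with open(name, 'w', encoding='utf-8') as outfile:
            outfile.write('\n'.join(block) + '\n')
        names.append(name)
    return names
```

Calling write_blocks(infile) on an open input file would then create one file per block, and no empty trailing file is left behind.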
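For example, from the way you worded your question, it isn't clear whether you actually want the Demographics: line to show up in the output files. If you don't, it should be obvious how to change things: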
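if not is_last_line(line):
    # strip() removes the newline, so add it back
    outfile.write(line.strip() + '\n')
else:
    # Skip the divider line and start the next file
    outfile.close()
    i += 1
    outfile = open(f'outfile{i}.txt', 'w')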
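The other rule suggested at the start of this answer, breaking before lines that end with (1 item) or (2 items), can be sketched in the same style. The names HEADER, is_first_line and split_before_headers are made up for illustration, and the regular expression assumes headers always end in a parenthesised item count. One possible advantage: in the sample as posted, the first Demographics: line wraps before "16+)", so a check for a line ending in "(female 16+)" would miss it, while a header-based rule is unaffected.

```python
import re

# Matches header lines such as "Channel 9 (1 item)" or "Some Station (12 items)"
HEADER = re.compile(r'\(\d+ items?\)\s*$')


def is_first_line(line):
    return HEADER.search(line) is not None


def split_before_headers(lines):
    """Group lines into blocks, starting a new block at every header line."""
    blocks = []
    for line in lines:
        # Also start a block if the input does not begin with a header
        if is_first_line(line) or not blocks:
            blocks.append([])
        blocks[-1].append(line.rstrip('\n'))
    return blocks
```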
Answer 2 (score: 1)
Here is a partly hard-coded approach that does the job:
s = """Channel 9 (1 item)
A woman selling her caravan near Bendigo has been left $1,100 out hosted by Peter Hitchener A woman selling her caravan near Bendigo has been left $1,100 out of pocket after an elderly couple made the purchase with counterfeit money. The wildlife worker tried to use the notes to pay for a house deposit, but an agent noticed the notes were missing the Coat of Arms on one side.
Brief: Radio & TV Demographics: 153,000 (male 16+) • 177,000 (female 16+)
Southern Cross Victoria Bendigo (1 item)
Heathcote Police are warning the residents to be on the lookout a hosted by Jo Hall Heathcote Police are warning the residents to be on the lookout after a large dash of fake $50 note was discovered. Victim Marianne Thomas was given counterfeit notes from a caravan. The Heathcote resident tried to pay the house deposit and that's when the counterfeit notes were spotted. Thomas says the caravan is in town for the Spanish Festival.
Brief: Radio & TV Demographics: 4,000 (male 16+) • 3,000 (female 16+)"""
part_1 = s[s.index("Channel 9"):s.index("Southern Cross")]
part_2 = s[s.index("Southern Cross"):]
Then save them to files.
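The saving step is not shown in the answer. One way it might be done is sketched below; the helper name save_parts and the part_N.txt naming scheme are assumptions made for this example.

```python
from pathlib import Path


def save_parts(parts, directory='.', prefix='part_'):
    """Write each string in parts to its own numbered text file; return the paths."""
    out_dir = Path(directory)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, part in enumerate(parts, start=1):
        path = out_dir / f'{prefix}{i}.txt'
        path.write_text(part, encoding='utf-8')
        paths.append(path)
    return paths
```

With the variables from the answer, save_parts([part_1, part_2]) would write part_1.txt and part_2.txt to the current directory.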
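Answer 3 (score: 1)
It looks like the lines starting with "Demographics:" act as the real dividers. I would use regular expressions in two ways: first, to split the text at those lines; second, to extract the lines themselves. The results can then be combined to rebuild the blocks: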
import re
DIVIDER = 'Demographics: .+' # Make it tunable, in case you change your mind
blocks_1 = re.split(DIVIDER, text)
blocks_2 = re.findall(DIVIDER, text)
blocks = ['\n\n'.join(pair) for pair in zip(blocks_1, blocks_2)]
blocks[0]
#Channel 9 (1 item)\n\nA woman selling her caravan near ...
#... Demographics: 153,000 (male 16+) • 177,000 (female 16+)
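To see why pairing the two lists works, here is a small self-contained run on made-up text (the station names and numbers are placeholders, not the original data). re.split returns one more piece than there are dividers, and zip stops at the shorter list, so the remainder after the last divider is dropped. This sketch joins each pair with a single newline rather than the '\n\n' used above, which is purely a formatting choice.

```python
import re

DIVIDER = r'Demographics: .+'

text = (
    "Station A (1 item)\n"
    "First story.\n"
    "Brief: Radio & TV\n"
    "Demographics: 1,000 (male 16+) • 2,000 (female 16+)\n"
    "Station B (1 item)\n"
    "Second story.\n"
    "Brief: Radio & TV\n"
    "Demographics: 3,000 (male 16+) • 4,000 (female 16+)\n"
)

bodies = re.split(DIVIDER, text)      # text between dividers: one more piece than dividers
dividers = re.findall(DIVIDER, text)  # the divider lines themselves

# Trim the newlines left around each body, then re-attach its divider line
blocks = [body.strip('\n') + '\n' + divider
          for body, divider in zip(bodies, dividers)]
```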