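I am trying to split a file that contains two articles into two separate files, one article per file, so that I can analyse the articles later. Each article in the original file is preceded by an ID, which I want to use to split the file with a regular expression (RE).
Here is the initial input file, with the ID numbers: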
166068619 #### "Epilepsy: let's end our ignorance of this neglected condition
Helen Stephens is a young woman with epilepsy [...]."
106899978 #### "Great British Payoff shows that BBC governance is broken
If it was a television series, they'd probably call it [...]."
However, when I run my code, I do get two separate files as output, but they are empty.
Here is my code:
def file_split(path_to_file):
    """Function splits bigger file into N smaller ones, based on a certain RE
    match, that is used to break the bigger file into smaller ones"""
    def pattern_extract(path_to_file):
        """Function identifies the number of RE occurences in a file,
        No. can be used in further analysis as range No."""
        import re
        x = []
        with open(path_to_file) as f:
            for line in f:
                match = re.search(r'^\d+?\t####\t', line)
                if match:
                    a = match.group()
                    x.append(a)
        return len(x)
    y = pattern_extract(path_to_file)
    m = y + 1
    files = [open('filename%i.txt' %i, 'w') for i in range(1,m)]
    with open(path_to_file) as f:
        for line in f:
            match = re.search(r'^\d+?\t####\t', line)
            if match:
                a = match.group()
                #files = [open('filename%i.txt' %i, 'w') for i in range(1, m)]
                files[i-1].write(a)
    for f in files:
        f.close()
    return files
The output looks like this:
file_split(path)
Out[19]:
[<open file 'filename1.txt', mode 'w' at 0x7fe121b130c0>,
<open file 'filename2.txt', mode 'w' at 0x7fe121b131e0>]
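I am new to Python and not sure where the problem lies. I have looked at other answers about writing to multiple output files but could not work out a solution. Any help is much appreciated.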
Answer 0 (score: 0)
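There are two problems with your code. First, you only write `a`, the matched ID prefix, and only for lines that match the pattern, so the article text itself never reaches any file. Second, you always write to the same file, `files[i-1]`, because `i` is the loop variable "left over" from the list comprehension (the variable only leaks like this in Python 2; in Python 3 the name would be undefined at that point).
To fix it, you can change the lower half of your code to the following, which also uses `\s+` instead of `\t` so that the pattern matches whether the fields are separated by tabs or by spaces: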
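y = pattern_extract(path_to_file)
files = [open('filename%i.txt' % i, 'w') for i in range(y)]
n = -1  # index of the file currently being written
with open(path_to_file) as f:
    for line in f:
        # an ID line starts the next article, so move on to the next file
        if re.search(r'^\d+\s+####\s+', line):
            n += 1
        # write every line, not just the match, to the current file
        files[n].write(line)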
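The choice of separator in the pattern can be checked in isolation. The following is a minimal sketch, assuming the real file might separate the ID, `####` and the title with spaces (as they appear in the sample in the question) rather than tabs; the names `TAB_PATTERN`, `WS_PATTERN` and `matches` are only illustrative.

```python
import re

TAB_PATTERN = re.compile(r'^\d+?\t####\t')  # pattern from the question
WS_PATTERN = re.compile(r'^\d+\s+####\s+')  # pattern used in this answer

# The same ID line, once separated by spaces and once by tabs (assumed variants).
space_line = '166068619 #### "Epilepsy: let\'s end our ignorance\n'
tab_line = '166068619\t####\t"Epilepsy: let\'s end our ignorance\n'


def matches(pattern, line):
    """Return True if the pattern is found in the line."""
    return pattern.search(line) is not None


for name, line in (('spaces', space_line), ('tabs', tab_line)):
    print(name, matches(TAB_PATTERN, line), matches(WS_PATTERN, line))
```

If the tab-only pattern reports no match for your real ID lines, nothing is ever written in the original code, which would be consistent with empty output files.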
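But you do not need to read the file twice just to count the matches: open a new file whenever a line matches the ID pattern, always write to the last file in the list, and close all the files at the end.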
open_files = []
with open(path_to_file) as f:
    for line in f:
        if re.search(r'^\d+\s+####\s+', line):
            open_files.append(open('filename%d.txt' % len(open_files), 'w'))
        open_files[-1].write(line)
for f in open_files:
    f.close()
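For Python 3, the single-pass idea can be wrapped in a function that keeps only one output file open at a time. This is a minimal sketch, not the only way to do it: the function name `split_articles`, the `out_dir` parameter, the `article_<ID>.txt` naming and the UTF-8 encoding are all assumptions made for illustration. Lines that appear before the first ID line are skipped.

```python
import os
import re

ID_LINE = re.compile(r'^(\d+)\s+####\s+')


def split_articles(path_to_file, out_dir='.'):
    """Write each article to its own file and return the paths written."""
    written = []
    out = None  # output file for the article currently being copied
    try:
        # UTF-8 is an assumption; adjust to the encoding of your data.
        with open(path_to_file, encoding='utf-8') as src:
            for line in src:
                match = ID_LINE.match(line)
                if match:
                    # An ID line starts a new article: close the previous file.
                    if out is not None:
                        out.close()
                    name = 'article_%s.txt' % match.group(1)
                    out_path = os.path.join(out_dir, name)
                    out = open(out_path, 'w', encoding='utf-8')
                    written.append(out_path)
                if out is not None:
                    out.write(line)
    finally:
        if out is not None:
            out.close()
    return written
```

Calling `split_articles(path)` returns the list of files it created. Because the output name is built from the ID, two articles with the same ID would overwrite each other; a running counter, as in the snippet above, avoids that.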