I am trying to use a regular expression to split lines and write a new file in a delimited format.
import re, sys
with open('raw.txt', 'rb') as f:
    s = f.read()
result = re.split('\s+(\d+)\s+', s.split('n'))
print result
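However, I keep getting an "expected string or buffer" error. I tried running this code on a smaller test file and it worked fine. But I think my original file is too large (> 20000 lines), so read() may be causing a memory problem.
So I tried using f.readlines() like this: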
f = open('raw.txt', 'r')
for line in f.readlines():
    result = re.split('\s+(\d+)\s+', line)
    print result
But it doesn't work well. Can anyone help? Thanks!
FYI, my original file looks like this:
f = open('raw.txt', 'r')
dline = f.readlines()
dline
['Alcoholic liver disease 7124 TNF\n', 'Alcoholic liver disease 3557 IL1RN\n', 'Alcoholic liver disease 929 CD14\n', 'Alopecia 3572 IL6ST\n', 'Alopecia 3976 LIF\n', 'Alopecia 1489 CTF1\n', "Alzheimer's disease 5300 PIN1\n", "Alzheimer's disease 6667 SP1\n", "Alzheimer's disease 3316 HSPB2\n", "Alzheimer's disease 3320 HSP90AA1\n", "Alzheimer's disease 8851 CDK5R1\n", 'Aseptic necrosis of bone 302 ANXA2\n', 'Aseptic necrosis of bone 1499 CTNNB1\n', 'Aseptic necrosis of bone 2147 F2\n', 'Aseptic necrosis of bone 2153 F5\n', 'Aseptic necrosis of bone 5054 SERPINE1\n']
So what I want is a new file that looks like this:
results ## string+'\t'+ integer + '\t' + string +'\n'
['Alcoholic liver disease', '7124', 'TNF\nAlcoholic liver disease', '3557', 'IL1RN\nAlcoholic liver disease', '929', 'CD14\nAlopecia', '3572', 'IL6ST\nAlopecia', '3976', 'LIF\nAlopecia', '1489', "CTF1\nAlzheimer's disease", '5300', "PIN1\nAlzheimer's disease", '6667',
Answer 0 (score: 0)
Alternatively,
f = open('raw.txt', 'r')
for line in f.readlines():
    # Split from the right at most twice so the multi-word name stays in one piece.
    result = line.rsplit(None, 2)
    print result
Hope this helps.
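For what it's worth, the "expected string or buffer" error in the question most likely comes from passing a list (the result of s.split(...)) to re.split, which expects a single string, rather than from the file size. Below is a minimal sketch of splitting each line and writing the tab-delimited output. It assumes every line has the form "free text, numeric ID, single-token symbol" as in the sample data; the names split_record and convert, and any output file name you pass in, are hypothetical and not from the original post.

```python
import re

# Assumed layout: "<free text> <digits> <token>", e.g. "Alopecia 3976 LIF".
PATTERN = re.compile(r'^(.*\S)\s+(\d+)\s+(\S+)\s*$')

def split_record(line):
    """Return (name, id, symbol) for one line, or None if it does not fit."""
    match = PATTERN.match(line)
    if match is None:
        return None
    return match.groups()

def convert(src_path, dst_path):
    """Write each matching line of src_path to dst_path as name<TAB>id<TAB>symbol."""
    # Iterating over the file object reads one line at a time.
    with open(src_path) as src, open(dst_path, 'w') as dst:
        for line in src:
            fields = split_record(line)
            if fields is not None:
                dst.write('\t'.join(fields) + '\n')
```

Calling convert('raw.txt', 'out.tsv') would then produce one tab-separated record per input line; lines that do not match the assumed layout are skipped silently, which you may want to change to a warning.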