Splitting strings and a TypeError

Time: 2014-08-03 16:04:54

Tags: python regex

I am using a regular expression to try to split lines and write a new file in a delimited format.

import re,sys

with open('raw.txt', 'rb') as f:
    s = f.read()
    result = re.split('\s+(\d+)\s+', s.split('n'))
    print result

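However, I keep getting an "expected string or buffer" error. I tried running this code on a smaller test file and it worked fine, so I suspect my original file is too large (> 20,000 lines) and read() may be causing a memory problem.

So I tried using f.readlines() like this: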
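
The wording of the error points at the argument type rather than at the file size: `str.split` returns a list, while `re.split` accepts only a single string as its second argument. A minimal sketch of that, assuming Python 3 (where the message reads along the lines of "expected string or bytes-like object" instead of "expected string or buffer") and a two-line sample in the same layout as the raw file:

```python
import re

# Two sample lines in the same layout as raw.txt (name, numeric ID, symbol).
sample = "Alopecia        3572    IL6ST\nAlopecia        3976    LIF\n"
pattern = r"\s+(\d+)\s+"

# Passing the list returned by str.split() to re.split() raises TypeError,
# regardless of how large the input is.
try:
    re.split(pattern, sample.split("\n"))
    error = None
except TypeError as exc:
    error = exc

# Splitting each line on its own works, because each line is a string.
rows = [re.split(pattern, line) for line in sample.splitlines()]
```

Here `error` holds the TypeError, and `rows` holds one three-element list per line.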

 f = open('raw.txt', 'r') 
 for line in f.readlines():
     result = re.split('\s+(\d+)\s+', line)
     print result

But it does not work well. Can anyone help? Thanks!

FYI, my original file looks like this:

f = open('raw.txt', 'r')
dline = f.readlines()
dline 
['Alcoholic liver disease 7124    TNF\n', 'Alcoholic liver disease 3557    IL1RN\n', 'Alcoholic liver disease 929     CD14\n', 'Alopecia        3572    IL6ST\n', 'Alopecia        3976    LIF\n', 'Alopecia        1489    CTF1\n', "Alzheimer's disease     5300    PIN1\n", "Alzheimer's disease     6667    SP1\n", "Alzheimer's disease     3316    HSPB2\n", "Alzheimer's disease     3320    HSP90AA1\n", "Alzheimer's disease     8851    CDK5R1\n", 'Aseptic necrosis of bone        302     ANXA2\n', 'Aseptic necrosis of bone        1499    CTNNB1\n', 'Aseptic necrosis of bone        2147    F2\n', 'Aseptic necrosis of bone        2153    F5\n', 'Aseptic necrosis of bone        5054    SERPINE1\n']
So, I want the new file to look like this:
results ## string+'\t'+ integer + '\t' + string +'\n'
['Alcoholic liver disease', '7124', 'TNF\nAlcoholic liver disease', '3557', 'IL1RN\nAlcoholic liver disease', '929', 'CD14\nAlopecia', '3572', 'IL6ST\nAlopecia', '3976', 'LIF\nAlopecia', '1489', "CTF1\nAlzheimer's disease", '5300', "PIN1\nAlzheimer's disease", '6667',
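
The layout described above (name, tab, number, tab, symbol, newline) can probably be produced by splitting each line separately and joining the three fields with tabs. A hedged Python 3 sketch: the helper names `to_tsv_line` and `convert` and the output filename `out.tsv` are made up for illustration, and skipping lines that do not match the pattern is an assumption about the desired behavior.

```python
import re

# Whitespace, a captured run of digits, whitespace: the separator between
# the disease name and the gene symbol.
FIELD_SEP = re.compile(r"\s+(\d+)\s+")

def to_tsv_line(line):
    """Return 'name<TAB>number<TAB>symbol<NEWLINE>', or None if the line does not fit."""
    parts = FIELD_SEP.split(line.strip(), maxsplit=1)
    if len(parts) != 3:
        return None
    return "\t".join(parts) + "\n"

def convert(src_path, dst_path):
    """Stream src_path line by line so the whole file is never held in memory."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            out = to_tsv_line(line)
            if out is not None:
                dst.write(out)
```

A call such as `convert('raw.txt', 'out.tsv')` would then write one tab-separated line per matching input line. Note that the pattern splits at the first whitespace-delimited number, so a name that itself contains a standalone number would be cut early.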

1 answer:

Answer 0: (score: 0)

Alternatively,

with open('raw.txt', 'r') as f:
    for line in f:
        # Split at most twice from the right so multi-word names stay in one field.
        result = line.rsplit(None, 2)
        print(result)
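
How many splits are performed matters for this data, because the disease names themselves contain spaces. A small Python 3 sketch on one line from the question: with no limit, every run of whitespace is a split point; limiting `rsplit` to two splits from the right keeps the name in one piece. The variable names are only illustrative.

```python
line = "Alcoholic liver disease 7124    TNF\n"

# No limit: every run of whitespace splits, so the name is broken into words.
every_word = line.rsplit()

# At most two splits, counted from the right: name, number, symbol.
three_fields = line.rsplit(None, 2)

# Joined with tabs, this gives one line of the requested output format.
tsv_line = "\t".join(three_fields) + "\n"
```

Because the splits are counted from the right, this variant does not depend on the name being free of digits; it assumes only that the number and the symbol contain no whitespace.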

Hope this helps.