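How do I extract characters and numbers from each line of a file?

Time: 2014-07-23 13:32:34

Tags: python regex string file-io extraction

I am trying to extract the first character, the second number, and the third character from each line of a file, and store them in three variables named FirstChar, SecondNum, and ThirdChar.

Input file (MultiPointMutation.txt):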

P1T,C11F,E13T
L7A
E2W

Expected output:

FirstChar="PCELE"
SecondNum="1 11 13 7 2"
ThirdChar="TFTAW"

My code:

 import re 
 import itertools
 ns=map(lambda x:x.strip(),open('MultiplePointMutation.txt','r').readlines())#reading  file
 for line in ns:
         second="".join(re.findall(r'\d+',line))#extract second position numbers
         print second # print second nums
         char="".join(re.findall(r'[a-zA-Z]',line))#Extract all characters
         c=str(char.rstrip())
         First=0
         Third=1
         for index in range(len(c)):
                 if index==First:
                         FC=c[index]#here i got all first characters
                         print FC
                         First=First+2
                 if index==Third:
                         TC=c[index]
                         print TC
                         Third=Third+2#here i got all third characters

Output: here the first and third characters come out exactly right.

FirstChar:
          P
          C
          E
          L
          E
ThirdChar:
          T
          F
          T
          A
          W

But the problem is with getting SecondNum:

           SecondNum:
           11113
           7
           2

I want to extract the numbers like this:

          1
          11
          13
          7
          2

Note: I don't just want to print them one by one. I want to be able to read the values of SecondNum one by one for later use.

1 Answer:

Answer 0: (score: 0)

For secondNum, you only need to change this line:

second="".join(re.findall(r'\d+',line))#extract second position numbers

to:

second="\n".join(re.findall(r'\d+',line))#extract second position numbers
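The reason the original line prints 11113 is that re.findall returns a list of the digit runs, and "".join concatenates them with nothing in between. A small sketch of the difference follows; the variable names are only for illustration. Since the question asks to read the numbers one by one later, keeping the list itself (rather than any joined string) is probably the simplest option:

```python
import re

line = "P1T,C11F,E13T"

numbers = re.findall(r'\d+', line)   # list of digit runs, still strings
glued = "".join(numbers)             # separators are lost
separated = "\n".join(numbers)       # one number per line

print(glued)
print(separated)

# keeping the list lets later code consume the values one at a time
as_ints = [int(n) for n in numbers]
for value in as_ints:
    print(value)
```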

But I don't think your first and third characters work properly. Judging from the output you want, you should have something like this:

 import re

 x = """P1T,C11F,E13T
 L7A
 E2W"""

 secondNum = []
 firstChar = []
 thirdChar = []
 for line in x.split('\n'):
     # collect every run of digits found on the line
     secondNum.extend(re.findall(r'\d+', line))

     # The regex looks for the start of the string (^) or a comma (,). That
     # part is a non-capturing group (it starts with (?:), so it is not
     # included in the returned result. The capturing group ([a-zA-Z]) then
     # returns the one letter right after the start or the comma, which
     # should be the first character of each entry.
     firstChar.extend(re.findall(r'(?:^|,)([a-zA-Z])', line))

     # The third character works much the same way: the non-capturing group
     # matches the start of the string or a comma, followed by one character
     # and at least one digit (\d+). The letter right after that number is
     # the third character, which is in the captured group again.
     thirdChar.extend(re.findall(r'(?:^\w\d+|,\w\d+)([a-zA-Z])', line))

 print("firstChar=\"" + str(firstChar) + "\"")
 print("secondNum=\"" + str(secondNum) + "\"")
 print("thirdChar=\"" + str(thirdChar) + "\"")

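Note, though, that your "third character" really is the third character in L7A (where you want the A), but in something like P1TQ it would be the fourth character (where you want the Q).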
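If every entry really has the shape letter, number, letter, one alternative is to capture all three parts with a single pattern and then split the resulting tuples into three sequences. This is only a sketch under that assumption, not what the code above does, and the function name extract_mutations is made up for this example:

```python
import re

def extract_mutations(text):
    # Each match is a tuple: (letter, digit run, letter).
    # Assumes entries look like P1T or C11F; text without that
    # shape simply produces no match.
    triples = re.findall(r'([A-Za-z])(\d+)([A-Za-z])', text)
    if not triples:
        return "", [], ""
    first, nums, third = zip(*triples)
    return "".join(first), [int(n) for n in nums], "".join(third)

data = """P1T,C11F,E13T
L7A
E2W"""

FirstChar, SecondNum, ThirdChar = extract_mutations(data)
print(FirstChar)
print(SecondNum)
print(ThirdChar)
```

Here SecondNum is a list of integers, so later code can loop over it instead of parsing a printed string. For an entry like P1TQ this pattern would capture T as the third part, not Q.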