Python regular expression to track find and replace in a string

Time: 2013-04-09 14:30:16

Tags: python regex numpy

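I am trying to load large text data into NumPy arrays. NumPy's loadtxt and genfromtxt do not work here because:

  • First, I need to remove comment lines, which start with one of the delimiters ['#','!','C'].
  • Second, the data contains a repeat pattern n*value, where n is an integer repeat count and value is a float.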

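So I am trying to read the text file with readlines() and then use NumPy's loadtxt to convert the data to a NumPy array.

For the reading and replacing I tried regular expressions (the re module) but could not get them to work. The following Python code does work, however. My question is: what is the most efficient and Pythonic way to do this?

If a regex is the answer, what is the correct regular expression code to track the find and replace in the list object returned by readlines()? My current code: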

    lines = ['1 2 3*2.5 3 6 1*.3 8 \n', '! comment here\n', '1*1 2.0 2*2.1 3 6 0 8 \n']
    # Drop blank lines and comment lines. Note: deleting from a list while
    # iterating over it skips the element that follows each deleted one.
    for l, line in enumerate(lines):
        if line.strip() == '' or line.strip()[0] in ['#','!','C']:
            del lines[l]
    # Expand each n*value token into n space-separated copies of value.
    for l, line in enumerate(lines):
        repls = [word for word in line.strip().split() if '*' in word]
        for repl in repls:
            n, value = repl.split('*')
            line = line.replace(repl, ' '.join([value] * int(n)))
        lines[l] = line
    print(lines)

The output is:

['1 2 2.5 2.5 2.5 3 6 .3 8 \n', '1 2.0 2.1 2.1 3 6 0 8 \n']

Edit:

After the comments, I edited my Python code as follows:

    in_lines = ['1 2 3*2.5 3 6 1*.3 8 \n', '! comment here\n', '1*1 2.0 2*2.1 3 6 0 8 \n']
    lines = []
    for line in in_lines:
        # Skip blank lines and comment lines.
        if line.strip() == '' or line.strip()[0] in ['#','!','C']:
            continue
        # Expand each n*value token into n space-separated copies of value.
        repls = [word for word in line.strip().split() if '*' in word]
        for repl in repls:
            n, value = repl.split('*')
            # str.join needs strings, so the validated float is converted back to str.
            line = line.replace(repl, ' '.join([str(float(value))] * int(n)))
        lines.append(line)
    print(lines)
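
One caveat about the str.replace approach above: replace works on substrings, so a short token such as 1*2 also matches inside a longer token such as 11*2. The sketch below illustrates this with made-up input and shows a token-by-token variant that avoids the problem. The names expand_by_replace and expand_by_token are hypothetical and used only for this illustration.

```python
def expand_by_replace(line):
    """Expansion via str.replace (same idea as the code above, substring-based)."""
    for repl in [word for word in line.split() if '*' in word]:
        n, value = repl.split('*')
        line = line.replace(repl, ' '.join([value] * int(n)))
    return line


def expand_by_token(line):
    """Expand each whitespace-separated token on its own.

    Note: this rebuilds the line, so runs of whitespace and the trailing
    newline are normalized away.
    """
    out = []
    for word in line.split():
        if '*' in word:
            n, value = word.split('*')
            out.extend([value] * int(n))
        else:
            out.append(word)
    return ' '.join(out)


line = '1*2 11*2'  # made-up input: one 2, then eleven 2s
print(expand_by_replace(line))  # '2 12': the 1*2 inside 11*2 was replaced too
print(expand_by_token(line))    # twelve 2s separated by single spaces
```

Whether this matters depends on the data; it only shows up when one repeat token is a suffix of another on the same line.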

1 Answer:

Answer 0 (score: 1)

The Pythonic way

Using Python's powerful functional features and list comprehensions:

    #!/usr/bin/env python

    lines = ['1 2 3*2.5 3 6 1*.3 8 \n', '! comment here\n', '1*1 2.0 2*2.1 3 6 0 8 \n']

    # filter out blank lines and comments
    lines = [line for line in lines if line.strip() != '' and line.strip()[0] not in ['#','!','C']]

    # turn lines into lists of tokens
    lines = [line.split() for line in lines]

    # turn a list of strings into a number generator, expanding 'n*value' tokens
    def generate_numbers(tokens):
      for token in tokens:
        if '*' in token:
          n, m = token.split("*")
          for i in range(int(n)):
            yield float(m)
        else:
          yield float(token)

    # use the generator to clean up the lines
    lines = [list(generate_numbers(tokens)) for tokens in lines]

    print(lines)

Output:

➤ ./try.py 
[[1.0, 2.0, 2.5, 2.5, 2.5, 3.0, 6.0, 0.3, 8.0], [1.0, 2.0, 2.1, 2.1, 3.0, 6.0, 0.0, 8.0]]
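
The two rows above have 9 and 8 values, so they do not form a rectangular table. If the rows do not need to stay separate, the per-line generator can be chained into one flat sequence. This is only a sketch under that assumption: parse_flat and COMMENT_CHARS are hypothetical names, and generate_numbers is repeated so the block runs by itself.

```python
from itertools import chain

COMMENT_CHARS = ('#', '!', 'C')


def generate_numbers(tokens):
    """Yield floats from tokens, expanding 'n*value' into n copies of value."""
    for token in tokens:
        if '*' in token:
            n, m = token.split('*')
            for _ in range(int(n)):
                yield float(m)
        else:
            yield float(token)


def parse_flat(lines):
    """Return all values from the data lines as one flat list of floats."""
    data_lines = (line for line in lines
                  if line.strip() and not line.strip().startswith(COMMENT_CHARS))
    return list(chain.from_iterable(generate_numbers(line.split())
                                    for line in data_lines))


lines = ['1 2 3*2.5 3 6 1*.3 8 \n', '! comment here\n', '1*1 2.0 2*2.1 3 6 0 8 \n']
values = parse_flat(lines)
print(len(values))  # 17 values in total (9 + 8)
```

A flat list like this can be handed to numpy.array, and the shape can be set afterwards if the layout of the data is known.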

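The fast and small Pythonic way

This solution uses generators instead of lists, so you do not have to load the whole file into memory. Note the use of two idioms:

  1. with open("name") as file

    This cleans up the file handle once the block is exited.

  2. for line in file

    This iterates over the lines of the file lazily, without loading the whole file into memory.

This gives us: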

    #!/usr/bin/env python

    # turn a list of strings into a number generator, expanding 'n*value' tokens
    def generate_numbers(tokens):
      for token in tokens:
        if '*' in token:
          n, m = token.split("*")
          for i in range(int(n)):
            yield float(m)
        else:
          yield float(token)

    # Pull this out to make the code more readable
    def not_comment(line):
      return line.strip() != '' and line.strip()[0] not in ['#','!','C']

    with open("try.dat") as file:
      lines = (
        list(generate_numbers(line.split()))
        for line in file if not_comment(line)
      ) # lines is a lazy generator

      # consume the generator while the file is still open
      for line in lines:
        print(line)
    

Output:

    ➤ ./try.py 
    [1.0, 2.0, 2.5, 2.5, 2.5, 3.0, 6.0, 0.3, 8.0]
    [1.0, 2.0, 2.1, 2.1, 3.0, 6.0, 0.0, 8.0]
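
Since the question asks specifically about regular expressions, here is a minimal sketch of that route, using re.sub with a replacement function. The pattern and the names REPEAT, COMMENT_CHARS, expand_repeats and clean_lines are my own assumptions, not part of the code above. The pattern only matches an unsigned integer count followed by * and a run of non-whitespace characters, so adapt it if your files use other formats.

```python
import re

# Assumed token shape: <unsigned integer>*<value>, delimited by whitespace.
# The lookbehind (?<!\S) makes sure the count starts a token.
REPEAT = re.compile(r'(?<!\S)(\d+)\*(\S+)')
COMMENT_CHARS = ('#', '!', 'C')


def expand_repeats(line):
    """Replace every n*value token with n space-separated copies of value."""
    return REPEAT.sub(lambda m: ' '.join([m.group(2)] * int(m.group(1))), line)


def clean_lines(lines):
    """Yield expanded data lines, skipping blank lines and comment lines."""
    for line in lines:
        stripped = line.strip()
        if stripped and not stripped.startswith(COMMENT_CHARS):
            yield expand_repeats(line)


in_lines = ['1 2 3*2.5 3 6 1*.3 8 \n', '! comment here\n', '1*1 2.0 2*2.1 3 6 0 8 \n']
print(list(clean_lines(in_lines)))
```

The result is still a sequence of strings, which is the form the question wanted to pass on to loadtxt. Keep in mind that loadtxt expects every row to have the same number of values, which is not the case for the two sample lines here.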