Easiest way to ignore blank lines when reading a file in Python

Asked: 2011-01-30 09:05:21

Tags: python

I have some code that reads a file of names and creates a list:

names_list = open("names", "r").read().splitlines()

Each name is separated by a newline, like so:

Allman
Atkinson

Behlendorf 

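I want to ignore any lines that contain only whitespace. I know I can do this by writing a loop that checks each line as I read it and adds it to the list if it isn't blank.

I was just wondering if there is a more Pythonic way of doing it?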

10 answers:

Answer 0: (score: 55)

I would stack generator expressions:

with open(filename) as f_in:
    lines = (line.rstrip() for line in f_in) # All lines including the blank ones
    lines = (line for line in lines if line) # Non-blank lines

Now, lines is all of the non-blank lines. This saves you from having to call strip on each line twice. If you want a list of lines, you can just do:

with open(filename) as f_in:
    lines = (line.rstrip() for line in f_in) 
    lines = list(line for line in lines if line) # Non-blank lines in a list
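One point worth keeping in mind with the stacked generator expressions: they are lazy, so nothing is read until lines is consumed, and that has to happen while the file is still open. A minimal sketch, assuming io.StringIO as a stand-in for the real file (the sample text and the name result are illustrative, not part of the original answer):

```python
import io

# Stand-in for open(filename); sample data modelled on the question.
sample = "Allman\nAtkinson\n\nBehlendorf \n   \n"

with io.StringIO(sample) as f_in:
    lines = (line.rstrip() for line in f_in)   # strip trailing whitespace and newline
    lines = (line for line in lines if line)   # drop lines that are now empty
    result = list(lines)                       # consume while f_in is still open

print(result)  # ['Allman', 'Atkinson', 'Behlendorf']
```

Moving the list() call outside the with block would try to read from a closed file and raise ValueError.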

You can also do it in a one-liner (not counting the with statement), but it's no more efficient and it's harder to read:

with open(filename) as f_in:
    lines = list(line for line in (l.strip() for l in f_in) if line)

Update

I agree that this is ugly because of the repetition of tokens. You could just write a generator if you prefer:

def nonblank_lines(f):
    for l in f:
        line = l.rstrip()
        if line:
            yield line

Then call it like:

with open(filename) as f_in:
    for line in nonblank_lines(f_in):
        pass  # Stuff: handle each non-blank line here
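As a quick check of what the generator yields, here is a small sketch run against an in-memory file. The io.StringIO object and the sample names are assumptions made for illustration. Note that rstrip() only removes trailing whitespace, so leading indentation is preserved; use strip() instead if that is not wanted.

```python
import io

def nonblank_lines(f):
    """Yield each line of f without trailing whitespace, skipping blank ones."""
    for l in f:
        line = l.rstrip()
        if line:
            yield line

# Stand-in for an open file: includes an empty line and a whitespace-only line.
sample = io.StringIO("Allman\n\n  Atkinson  \n \t \nBehlendorf\n")
names = list(nonblank_lines(sample))

print(names)  # ['Allman', '  Atkinson', 'Behlendorf']
```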

Update 2:

with open(filename) as f_in:
    lines = filter(None, (line.rstrip() for line in f_in))

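and on CPython (with its deterministic reference counting, which closes the file as soon as the last reference to it goes away):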

lines = filter(None, (line.rstrip() for line in open(filename)))

In Python 2, use itertools.ifilter if you want a generator; in Python 3, filter already returns a lazy iterator, so just pass the whole thing to list if you want a list.
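The Python 3 behaviour can be sketched as follows. This is an illustration only: io.StringIO stands in for the open file, and the names stripped, lazy and lines are made up for the example.

```python
import io

f_in = io.StringIO("Allman\n\nAtkinson\n")  # stand-in for an open file

stripped = (line.rstrip() for line in f_in)
lazy = filter(None, stripped)   # Python 3: a lazy filter object, nothing read yet
lines = list(lazy)              # materialise the non-blank lines into a list

print(lines)  # ['Allman', 'Atkinson']
```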

Answer 1: (score: 17)

You could use a list comprehension:

with open("names", "r") as f:
    names_list = [line.strip() for line in f if line.strip()]

Updated: Removed the unnecessary readlines().

To avoid calling line.strip() twice, you can use a generator:

names_list = [l for l in (line.strip() for line in f) if l]

Answer 2: (score: 7)

If you want, you can just put what you had in a list comprehension:

names_list = [line for line in open("names.txt", "r").read().splitlines() if line]

or, split into two statements:

all_lines = open("names.txt", "r").read().splitlines()
names_list = [name for name in all_lines if name]

splitlines() has already removed the line endings.
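One caveat, sketched below with made-up sample text: because splitlines() removes only the line endings, a plain truthiness test keeps lines that consist solely of spaces, which the question wants to skip. Testing the stripped line handles that case. The variable names here are illustrative.

```python
text = "Allman\nAtkinson\n\n   \nBehlendorf \n"

# Truthiness test only: the whitespace-only line "   " survives.
kept_empty_check = [line for line in text.splitlines() if line]

# Testing the stripped line drops it, and stripping the result tidies "Behlendorf ".
kept_strip_check = [line.strip() for line in text.splitlines() if line.strip()]

print(kept_empty_check)  # ['Allman', 'Atkinson', '   ', 'Behlendorf ']
print(kept_strip_check)  # ['Allman', 'Atkinson', 'Behlendorf']
```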

I don't think those are as clear as just looping explicitly, though:

names_list = []
with open('names.txt', 'r') as f:
    for line in f:
        line = line.strip()
        if line:
            names_list.append(line)

Edit:

Although, filter looks quite readable and concise:

names_list = filter(None, open("names.txt", "r").read().splitlines())

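Answer 3: (score: 3)

When text has to be processed in order to extract data from it, I always think of regexes first, because:

  • as far as I know, regexes were invented for exactly that

  • iterating over lines seems clumsy to me: it essentially consists of searching for the newlines and then searching for the data to extract in each line; that makes two searches instead of a single direct one with a regex

  • bringing a regex into play is easy; only writing the regex string to be compiled into a regex object is sometimes hard, but in that case the treatment by iterating over lines would be complicated too

For the problem discussed here, a regex solution is fast and easy to write: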

import re
names = re.findall(r'\S+',open(filename).read())
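A caveat for anyone adapting this: \S+ matches runs of non-whitespace characters, so it only equals "one name per line" when names contain no spaces. The sketch below contrasts it with a line-aware pattern. The second pattern and the sample text are my own illustration, not part of the original answer.

```python
import re

text = "Allman\nAtkinson\n\n  Guido van Rossum  \n   \nBehlendorf \n"

# \S+ matches runs of non-whitespace, so a multi-word name is split up.
tokens = re.findall(r'\S+', text)

# Line-aware alternative: capture from the first to the last non-blank
# character of every line that contains at least one.
names = re.findall(r'^[ \t]*(\S.*?)[ \t]*$', text, re.MULTILINE)

print(tokens)  # ['Allman', 'Atkinson', 'Guido', 'van', 'Rossum', 'Behlendorf']
print(names)   # ['Allman', 'Atkinson', 'Guido van Rossum', 'Behlendorf']
```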

I compared the speeds of several solutions:

import re
from time import clock

A,AA,B1,B2,BS,reg = [],[],[],[],[],[]
D,Dsh,C1,C2 = [],[],[],[]
F1,F2,F3  = [],[],[]

def nonblank_lines(f):
    for l in f:
        line = l.rstrip()
        if line:  yield line

def short_nonblank_lines(f):
    for l in f:
        line = l[0:-1]
        if line:  yield line

for essays in xrange(50):

    te = clock()
    with open('raa.txt') as f:
        names_listA = [line.strip() for line in f if line.strip()] # Felix Kling
    A.append(clock()-te)

    te = clock()
    with open('raa.txt') as f:
        names_listAA = [line[0:-1] for line in f if line[0:-1]] # Felix Kling with line[0:-1]
    AA.append(clock()-te)

    #-------------------------------------------------------
    te = clock()
    with open('raa.txt') as f_in:
        namesB1 = [ name for name in (l.strip() for l in f_in) if name ] # aaronasterling without list()
    B1.append(clock()-te)

    te = clock()
    with open('raa.txt') as f_in:
        namesB2 = [ name for name in (l[0:-1] for l in f_in) if name ] # aaronasterling without list() and with line[0:-1]
    B2.append(clock()-te)

    te = clock()
    with open('raa.txt') as f_in:
        namesBS = [ name for name in f_in.read().splitlines() if name ] # a list comprehension with read().splitlines()
    BS.append(clock()-te)

    #-------------------------------------------------------
    te = clock()
    with open('raa.txt') as f:
        xreg = re.findall(r'\S+',f.read()) #  eyquem
    reg.append(clock()-te)

    #-------------------------------------------------------
    te = clock()
    with open('raa.txt') as f_in:
        linesC1 = list(line for line in (l.strip() for l in f_in) if line) # aaronasterling
    C1.append(clock()-te)

    te = clock()
    with open('raa.txt') as f_in:
        linesC2 = list(line for line in (l[0:-1] for l in f_in) if line) # aaronasterling  with line[0:-1]
    C2.append(clock()-te)

    #-------------------------------------------------------
    te = clock()
    with open('raa.txt') as f_in:
        yD = [ line for line in nonblank_lines(f_in)  ] # aaronasterling  update
    D.append(clock()-te)

    te = clock()
    with open('raa.txt') as f_in:
        yDsh = [ name for name in short_nonblank_lines(f_in)  ] # nonblank_lines with line[0:-1]
    Dsh.append(clock()-te)

    #-------------------------------------------------------
    te = clock()
    with open('raa.txt') as f_in:
        linesF1 = filter(None, (line.rstrip() for line in f_in)) # aaronasterling update 2
    F1.append(clock()-te)

    te = clock()
    with open('raa.txt') as f_in:
        linesF2 = filter(None, (line[0:-1] for line in f_in)) # aaronasterling update 2 with line[0:-1]
    F2.append(clock()-te)

    te = clock()
    with open('raa.txt') as f_in:
        linesF3 =  filter(None, f_in.read().splitlines()) # aaronasterling update 2 with read().splitlines()
    F3.append(clock()-te)


print 'names_listA == names_listAA==namesB1==namesB2==namesBS==xreg\n  is ',\
       names_listA == names_listAA==namesB1==namesB2==namesBS==xreg
print 'names_listA == yD==yDsh==linesC1==linesC2==linesF1==linesF2==linesF3\n  is ',\
       names_listA == yD==yDsh==linesC1==linesC2==linesF1==linesF2==linesF3,'\n\n\n'


def displ((fr,it,what)):  print fr + str( min(it) )[0:7] + '   ' + what

map(displ,(('* ', A,    '[line.strip() for line in f if line.strip()]               * Felix Kling\n'),

           ('  ', B1,   '    [name for name in (l.strip() for l in f_in) if name ]    aaronasterling without list()'),
           ('* ', C1,   'list(line for line in (l.strip() for l in f_in) if line)   * aaronasterling\n'),          

           ('* ', reg,  're.findall("\S+",f.read())                                 * eyquem\n'),

           ('* ', D,    '[ line for line in       nonblank_lines(f_in)  ]           * aaronasterling  update'),
           ('  ', Dsh,  '[ line for line in short_nonblank_lines(f_in)  ]             nonblank_lines with line[0:-1]\n'),

           ('* ', F1 ,  'filter(None, (line.rstrip() for line in f_in))             * aaronasterling update 2\n'),

           ('  ', B2,   '    [name for name in (l[0:-1]   for l in f_in) if name ]    aaronasterling without list() and with line[0:-1]'),
           ('  ', C2,   'list(line for line in (l[0:-1]   for l in f_in) if line)     aaronasterling  with line[0:-1]\n'),

           ('  ', AA,   '[line[0:-1] for line in f if line[0:-1]  ]                   Felix Kling with line[0:-1]\n'),

           ('  ', BS,   '[name for name in f_in.read().splitlines() if name ]        a list comprehension with read().splitlines()\n'),

           ('  ', F2 ,  'filter(None, (line[0:-1] for line in f_in))                  aaronasterling update 2 with line[0:-1]'),

           ('  ', F3 ,  'filter(None, f_in.read().splitlines()                        aaronasterling update 2 with read().splitlines()'))
    )

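The regex solution is straightforward and neat, although it isn't among the fastest. aaronasterling's solution with filter() is surprisingly fast to me (I wasn't aware of how fast this particular use of filter() is), and the times of the optimized solutions drop to as little as 27% of the longest time. I wonder what makes the filter-splitlines combination so miraculous: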

names_listA == names_listAA==namesB1==namesB2==namesBS==xreg
  is  True
names_listA == yD==yDsh==linesC1==linesC2==linesF1==linesF2==linesF3
  is  True 



* 0.08266   [line.strip() for line in f if line.strip()]               * Felix Kling

  0.07535       [name for name in (l.strip() for l in f_in) if name ]    aaronasterling without list()
* 0.06912   list(line for line in (l.strip() for l in f_in) if line)   * aaronasterling

* 0.06612   re.findall("\S+",f.read())                                 * eyquem

* 0.06486   [ line for line in       nonblank_lines(f_in)  ]           * aaronasterling  update
  0.05264   [ line for line in short_nonblank_lines(f_in)  ]             nonblank_lines with line[0:-1]

* 0.05451   filter(None, (line.rstrip() for line in f_in))             * aaronasterling update 2

  0.04689       [name for name in (l[0:-1]   for l in f_in) if name ]    aaronasterling without list() and with line[0:-1]
  0.04582   list(line for line in (l[0:-1]   for l in f_in) if line)     aaronasterling  with line[0:-1]

  0.04171   [line[0:-1] for line in f if line[0:-1]  ]                   Felix Kling with line[0:-1]

  0.03265   [name for name in f_in.read().splitlines() if name ]        a list comprehension with read().splitlines()

  0.03638   filter(None, (line[0:-1] for line in f_in))                  aaronasterling update 2 with line[0:-1]
  0.02198   filter(None, f_in.read().splitlines()                        aaronasterling update 2 with read().splitlines()

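But this problem is a special one, the simplest of all: only one name per line. So the solutions are merely games with lines, splitting and [0:-1] slicing.

A regex, on the contrary, doesn't care about lines; it finds the desired data directly. I consider it a more natural way of solving the problem, one that applies from the simplest cases to the more complex ones, and hence it is often the approach to prefer in text processing.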
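The timing script above targets Python 2 (xrange, the print statement, time.clock). For readers on Python 3, where time.clock no longer exists, a comparable measurement could be sketched with the standard timeit module. This is only an illustration: it compares two of the variants on a hypothetical in-memory string rather than on the original raa.txt file, so its numbers are not comparable with the figures above.

```python
import timeit

# Hypothetical in-memory data: short groups of names separated by blank lines.
text = "SMITH\nJONES\nWILLIAMS\n\n" * 500

def with_comprehension():
    # Strip every line and keep the ones that are not empty afterwards.
    return [line.strip() for line in text.splitlines() if line.strip()]

def with_filter():
    # filter(None, ...) keeps only truthy (non-empty) strings.
    return list(filter(None, text.splitlines()))

for func in (with_comprehension, with_filter):
    # Best of 5 repeats of 20 calls each, mirroring the min() used above.
    best = min(timeit.repeat(func, number=20, repeat=5))
    print(f"{best:.5f}  {func.__name__}")
```

Taking the minimum over several repeats follows the same idea as the original script, which reports min() of the 50 measurements per variant.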

EDIT

I forgot to say that I use Python 2.7, and that I measured the above times with a file containing the following block repeated 500 times:

SMITH
JONES
WILLIAMS
TAYLOR
BROWN
DAVIES
EVANS
WILSON
THOMAS
JOHNSON

ROBERTS
ROBINSON
THOMPSON
WRIGHT
WALKER
WHITE
EDWARDS
HUGHES
GREEN
HALL

LEWIS
HARRIS
CLARKE
PATEL
JACKSON
WOOD
TURNER
MARTIN
COOPER
HILL

WARD
MORRIS
MOORE
CLARK
LEE
KING
BAKER
HARRISON
MORGAN
ALLEN

JAMES
SCOTT
PHILLIPS
WATSON
DAVIS
PARKER
PRICE
BENNETT
YOUNG
GRIFFITHS

MITCHELL
KELLY
COOK
CARTER
RICHARDSON
BAILEY
COLLINS
BELL
SHAW
MURPHY

MILLER
COX
RICHARDS
KHAN
MARSHALL
ANDERSON
SIMPSON
ELLIS
ADAMS
SINGH

BEGUM
WILKINSON
FOSTER
CHAPMAN
POWELL
WEBB
ROGERS
GRAY
MASON
ALI

HUNT
HUSSAIN
CAMPBELL
MATTHEWS
OWEN
PALMER
HOLMES
MILLS
BARNES
KNIGHT

LLOYD
BUTLER
RUSSELL
BARKER
FISHER
STEVENS
JENKINS
MURRAY
DIXON
HARVEY

Answer 4: (score: 3)

You can use not:

for line in lines:
    if not line:
        continue
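Note that lines read straight from a file still end in "\n", so a blank line is the non-empty string "\n" and not line alone will not skip it. A minimal sketch that strips first, using io.StringIO as an assumed stand-in for the file and kept as an illustrative name:

```python
import io

lines = io.StringIO("Allman\n\n   \nAtkinson\n")  # stand-in for an open file

kept = []
for line in lines:
    line = line.strip()   # "\n" and "   \n" both become ""
    if not line:
        continue          # skip blank and whitespace-only lines
    kept.append(line)

print(kept)  # ['Allman', 'Atkinson']
```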

Answer 5: (score: 0)

@S.Lott

The following code processes the lines one at a time and produces a result without being memory-hungry:

filename = 'english names.txt'

with open(filename) as f_in:
    lines = (line.rstrip() for line in f_in)
    lines = (line for line in lines if line)
    the_strange_sum = 0
    for l in lines:
        the_strange_sum += 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'.find(l[0])

print(the_strange_sum)

So the generator (line.rstrip() for line in f_in) is every bit as acceptable as the nonblank_lines() function.

Answer 6: (score: 0)

What about the LineSentence module? It will ignore such lines:

  

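Bases: object

Simple format: one sentence = one line; words already preprocessed and separated by whitespace.

source can be either a string or a file object. Clip the file to the first limit lines (or not clipped if limit is None, the default).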


Answer 7: (score: 0)

I guess there is a simple solution, which I used recently after going through so many of the answers here.

with open(file_name) as f_in:   
    for line in f_in:
        if len(line.split()) == 0:
            continue

This does the same job, ignoring all blank lines.

Answer 8: (score: 0)

Why are you all doing it the hard way?

with open("myfile") as myfile:
    nonempty = filter(str.rstrip, myfile)

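You can convert nonempty into a list if you need to, although I strongly suggest keeping nonempty a generator, as it is in Python 3.x.

In Python 2.x you can use itertools.ifilter to do your bidding instead.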
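Two details are easy to miss here, sketched below under the assumption of Python 3 and with io.StringIO standing in for the file. First, the filter object is lazy, so it has to be consumed while the file is still open. Second, str.rstrip is used only as a test, so the lines that are kept still carry their trailing newline. The names kept and names are illustrative.

```python
import io

# Stand-in for open("myfile").
with io.StringIO("Allman\n\n   \nAtkinson\n") as myfile:
    nonempty = filter(str.rstrip, myfile)
    kept = list(nonempty)        # consume before the file is closed

print(kept)   # ['Allman\n', 'Atkinson\n']

# Strip separately if the newline is not wanted.
names = [line.rstrip() for line in kept]
print(names)  # ['Allman', 'Atkinson']
```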

Answer 9: (score: 0)

You can use the walrus operator in Python >= 3.8:

with open('my_file') as fd:
    nonblank = [stripped for line in fd if (stripped := line.strip())]

Think of it as 'blablabla if stripped (defined as line.strip()) is truthy'.
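To make that reading concrete, here is a small sketch that runs the walrus version next to an equivalent explicit loop. It assumes Python 3.8 or later and uses io.StringIO with made-up sample data in place of the file; the name nonblank_loop is illustrative.

```python
import io

sample = "Allman\n  Atkinson \n\n \nBehlendorf\n"

# Walrus version: strip once, bind the result, and test it in the same expression.
with io.StringIO(sample) as fd:
    nonblank = [stripped for line in fd if (stripped := line.strip())]

# Equivalent explicit loop.
nonblank_loop = []
with io.StringIO(sample) as fd:
    for line in fd:
        stripped = line.strip()
        if stripped:
            nonblank_loop.append(stripped)

print(nonblank)  # ['Allman', 'Atkinson', 'Behlendorf']
```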