
Question

My file looks like this:

#     BJD     K2SC-Flux EAPFlux   Err  Flag Spline
2457217.463564 5848.004 5846.670 6.764 0 0.998291
2457217.483996 6195.018 6193.685 6.781 1 0.998291
2457217.504428 6396.612 6395.278 6.790 0 0.998292
2457217.524861 6220.890 6219.556 6.782 0 0.998292
2457217.545293 5891.856 5890.523 6.766 1 0.998292
2457217.565725 5581.000 5579.667 6.749 1 0.998292
2457217.586158 5230.566 5229.232 6.733 1 0.998292
2457217.606590 4901.128 4899.795 6.718 0 0.998293
2457217.627023 4604.127 4602.793 6.700 0 0.998293

I need to find and count the lines where Flag (the 5th column) is 1. This is how I do it:

foundlines=[]
c=0
import re
with open('examplefile') as f:
    for index, line in enumerate(f):
        try:
            found = re.findall(r' 1 ', line)[0]
            foundlines.append(index)
            print(line)
            c+=1
        except:
            pass
print(c)

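In the shell I would just run grep " 1 " examplefile | wc -l, which is much shorter than the Python script above. The Python script works, but I would like to know whether there is a shorter, more compact way to do the same task. I prefer the brevity of the shell, so I am looking for something similar in Python.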

Answer 1

You have CSV data, so you can use the csv module:

import csv

with open('your file', 'r', newline='', encoding='utf8') as fp:
    rows = csv.reader(fp, delimiter=' ')

    # generator expression; it is lazy, so it must be consumed while the file is still open
    errors = (row for row in rows if row[4] == '1')

    for error in errors:
        print(error)
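
To get the count the question asks for, the same reader can feed sum directly. The following is only a minimal sketch and not part of the original answer: count_flagged and SAMPLE are hypothetical names, and io.StringIO stands in for the real file so that the snippet runs on its own.

```python
import csv
import io

# Hypothetical sample standing in for 'examplefile' (header plus three rows from the question).
SAMPLE = """\
#     BJD     K2SC-Flux EAPFlux   Err  Flag Spline
2457217.463564 5848.004 5846.670 6.764 0 0.998291
2457217.483996 6195.018 6193.685 6.781 1 0.998291
2457217.545293 5891.856 5890.523 6.766 1 0.998292
"""


def count_flagged(fp, column=4, flag="1"):
    """Count rows whose field at index `column` equals `flag`."""
    rows = csv.reader(fp, delimiter=" ")
    # The length check skips blank or short lines instead of raising IndexError.
    return sum(1 for row in rows if len(row) > column and row[column] == flag)


print(count_flagged(io.StringIO(SAMPLE)))  # prints 2
```

With a real file, the call would presumably look like count_flagged(open('examplefile', newline='')), ideally inside a with block.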

Answer 2

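Your shell implementation can be made even shorter: grep has a -c option that does the counting for you, so there is no need for the anonymous pipe and wc: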

grep -c " 1 " examplefile

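Your shell code only gets you the number of lines where the pattern " 1 " is found, whereas your Python code additionally keeps a list of the indexes of the matching lines.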

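To get only the line count, you can use sum with a generator expression or list comprehension. No regex is needed either; a simple string __contains__ check would do, because the in operator on strings performs a substring test: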

with open('examplefile') as f:
    count = sum(1 for line in f if ' 1 ' in line)
    print(count)

If you want to keep the indexes as well, you can stick to your approach and simply replace the re test with a str test:

count = 0
indexes = []
with open('examplefile') as f:
    for idx, line in enumerate(f):
        if ' 1 ' in line:
            count += 1
            indexes.append(idx)
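
If compactness is the goal, the loop above could also be collapsed into a single list comprehension, with the count taken from the length of the resulting list. This is a sketch under the assumption that keeping all indexes in memory is acceptable; the in-memory sample is hypothetical and stands in for examplefile.

```python
import io

# Hypothetical in-memory stand-in for 'examplefile'.
sample = io.StringIO(
    "2457217.463564 5848.004 5846.670 6.764 0 0.998291\n"
    "2457217.483996 6195.018 6193.685 6.781 1 0.998291\n"
    "2457217.504428 6396.612 6395.278 6.790 0 0.998292\n"
    "2457217.545293 5891.856 5890.523 6.766 1 0.998292\n"
)

# One pass over the lines collects the zero-based indexes of the matching lines.
indexes = [idx for idx, line in enumerate(sample) if " 1 " in line]
count = len(indexes)  # the count is simply the length of the index list

print(count, indexes)  # prints: 2 [1, 3]
```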

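Also, a bare except is almost always a bad idea (at the very least use except Exception, so that SystemExit and KeyboardInterrupt are not swallowed along with ordinary errors); catch only the exceptions you expect to be raised.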
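
As an illustration of that advice: in the question's loop, the only expected failure is the IndexError raised by [0] when findall returns an empty list. The sketch below uses a hypothetical helper, first_match, to show the narrower handler, and then checks the exception hierarchy mentioned above.

```python
import re


def first_match(pattern, line):
    """Return the first match of `pattern` in `line`, or None if there is none."""
    try:
        return re.findall(pattern, line)[0]
    except IndexError:
        # Only the "no match" case is handled; any other error still propagates.
        return None


print(repr(first_match(r" 1 ", "6195.018 6193.685 6.781 1 0.998291")))  # ' 1 '
print(repr(first_match(r" 1 ", "5848.004 5846.670 6.764 0 0.998291")))  # None

# KeyboardInterrupt and SystemExit do not derive from Exception, so
# `except Exception` lets them through, while a bare `except` swallows them.
print(issubclass(KeyboardInterrupt, Exception))  # False
print(issubclass(SystemExit, Exception))         # False
```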

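Moreover, when parsing structured data you should use a dedicated tool, for example csv.reader with a space as the delimiter (line.split(' ') would also do in this case), and checking index 4 is the safest option (see Tomalak's answer). With the ' 1 ' in line test, you get misleading results if any other column has the value 1.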
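
The difference between the two tests can be seen on a single made-up line. The line below is hypothetical (its second column is exactly 1 while Flag is 0), and the sketch uses str.split() without arguments, which splits on any run of whitespace, rather than split(' ').

```python
# Hypothetical line: the second column is exactly 1, but Flag (fifth column) is 0.
line = "2457217.463564 1 5846.670 6.764 0 0.998291\n"

substring_hit = " 1 " in line   # True: the substring test matches the second column
fields = line.split()           # split on any run of whitespace
field_hit = fields[4] == "1"    # False: the Flag field itself is "0"

print(substring_hit, field_hit)  # prints: True False
```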

With the above in mind, here is the shell way, using awk to match against the 5th field:

awk '$5 == "1" {count+=1}; END{print count}' examplefile

Answer 3

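Shortest code

This is a very short version that works under some specific preconditions:

You only want to count the occurrences, like your grep invocation does
There is guaranteed to be only one " 1 " per line
" 1 " can only occur in the desired column
Your file easily fits into memory

Note that if these preconditions are not met, this may cause memory problems or return false positives.

print(open("examplefile").read().count(" 1 "))

Simple and versatile, slightly longer

Of course, if you are interested in actually doing something with these lines later on, I recommend Pandas:

import pandas

df = pandas.read_table('test.txt', delimiter=" ",
                       comment="#",
                       names=['BJD', 'K2SC-Flux', 'EAPFlux', 'Err', 'Flag', 'Spline'])

To get all the rows where Flag is 1:

flaggedrows = df[df.Flag == 1]

This returns:

            BJD  K2SC-Flux   EAPFlux    Err  Flag    Spline
1  2.457217e+06   6195.018  6193.685  6.781     1  0.998291
4  2.457218e+06   5891.856  5890.523  6.766     1  0.998292
5  2.457218e+06   5581.000  5579.667  6.749     1  0.998292
6  2.457218e+06   5230.566  5229.232  6.733     1  0.998292

To count them:

print(len(flaggedrows))

This returns 4.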

Is a regex implementation better than looping over the whole file?
