Question

I get several "grep: write error" messages when I run this code. What am I missing?

This is only part of it:

    while d <= datetime.datetime(year, month, daysInMonth[month]):
        day = d.strftime("%Y%m%d")
        print day
        results = [day]
        first=subprocess.Popen("grep -Eliw 'Algeria|Bahrain' "+ monthDir +"/"+day+"*.txt | grep -Eliw 'Protest|protesters' "+ monthDir +"/"+day+"*.txt", shell=True, stdout=subprocess.PIPE, )
        output1=first.communicate()[0]
        d += delta
        day = d.strftime("%Y%m%d")
        second=subprocess.Popen("grep -Eliw 'Algeria|Bahrain' "+ monthDir +"/"+day+"*.txt | grep -Eliw 'Protest|protesters' "+ monthDir +"/"+day+"*.txt", shell=True,  stdout=subprocess.PIPE, )
        output2=second.communicate()[0]
        articleList = (output1.split('\n'))
        articleList2 = (output2.split('\n'))
        results.append( len(articleList)+len(articleList2))
        w.writerow(tuple(results))
        d += delta

Answer 1

When you run

A | B

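in the shell, the output of process A is piped to process B as its input. If process B exits before it has read all of A's output (for example because it has already found what it was looking for, which is what the -l option does), process A may complain that its output pipe was closed prematurely.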

These errors are basically harmless, and you can work around them by redirecting the subprocess's stderr to /dev/null.
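The redirection could look roughly like this. It is only a sketch, assuming Python 3.3 or later (where `subprocess.DEVNULL` exists); the helper name `run_quiet` is made up for illustration and is not part of the question's code.

```python
import subprocess


def run_quiet(cmd):
    """Run a shell command, discard its stderr, and return its stdout lines."""
    proc = subprocess.Popen(
        cmd,
        shell=True,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,  # drops messages such as "grep: write error"
        universal_newlines=True,    # return str instead of bytes
    )
    out, _ = proc.communicate()
    # splitlines() does not produce the trailing empty string that
    # split('\n') leaves behind when the output ends with a newline.
    return out.splitlines()
```

Using `splitlines()` also matters for the counting in the question: `output1.split('\n')` yields one extra empty element when the output ends with a newline, so `len(articleList)` comes out one too high.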

A better approach may be to read the files yourself and use Python's powerful regular expression support:

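import glob
import re

def fileContains(fn, pat):
    # Return True as soon as any line of the file matches the pattern.
    with open(fn) as f:
        for line in f:
            if re.search(pat, line):
                return True
    return False

# monthDir and day come from the loop in the question.
first = []
for path in glob.glob(monthDir +"/"+day+"*.txt"):
    if fileContains(path, 'Algeria|Bahrain') and fileContains(path, 'Protest|protesters'):
        first.append(path)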
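Note that a plain `re.search` is case-sensitive and matches substrings, whereas the question's grep uses -i and -w. A variant that tries to mirror those flags might look like the following sketch, written for Python 3. The function names and the `directory`/`prefix` parameters are hypothetical, and `\b` only approximates grep's whole-word rule.

```python
import glob
import os
import re


def file_matches_all(path, patterns):
    """Return True if every pattern matches some line of the file at path."""
    # \b...\b approximates grep -w; re.IGNORECASE approximates grep -i.
    pending = [re.compile(r"\b(?:%s)\b" % p, re.IGNORECASE) for p in patterns]
    with open(path) as f:
        for line in f:
            pending = [rx for rx in pending if not rx.search(line)]
            if not pending:
                return True  # every pattern has been seen; stop reading
    return not pending


def matching_files(directory, prefix, patterns):
    """List the .txt files starting with prefix that contain all patterns."""
    candidates = sorted(glob.glob(os.path.join(directory, prefix + "*.txt")))
    return [p for p in candidates if file_matches_all(p, patterns)]
```

With this shape each file is read only once, and something like `len(matching_files(monthDir, day, ['Algeria|Bahrain', 'Protest|protesters']))` would give the per-day article count directly.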

Answer 2

To find the files that match both patterns, the command should be structured like this:

grep -l pattern1 $(grep -l pattern2 files)

$(command) substitutes the output of command into the command line.

So your script should be:

first=subprocess.Popen("grep -Eliw 'Algeria|Bahrain' $(grep -Eliw 'Protest|protesters' "+ monthDir +"/"+day+"*.txt)", shell=True, stdout=subprocess.PIPE, )

The same applies to second.
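The same two-step filter can also be done without the shell, which sidesteps two pitfalls of `$(...)`: file names containing whitespace are split into several arguments, and if the inner grep matches nothing the outer grep is left with no file arguments and waits on standard input. The following is a rough sketch for Python 3, assuming a grep that supports -w is on the PATH; `grep_files` and `files_matching_both` are hypothetical helper names.

```python
import glob
import subprocess


def grep_files(pattern, files):
    """Return the subset of files that grep -Eliw reports as matching pattern."""
    if not files:
        return []  # with no file arguments grep would read stdin instead
    proc = subprocess.Popen(
        ["grep", "-Eliw", "-e", pattern] + list(files),
        stdout=subprocess.PIPE,
        universal_newlines=True,  # return str instead of bytes
    )
    out, _ = proc.communicate()
    return out.splitlines()


def files_matching_both(pattern1, pattern2, file_glob):
    """Apply the second pattern first, then filter the survivors by the first."""
    return grep_files(pattern1, grep_files(pattern2, glob.glob(file_glob)))
```

For the question's loop this would be called as `files_matching_both('Algeria|Bahrain', 'Protest|protesters', monthDir + "/" + day + "*.txt")`. No pipe is involved, so there is nothing for grep to report a write error about. In the original command the second grep was given file arguments, so it never read the piped input at all.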

Answer 3

If you are only looking for whole words, you can use the count() member function:

# assuming names is a list of filenames
for fn in names:
    with open(fn) as infile:
        text = infile.read().lower()
    # remove punctuation (commas and periods only)
    text = text.replace(',', '')
    text = text.replace('.', '')
    words = text.split()
    print "Algeria:", words.count('algeria')
    print "Bahrain:", words.count('bahrain')
    print "protesters:", words.count('protesters')
    print "protest:", words.count('protest')

If you want more powerful filtering, use re.
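As one possible illustration of the re route, the sketch below counts whole words while treating any non-letter as a separator, so it is not limited to stripping commas and periods. It is written for Python 3; `count_words` is a hypothetical helper, and the `[a-z]+` tokenizer is an assumption that only suits plain English text.

```python
import re
from collections import Counter


def count_words(text, targets):
    """Count case-insensitive whole-word occurrences of each target in text."""
    # Treat every run of letters as a word, so any punctuation is a separator.
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return {word: counts[word.lower()] for word in targets}
```

Calling `count_words(infile.read(), ["Algeria", "Bahrain", "protesters", "protest"])` inside the loop above would return the four counts as a dictionary instead of printing them.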

grep: write error: Broken pipe when using subprocess
