Question

Newbie question

I have 2 files.
File A: a file with a list of items (apple, pear, orange).
File B: a file with all the fruit in the world (1,000,000 lines).

In Unix, I would grep apple from file B and return all the results.

In Unix I would:
1. grep apple from file b >> fruitfound.txt
2. grep pear from file b >> fruitfound.txt
3. grep oranges from file b >> fruitfound.txt

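I want a Python script that uses the values from file a to search file b and then writes out the output. Note: file B would have green apple, red apple, yellow apple, and I would want all 3 results written to fruitfound.txt.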

Kindest regards

Kornity

Answer 1

grep -f $patterns $filename does exactly that. There is no need for a Python script.

Answer 2

To find lines that contain any of the given keywords in Python, you can use a regular expression:

import re

def fgrep(words, lines):
    # note: allow a partial match e.g., 'b c' matches 'ab cd'
    # the built-in filter() is lazy in Python 3; Python 2 needed itertools.ifilter
    return filter(re.compile("|".join(map(re.escape, words))).search, lines)
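
As a quick check of the function above, here is a small sketch that runs it on an in-memory list instead of a file. The sample lines are made up for illustration, and the sketch assumes Python 3, where the built-in `filter` is lazy:

```python
import re

def fgrep(words, lines):
    # Escape each keyword so it is matched literally, then OR them together.
    pattern = re.compile("|".join(map(re.escape, words)))
    return filter(pattern.search, lines)

# Made-up sample data standing in for file B.
fruits = ["green apple\n", "banana\n", "red apple\n", "yellow apple\n", "pear\n"]
matches = list(fgrep(["apple", "pear", "orange"], fruits))
print("".join(matches), end="")
```

Every line that contains a keyword anywhere is kept, so all three apple varieties come through, which is the behaviour the question asks for.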

To turn it into a command-line script:

import sys

def main():
    with open(sys.argv[1]) as kwfile: # read keywords from given file
        # one keyword per line
        keywords = [line.strip() for line in kwfile if line.strip()]

    if not keywords:
        sys.exit("no keywords are given")

    if len(sys.argv) > 2: # read lines to match from given file
        with open(sys.argv[2]) as file:
            sys.stdout.writelines(fgrep(keywords, file))
    else: # read lines from stdin
        sys.stdout.writelines(fgrep(keywords, sys.stdin))

main()

Example:

$ python fgrep.py a b > fruitfound.txt
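
If you would rather write fruitfound.txt from Python than rely on shell redirection, something along these lines might work. It is only a sketch: `grep_to_file` and its parameter names are hypothetical, and it overwrites the output file like `>` does (use mode "a" to append like `>>`):

```python
import re

def grep_to_file(keywords_path, data_path, out_path):
    # Read one keyword per line, skipping blank lines.
    with open(keywords_path) as kwfile:
        keywords = [line.strip() for line in kwfile if line.strip()]
    if not keywords:
        raise ValueError("no keywords are given")

    # Match any keyword literally, anywhere in the line.
    search = re.compile("|".join(map(re.escape, keywords))).search

    # Stream the data file line by line so it never has to fit in memory.
    with open(data_path) as data, open(out_path, "w") as out:
        out.writelines(line for line in data if search(line))
```

It would be called as `grep_to_file("a", "b", "fruitfound.txt")`, using the same file names as the example above.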

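There are more efficient algorithms, e.g. the Aho-Corasick algorithm, but filtering millions of lines takes less than a second on my machine, which may be good enough (grep is several times faster). Surprisingly, acora, which is based on the Aho-Corasick algorithm, was slower for the data I tried.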
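
For the curious, here is a minimal pure-Python sketch of the Aho-Corasick idea mentioned above. It is not how acora is implemented, the names `build_automaton`, `contains_any` and `fgrep_ac` are made up for illustration, and in pure Python it will likely be slower than the regex version; it only shows how the keyword automaton works:

```python
from collections import deque

def build_automaton(words):
    # Trie stored as parallel lists: goto[n] maps a character to a child node,
    # fail[n] is the failure link, out[n] is True if a keyword ends at n
    # (directly or via a chain of failure links).
    goto, fail, out = [{}], [0], [False]
    for word in words:
        node = 0
        for ch in word:
            nxt = goto[node].get(ch)
            if nxt is None:
                nxt = len(goto)
                goto[node][ch] = nxt
                goto.append({})
                fail.append(0)
                out.append(False)
            node = nxt
        out[node] = True

    # Breadth-first pass: depth-1 nodes fail to the root; deeper nodes follow
    # the parent's failure links until one has a transition on the same char.
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            out[child] = out[child] or out[fail[child]]
            queue.append(child)
    return goto, fail, out

def contains_any(automaton, line):
    # Scan the line once, following failure links on a mismatch.
    goto, fail, out = automaton
    node = 0
    for ch in line:
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        if out[node]:
            return True
    return False

def fgrep_ac(words, lines):
    # Same interface as fgrep(): lazily yield the lines containing any keyword.
    automaton = build_automaton(words)
    return (line for line in lines if contains_any(automaton, line))
```

Like the regex version, it allows partial matches, so the keyword pear also matches a line such as spear.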

Python: use values from file a to search for lines in another file
