Question

Text file 1 has the following format:

'WORD': 1
'MULTIPLE WORDS': 1
'WORD': 2

etc.

That is, words followed by a colon and a number.

Text file 2 has the following format:

'WORD'
'WORD'

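etc.

I need to extract the single words from file 1 (i.e., only WORD, not MULTIPLE WORDS) and, if they match a word in file 2, return the word from file 1 together with its value.

I have some poorly functioning code: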

def GetCounts(file1, file2):
    target_contents  = open(file1).readlines()  #file 1 as list--> 'WORD': n
    match_me_contents = open(file2).readlines()   #file 2 as list -> 'WORD'
    ls_stripped = [x.strip('\n') for x in match_me_contents]  #get rid of newlines

    match_me_as_regex= re.compile("|".join(ls_stripped))   

    for line in target_contents:
        first_column = line.split(':')[0]  #get the first item in line.split
        number = line.split(':')[1]   #get the number associated with the word
        if len(first_column.split()) == 1: #get single word, no multiple words 
            """ Does the word from target contents match the word
            from match_me contents?  If so, return the line from  
            target_contents"""
            if re.findall(match_me_as_regex, first_column):  
                print first_column, number

#OUTPUT: WORD, n
         WORD, n
         etc.

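Because of the regex, the output is wrong. For example, the code returns 'asset, 2' because re.findall() matches 'set' from match_me. I need to match target_word against whole words in match_me, to stop partial regex matches from producing false output.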
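A minimal sketch of the problem that runs without any files. The sample words `set` and `word` and the helper names `partial_hit` and `whole_hit` are made up for illustration. An unanchored alternation matches anywhere in the string, so `set` is found inside `asset`; escaping each word and anchoring the whole alternation is one way to require a full match while staying with regular expressions.

```python
import re

words = ["set", "word"]  # hypothetical contents of file 2

# Unanchored alternation, as in the code above: substrings match too.
loose = re.compile("|".join(words))
# Anchored alternation: the whole string has to be one of the words.
strict = re.compile("^(?:" + "|".join(re.escape(w) for w in words) + ")$")

def partial_hit(candidate):
    """True if any listed word occurs anywhere inside candidate."""
    return bool(loose.search(candidate))

def whole_hit(candidate):
    """True only if candidate is exactly one of the listed words."""
    return bool(strict.match(candidate))
```

Here `partial_hit("asset")` is true while `whole_hit("asset")` is false, which is the difference between the current behaviour and the desired one.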

Answer 1

If file2 isn't too big, slurp its words into a set:

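# Strip any surrounding quotes on both sides so that like is compared with like.
file2 = set(line.strip().strip("'") for line in open("file2"))
for line in open("file1"):
    if line.split(":")[0].strip("'") in file2:
        print line,  # the line already ends with a newline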

Answer 2

By "poorly functioning", I assume you mean speed-wise? I tested it and it does work.

You can make it more efficient by building a set of the words in file2:

word_set = set(ls_stripped)

Then, instead of findall, you check whether the word is in the set:

in_set = just_word in word_set

It is also cleaner than the regex.
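Putting those two fragments together might look like the sketch below. The function name `get_counts` and the `io.StringIO` objects standing in for real files are assumptions made for illustration, and it returns the pairs instead of printing them.

```python
import io

def get_counts(target_lines, match_lines):
    """Return (word, number) pairs whose word is listed in match_lines."""
    word_set = set(line.strip() for line in match_lines)
    results = []
    for line in target_lines:
        if ":" not in line:
            continue  # skip blank or malformed lines
        just_word, number = line.rsplit(":", 1)
        just_word = just_word.strip()
        # Exact set membership: 'ASSET' no longer matches 'SET'.
        if len(just_word.split()) == 1 and just_word in word_set:
            results.append((just_word, number.strip()))
    return results

# Hypothetical sample data in the format described in the question.
file1 = io.StringIO(u"'WORD': 1\n'MULTIPLE WORDS': 1\n'ASSET': 2\n")
file2 = io.StringIO(u"'WORD'\n'SET'\n")
pairs = get_counts(file1, file2)
```

Any iterable of lines works as input, including an open file object.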

Answer 3

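This looks like it may just be a special case of grep. If file2 is essentially a list of patterns, and the output format is the same as that of file1, then you can do this: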

grep -wf file2 file1

-w tells grep to match whole words only.
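For anyone who wants similar whole-word behaviour inside Python, the sketch below approximates it with lookarounds: a pattern only counts if it is not directly preceded or followed by a word character (letter, digit, or underscore). The helper name `matches_whole_word` is hypothetical, the pattern is treated as literal text, and this is an approximation of the idea, not a reimplementation of grep.

```python
import re

def matches_whole_word(pattern, line):
    """True if pattern occurs in line with no word character on either side."""
    regex = r"(?<!\w)" + re.escape(pattern) + r"(?!\w)"
    return re.search(regex, line) is not None
```

Under this rule `set` no longer matches inside `asset`, but a multi-word entry that contains a listed word as a whole word still matches, so the single-word requirement would need a separate check.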

Answer 4

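This is how I would do it. I don't have a Python interpreter at hand, so there may be a few typos.

One of the main things to remember when coming to Python (especially from Perl) is that regular expressions are usually a bad idea: the string methods are powerful and very fast.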

def GetCounts(file1, file2):
    data = {}
    for line in open(file1):
        try:
            word, n = line.rsplit(':', 1)
            n = int(n.strip())
        except ValueError: # no colon, or the count is not an integer
            #some kind of input error, go to next line
            continue
        if word[:1] == word[-1:] == "'":
            word = word[1:-1]
        data[word] = n

    for line in open(file2):
        word = line.strip()
        if word[:1] == word[-1:] == "'":  # slices do not fail on an empty line
            word = word[1:-1]
        if word in data:
            print word, data[word]

Answer 5

import re
from operator import methodcaller

re_target = re.compile(r"^'([a-z]+)': +(\d+)", re.M|re.I)
match_me_contents = open(file2).read().splitlines()
match_me_contents = set(map(methodcaller('strip', "'"), match_me_contents))

res = []
for match in re_target.finditer(open(file1).read()):
    word, value = match.groups()
    if word in match_me_contents:
        res.append((word, value))

Answer 6

My two input files:

file1.txt:

'WORD': 1
'MULTIPLE WORDS': 1
'OTHER': 2

file2.txt:

'WORD'
'NONEXISTENT'

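If file2.txt is guaranteed not to contain multiple words on a line, there is no need to explicitly filter them out of the first file. The membership test takes care of that: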

# Build a set of what words we can return a count for.
with open('file2.txt', 'r') as f:
    allowed_words = set(word.strip() for word in f)

# See which of them exist in the first file.
with open('file1.txt', 'r') as f:
    for line in f:
        word, count = line.strip().split(': ')

        # This assumes that strings with a space (multiple words) do not exist in
        # the second file.
        if word in allowed_words:
            print word, count

Running it gives:

$ python extract.py
'WORD' 1

If file2.txt may contain multiple words, just modify the test in the loop:

# Build a set of what words we can return a count for.
with open('file2.txt', 'r') as f:
    allowed_words = set(word.strip() for word in f)

# See which of them exist in the first file.
with open('file1.txt', 'r') as f:
    for line in f:
        word, count = line.strip().split(': ')

        # This prevents multiple words from being selected.
        if word in allowed_words and ' ' not in word:
            print word, count

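Note that I didn't bother stripping the quotes from the words. I'm not sure whether that is necessary; it depends on whether the input is guaranteed to have them. Adding that would be trivial.

Another thing you should consider is case sensitivity. If lowercase and uppercase words should be treated as the same, convert all input to uppercase (or lowercase, it doesn't matter which) before doing any tests.

Edit: It would probably be more efficient to remove multiple words from the set of allowed words, rather than checking every line of file1: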
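Both adjustments could be folded into one small helper applied to the words of both files before the membership test. This is only a sketch: the name `normalize` and the inline sample lines are made up for illustration, and whether quotes should be stripped or case ignored depends on the real input.

```python
def normalize(word):
    """Strip surrounding whitespace and single quotes, then upper-case."""
    return word.strip().strip("'").upper()

# Hypothetical sample lines standing in for file2.txt and file1.txt.
allowed_words = set(normalize(w) for w in ["'Word'\n", "'NONEXISTENT'\n"])

matches = []
for line in ["'WORD': 1\n", "'MULTIPLE WORDS': 1\n", "'other': 2\n"]:
    word, count = line.rsplit(":", 1)
    # Compare the normalized forms so 'Word' and 'WORD' are treated alike.
    if normalize(word) in allowed_words:
        matches.append((normalize(word), count.strip()))
```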

# Build a set of what words we can return a count for.
with open('file2.txt', 'r') as f:
    allowed_words = set(word.strip() for word in f if ' ' not in word)

# See which of them exist in the first file.
with open('file1.txt', 'r') as f:
    for line in f:
        word, count = line.strip().split(': ')

        # Check if the word is allowed.
        if word in allowed_words:
            print word, count

Answer 7

This is what I came up with:

def GetCounts(file1, file2):
    target_contents  = open(file1).readlines()  #file 1 as list--> 'WORD': n
    match_me_contents = set(open(file2).read().split('\n'))   #file 2 as list -> 'WORD'  
    for line in target_contents:
        word = line.split(': ')[0]  #get the first item in line.split
        if " " not in word:
            number = line.split(': ')[1]   #get the number associated with the word
            if word in match_me_contents:  
                print word, number

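Changes from your version:

Moved away from regular expressions
Switched to split instead of readlines, which removes the newlines without extra processing
Went from splitting the word into words and checking the length to simply checking whether a space is in the "word"
- This could cause an error if the "space" is not really a space. That can be fixed with a "\s" regex or equivalent, but there would be a performance penalty.
Added a space to line.split(':') so that the number is not prefixed with a space
- This could cause an error if there is no space before the number.
Moved number = line.split(': ')[1] to after the check for whether the word contains a space, for efficiency, although the speed difference is minimal (most of the time is almost certainly spent checking whether the word is in the target)

The potential errors only occur if the actual input is not in the format you presented.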
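If other whitespace such as tabs is a concern, one regex-free option is `str.split()` with no arguments, which splits on any run of whitespace. The helper name `is_single_word` is hypothetical, and this sketch makes no claim about how its speed compares with the plain space check.

```python
def is_single_word(text):
    """True if text holds exactly one whitespace-separated token."""
    return len(text.split()) == 1
```

Leading and trailing whitespace is ignored by this check, so a word followed by a newline still counts as a single word.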

Answer 8

Let's exploit the similarity of the file format to Python's expression syntax:

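from ast import literal_eval

# Each file1 line looks like a dict entry, so join them into one dict literal.
with open("file1") as f:
    word_values = literal_eval('{' + ','.join(line for line in f if line.strip()) + '}')
# Each file2 line is a quoted string literal.
with open("file2") as f:
    expected_words = set(literal_eval(line) for line in f if line.strip())
# Keep only the entries whose word is listed in file2.
word_values = {k: v for (k, v) in word_values.items() if k in expected_words}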

Python, iterate over the lines in a file; if a line equals a line in another file, return the original line