Question

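I have an array containing strings, and I have a text file. I want to go through the text file line by line and check whether each element of my array is present (they must be whole words, not substrings). I am stuck because my script only checks for the presence of the first array element. However, I would like it to return a result for each array element, with a note on whether that array element is present anywhere in the file.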

#!/usr/bin/python



with open("/home/all_genera.txt") as file:

    generaA=[]

    for line in file:
        line=line.strip('\n')
        generaA.append(line)


with open("/home/config/config2.cnf") as config_file:
    counter = 0
    for line in config_file:
        line=line.strip('\n')

        for part in line .split():
            if generaA[counter]in part:
                print (generaA[counter], "is -----> PRESENT")
            else:
                continue
    counter += 1

Answer 1

from collections import Counter
import re

# first normalize the text: lowercase everything and remove punctuation
# (anything that is not alphanumeric or whitespace; keeping whitespace stops
# words on adjacent lines from being glued together)
with open("some.txt") as f:
    normalized_text = re.sub(r"[^a-z0-9\s]", "", f.read().lower())
# note that this normalization is subject to the rules of the language/alphabet/dialect you are using, and English ASCII may not cover it

#counter will collect all the words into a dictionary of [word]:count
words = Counter(normalized_text.split())

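# create a new set of all the words found in both the text and our word list
# (my_word_list_array is your own list of words, lowercased to match the normalized text)
common_words = set(my_word_list_array).intersection(words.keys())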
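
Because `Counter` also records how often each word occurs, the same idea can produce a per-word report instead of only the intersection. The following is a minimal sketch under stated assumptions: the sample text is made up, `count_words` is a hypothetical helper name, and the normalization keeps whitespace so that words on adjacent lines are not merged.

```python
import re
from collections import Counter

def count_words(text, word_list):
    # Lowercase and drop punctuation, keeping whitespace so words stay separated.
    normalized = re.sub(r"[^a-z0-9\s]", "", text.lower())
    counts = Counter(normalized.split())
    # A Counter returns 0 for missing keys, so absent words need no special case.
    return {word: counts[word.lower()] for word in word_list}

# Made-up sample data (assumption, not from the original question).
sample_text = "Homo sapiens, Homo erectus.\nPan troglodytes"
counts_report = count_words(sample_text, ["Homo", "Pan", "Gorilla"])
print(counts_report)
```

A count of 0 means the word does not occur in the text as a whole word.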

Answer 2

If I understand correctly, you want the words that are present in both files. If so, set is your friend:

def parse(f):
    return set(word for line in f for word in line.strip().split())

with open("path/to/genera/file") as f:
    source = parse(f)
with open("path/to/conf/file") as f:
    conf = parse(f)

# elements that are common to both sets
common = conf & source
print(common)

# elements that are in `source` but not in `conf`
print(source - conf)

# elements that are in `conf` but not in `source`
print(conf - source)

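So to answer "I would like it to return a result for each array element, with a note on whether that array element is present anywhere in the file": you can use either the common elements or the source - conf difference to annotate your source list: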

# using common elements
common = conf & source
result = [(word, word in common) for word in source]
print(result)

# using difference
diff = source - conf
result = [(word, word not in diff) for word in source]

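Both will give you the same result, and since set lookup is O(1), performance should be similar too, so I suggest the first solution (positive assertions are easier on the brain than negative ones).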

You can of course apply further cleaning/normalization when building the sets, e.g. if you want a case-insensitive search:

def parse(f):
    return set(word.lower() for line in f for word in line.strip().split())
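
Putting the pieces of this answer together, here is a minimal sketch of the per-element report the question asks for. The file contents are made-up sample data held in `io.StringIO` objects, and `report` is a hypothetical helper name that is not part of the original answer.

```python
import io

def parse(f):
    # Collect every whitespace-separated token of the file into a set.
    return set(word for line in f for word in line.strip().split())

def report(source, conf):
    # Map each word of `source` to True if it also occurs in `conf`.
    common = conf & source
    return {word: word in common for word in source}

# Made-up sample data standing in for the two files (assumption).
genera_file = io.StringIO("Homo\nPan\nGorilla\n")
config_file = io.StringIO("include Homo Pan\nexclude Pongo\n")

result = report(parse(genera_file), parse(config_file))
for word in sorted(result):
    print(word, "is -----> PRESENT" if result[word] else "is -----> ABSENT")
```

Since tokens are split on whitespace only, a token such as `Homo,` with trailing punctuation would not match `Homo`; that is where the extra cleaning step mentioned above would come in.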

Answer 3

The counter is not incremented because it is outside the for loop.

with open("/home/all_genera.txt") as myfile: # don't use 'file' as a variable name: it shadows a built-in in Python 2; use myfile instead

    generaA=[]

    for line in myfile: # use .readlines() if you want a list of lines!
        generaA.append(line.strip()) # strip the trailing newline, otherwise the lookups below rarely match

# if you just need to know whether the strings are present in your file, you can use .read():
with open("/home/config/config2.cnf") as config_file:
    mytext = config_file.read()
    for mystring in generaA:
        if mystring in mytext:
            print("{0} is -----> PRESENT".format(mystring))

# if you want to check whether your string number N is present in line N of your file, you can go with:
with open("/home/config/config2.cnf") as config_file:
    for N, line in enumerate(config_file):
        # guard against the file having more lines than generaA has elements
        if N < len(generaA) and generaA[N] in line:
            print("{0} is -----> PRESENT in line {1}".format(generaA[N], N))

I hope everything is clear.

This code could be improved in many ways, but I tried to keep it similar to yours so that it is easier to understand.
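
Note that the `in` test on a string also matches substrings, whereas the question asks for whole words. One possible way to handle that is sketched here with the standard `re` module, under the assumption that a "word" consists of letters, digits and underscores so that regex word boundaries (`\b`) delimit it. `find_whole_words` and the sample data are hypothetical.

```python
import re

def find_whole_words(words, text):
    # For each word, report whether it occurs in `text` as a whole word.
    # re.escape guards against regex metacharacters in the word itself.
    return {
        word: re.search(r"\b" + re.escape(word) + r"\b", text) is not None
        for word in words
    }

# Made-up sample data (assumption, not from the original question).
genera = ["Homo", "Pan", "Gorilla"]
text = "genus=Homo\nspecies: Panthera leo\n"

presence = find_whole_words(genera, text)
for word in genera:
    status = "PRESENT" if presence[word] else "ABSENT"
    print("{0} is -----> {1}".format(word, status))
```

Here `Pan` is reported as absent even though it is a substring of `Panthera`, which a plain `in` test would have counted as a match.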

How to check whether strings are present in a text file, based on array elements?

3 answers: