Question

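I am working on a project that requires me to search a file for multiple keywords. For example, if I have a file with 100 occurrences of the word "Tomato", 500 of the word "Bread", and 20 of "Pickle", I want to be able to search the file for "Tomato" and "Bread" and get the number of times each one occurs in the file. I was able to find people with the same problem on this site, but only for other languages.

I have a working program that lets me search for a column name and count how many times something shows up in that column, but I want something a bit more precise. Here is my code: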

import os

def start():
    location = raw_input("What is the folder containing the data you like processed located? ")
    #location = "C:/Code/Samples/Dates/2015-06-07/Large-Scale Data Parsing/Data Files"
    if os.path.exists(location) == True: #Tests to see if user entered a valid path
        file_extension = raw_input("What is the file type (.txt for example)? ")
        search_for(location,file_extension)
    else:
        print "I'm sorry, but the file location you have entered does not exist. Please try again."
        start()

def search_for(location,file_extension):
    querylist = []
    n = 5
    while n == 5:
        search_query = raw_input("What would you like to search for in each file? Use'Done' to indicate that you have finished your request. ")
        #list = ["CD90-N5722-15C", "CD90-NB810-4C", "CP90-N2475-8", "CD90-VN530-22B"]
        if search_query == "Done":
            print "Your queries are:",querylist
            print ""
            content = os.listdir(location)
            run(content,file_extension,location,querylist)
            n = 0
        else:
            querylist.append(search_query)
            continue


def run(content,file_extension,location,querylist):
    for item in content:
        if item.endswith(file_extension):
            search(location,item,querylist)
    quit()

def search(location,item,querylist):
    with open(os.path.join(location,item), 'r') as f:
        countlist = []
        for search in querylist: #any search value after the first one is incorrectly reporting "0"
            countsearch = 0
            for line in f:
                if search in line:
                    countsearch = countsearch + 1
            countlist.append(search)
            countlist.append(countsearch) #mechanism to update countsearch is not working for any value after the first
        print item, countlist

start()

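If I use that code, the last part (def search) does not work correctly. Whenever I run a search, every search term after the first one I entered returns "0", even though a term may occur as many as 500,000 times in a file.

I was also wondering, since I have to index 5 files with 1,000,000 lines each, whether there is a way to write an additional function that counts how many times "Lettuce" occurs across all of the files.

I cannot post the files here because of their size and content. Any help would be greatly appreciated.

Edit

I also have this code here. If I use it, I get the correct count for each term, but it would be much better to let the user enter as many searches as they want: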

def check_start():
    #location = raw_input("What is the folder containing the data you like processed located? ")
    location = "C:/Code/Samples/Dates/2015-06-07/Large-Scale Data Parsing/Data Files"
    content = os.listdir(location)
    for item in content:
        if item.endswith("processed"):
            countcol1 = 0
            countcol2 = 0
            countcol3 = 0
            countcol4 = 0
            #print os.path.join(currentdir,item)
            with open(os.path.join(location,item), 'r') as f:
                for line in f:
                    if "CD90-N5722-15C" in line:
                        countcol1 = countcol1 + 1
                    if "CD90-NB810-4C" in line:
                        countcol2 = countcol2 + 1
                    if "CP90-N2475-8" in line:
                        countcol3 = countcol3 + 1
                    if "CD90-VN530-22B" in line:
                        countcol4 = countcol4 + 1
            print item, "CD90-N5722-15C", countcol1, "CD90-NB810-4C", countcol2, "CP90-N2475-8", countcol3, "CD90-VN530-22B", countcol4

Answer 1

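You are trying to iterate over your file multiple times. After the first pass, the file pointer is at the end of the file, so every later search finds nothing: there is nothing left to read.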
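To see the effect in isolation, here is a minimal sketch. It uses io.StringIO as a stand-in for a real opened file, and both the sample text and the helper name count_lines_containing are made up for illustration. A file object is an iterator over its lines; once it has been read to the end it yields nothing more until its position is reset.

```python
from __future__ import print_function  # lets the same code run on Python 2.7 and 3
import io

def count_lines_containing(f, term):
    """Hypothetical helper: count the lines of the open file f that contain term."""
    return sum(1 for line in f if term in line)

# io.StringIO behaves like an opened text file for this purpose.
f = io.StringIO(u"Tomato\nBread\nTomato Bread\n")

first = count_lines_containing(f, u"Tomato")   # reads to the end of the file
second = count_lines_containing(f, u"Bread")   # nothing left to read, so 0
f.seek(0)                                      # rewind to the beginning
third = count_lines_containing(f, u"Bread")    # the lines are read again

print(first, second, third)
```

The second count comes back as 0 even though "Bread" is in the text, which is the same symptom described in the question.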

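If you add the line f.seek(0) at the end of each pass, the pointer is reset to the start of the file, so the next search term reads the file from the beginning: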

def search(location,item,querylist):
    with open(os.path.join(location,item), 'r') as f:
        countlist = []
        for search in querylist:
            countsearch = 0
            for line in f: #reads the file from the current position to the end
                if search in line:
                    countsearch = countsearch + 1
            countlist.append(search)
            countlist.append(countsearch)
            f.seek(0) #rewind so the next search term starts at the top of the file
    print item, countlist

PS. I have guessed at the indentation... you really should not use tabs.
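One related point, in case exact occurrence counts matter: the test "search in line" counts the lines that contain the term, so a line holding the term twice is counted once. str.count counts the non-overlapping occurrences within a line instead. A small sketch of the difference, using made-up sample lines rather than the real data:

```python
from __future__ import print_function  # lets the same code run on Python 2.7 and 3

lines = ["Bread Bread Tomato\n", "Pickle\n", "Bread\n"]  # made-up sample data
term = "Bread"

# Number of lines that contain the term at least once.
lines_with_term = sum(1 for line in lines if term in line)

# Total number of non-overlapping occurrences of the term.
occurrences = sum(line.count(term) for line in lines)

print(lines_with_term, occurrences)
```

Here two lines contain "Bread" but the word occurs three times. Which number is wanted depends on the data; if each term appears at most once per line, the two are the same.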

Answer 2

I'm not sure I fully understood your question, but how about something like this?

import os

def check_start():
    raw_search_terms = raw_input('Enter search terms separated by a comma: ')
    # strip surrounding whitespace so that "Tomato, Bread" gives "Tomato" and "Bread"
    search_term_list = [s.strip() for s in raw_search_terms.split(',')]

    #location = raw_input("What is the folder containing the data you like processed located? ")
    location = "C:/Code/Samples/Dates/2015-06-07/Large-Scale Data Parsing/Data Files"
    content = os.listdir(location)

    for item in content:
        if item.endswith("processed"):
            # create a dictionary of search terms with their counts (initialized to 0)
            search_term_count_dict = dict(zip(search_term_list, [0 for s in search_term_list]))

            # open the file once and test every search term against each line
            with open(os.path.join(location, item), 'r') as f:
                for line in f:
                    for s in search_term_list:
                        if s in line:
                            search_term_count_dict[s] += 1

            # report the counts for this file only
            print item
            for key, value in search_term_count_dict.iteritems():
                print key, value
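For the second part of the question (a total for a term such as "Lettuce" across all files), the same single-pass idea can be extended with a running total. The following is only a sketch under assumptions: the function name count_terms_in_folder and its parameters are hypothetical, it counts lines containing each term (as the code above does), and it is written to run on both Python 2.7 and Python 3.

```python
from __future__ import print_function  # lets the same code run on Python 2.7 and 3
import os
from collections import Counter

def count_terms_in_folder(location, file_extension, search_terms):
    """Hypothetical helper: return (per_file, totals) line counts for each term."""
    per_file = {}
    totals = Counter()
    for item in sorted(os.listdir(location)):
        if not item.endswith(file_extension):
            continue
        # Start every term at 0 so terms that never match are still reported.
        counts = Counter({term: 0 for term in search_terms})
        with open(os.path.join(location, item), 'r') as f:
            for line in f:                  # single pass over each file
                for term in search_terms:
                    if term in line:
                        counts[term] += 1
        per_file[item] = counts
        totals.update(counts)               # add this file's counts to the totals
    return per_file, totals
```

A call such as count_terms_in_folder(location, ".txt", ["Lettuce", "Bread"]) would return the per-file counts plus one combined Counter, so totals["Lettuce"] is the count over every matching file. Each file is read only once regardless of how many terms are searched, which matters for files with a million lines.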

Multi-word search not working correctly (Python)
