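How to search for strings in files in parallel

Date: 2014-02-26 11:11:47

Tags: python

I wrote this simple code to search for strings in a collection of files. It works, but it is not optimal.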

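  1. I have to hard-code the file names (in the `my_file` set)
  2. Some of my files are about 60 MB in size, so the search takes a while
  3. Could someone optimize my code to do the following:

    • read all files in a given directory, without hard-coding the file names
    • search in parallel for speed
    • write the search results to an output.txt file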

      my_file = {"File1.xml", "File2.xml", "File3.xml"}
      my_string = {"John", "Mary", "Clara"}

      for f in my_file:
          for s in my_string:
              # the file is re-read for every search string
              with open(f) as fp:
                  a = fp.read().count(s)
              print(f, ',', s, ',', a)
      

    Thanks

1 answer:

Answer 0 (score: 0)

1. Collect the files:

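import os
from queue import Queue

start_path = "."  # assumption: the directory to search; adjust as needed
files_queue = Queue()

for root, dirs, files in os.walk(start_path):
    for file in files:
        # os.walk yields bare file names, so join each with its directory
        files_queue.put(os.path.join(root, file))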
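
If only some of the files should be searched (the question uses `.xml` files), the walk can be wrapped in a small helper that filters by extension. This is a minimal sketch; the name `list_files` and the `suffix` parameter are my own, not part of any library:

```python
import os


def list_files(start_path, suffix=".xml"):
    """Return the full paths of all files under start_path ending with suffix."""
    paths = []
    for root, _dirs, files in os.walk(start_path):
        for name in files:
            if name.endswith(suffix):
                # join with root so the path can be opened from anywhere
                paths.append(os.path.join(root, name))
    # sort so that the order does not depend on the file system
    return sorted(paths)
```

Each returned path can then be put on `files_queue` instead of the bare file name.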

2. Search in parallel:

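from queue import Empty, Queue
from threading import Thread

words = ["John", "Mary", "Clara"]  # the strings to search for, from the question
res_queue = Queue()
threads = []

def search(files_queue, words, res_queue):
    while True:
        try:
            file = files_queue.get(block=False)
        except Empty:
            return  # no files left, so this worker is done

        with open(file) as fp:
            content = fp.read()

        # count every word in this file's content
        results = {}
        for word in words:
            results[word] = content.count(word)

        # keep the file name next to its counts
        res_queue.put((file, results))
        files_queue.task_done()

# use 10 workers
for _ in range(10):
    thread = Thread(target=search, args=(files_queue, words, res_queue))
    threads.append(thread)
    thread.start()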
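
The same worker pattern can probably be written more compactly with the standard `concurrent.futures` module, which manages the queue and the threads itself. A rough sketch, where `count_words` and `search_files` are names I made up and the paths are assumed to be full, readable file paths:

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial


def count_words(path, words):
    """Read one file and return (path, {word: count})."""
    with open(path) as fp:
        content = fp.read()
    return path, {word: content.count(word) for word in words}


def search_files(paths, words, max_workers=10):
    """Count each word in each file using a pool of worker threads."""
    worker = partial(count_words, words=words)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns the results in the same order as paths
        return list(pool.map(worker, paths))
```

One caveat: in CPython, threads share the global interpreter lock, so the counting itself may not get much faster with threads. Swapping in `ProcessPoolExecutor` is one option to try on large files, though it generally needs an `if __name__ == "__main__":` guard around the calling code.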

3. Collect the results:

# wait until all files are processed
files_queue.join()

# collect results from the queue
results = []
while not res_queue.empty():
    results.append(res_queue.get())

# profit
...
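
For the last point in the question (writing the results to output.txt), something along these lines might do. It is only a sketch and assumes that each entry in `results` is a `(path, counts)` pair, with `counts` mapping each search string to its count. The function name `write_results` and the header row are my own choices:

```python
import csv


def write_results(results, out_path="output.txt"):
    """Write one 'file,string,count' line per (file, search string) pair."""
    # newline="" lets the csv module control the line endings itself
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["file", "string", "count"])
        for path, counts in results:
            # sort the words so the output order is stable
            for word, count in sorted(counts.items()):
                writer.writerow([path, word, count])
```

Using the `csv` module rather than joining with commas by hand keeps the output parseable even if a file name contains a comma.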