I'll try to explain what I'm trying to achieve with my code:
As shown below, it is very slow even though there are only 2 directories and one file in each of them. Each entry in the main file takes about 1 second to process, and I have 400,000 records in that file...
import csv
import os

rootdir = 'C:\Users\ST\Desktop\Sample'
f = open('C:\Users\ST\Desktop\inputIds.csv')
f.readline()
snipscsv_f = csv.reader(f, delimiter=' ')
for row in snipscsv_f:
    print 'processing another ID'
    for subdir, dir, files in os.walk(rootdir):
        print 'processing another folder'
        for file in files:
            print 'processing another file'
            if 'csv' in file:  # I want only csv files to be processed
                ft = open(os.path.join(subdir, file))
                for ftrow in ft:
                    if row[0] in ftrow:
                        print row[0]
                ft.close()
Answer 0 (score: 1)
I know you have a large CSV file, but reading it once and doing the comparisons is still much faster than performing an os.walk for every entry.
Also, I'm not sure Python is the best tool for this; you might find a shell script better suited to this kind of task (on Windows, PowerShell is the only decent option). Anyway, you added the python tag...
import csv
import fnmatch
import os

# load the csv with entries
with open('file_with_entries.csv', 'r') as f:
    readr = csv.reader(f)
    data = []
    for row in readr:
        data.extend(row)

# find csv files
rootdir = os.getcwd()  # could be anywhere
matches = []
for root, dirs, files in os.walk(rootdir):
    for filename in fnmatch.filter(files, '*.csv'):
        matches.append(os.path.join(root, filename))

# find occurrences of each entry in each file
for eachcsv in matches:
    with open(eachcsv, 'r') as f:
        text = f.read()
        for entry in data:
            if entry in text:
                print("found %s in %s" % (entry, eachcsv))
I'm not sure how important it is that you only read the first line of the entries file; it would be fairly easy to modify the code accordingly.
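For example, if the f.readline() call in the original code was meant to skip a header row, a minimal sketch of that change (assuming the same file_with_entries.csv used above) could look like this:

import csv

with open('file_with_entries.csv', 'r') as f:
    readr = csv.reader(f)
    next(readr, None)  # skip the header row, if there is one
    data = []
    for row in readr:
        data.extend(row)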