Any idea why this always writes the same row to the output CSV?
files = glob.glob(path)
csv_file_complete = open("graph_complete_reddit.csv", "wb")
stat_csv_file = open("test_stat.csv", "r")
csv_reader = csv.reader(stat_csv_file)
lemmatizer = WordNetLemmatizer()
for file1, file2 in itertools.combinations(files, 2):
    with open(file1) as f1:
        print(file1)
        f1_text = f1.read()
        f1_words = re.sub("[^a-zA-Z]", ' ', f1_text).lower().split()
        f1_words = [str(lemmatizer.lemmatize(w, wordnet.VERB)) for w in f1_words if w not in stopwords]
        print(f1_words)
        f1.close()
    with open(file2) as f2:
        print(file2)
        f2_text = f2.read()
        f2_words = re.sub("[^a-zA-Z]", ' ', f2_text).lower().split()
        f2_words = [str(lemmatizer.lemmatize(w, wordnet.VERB)) for w in f2_words if w not in stopwords]
        print(f2_words)
        f2.close()

    a_complete = csv.writer(csv_file_complete, delimiter=',')
    print("*****")
    print(file1)
    print(file2)
    print("************************************")

    f1_head, f1_tail = os.path.split(file1)
    print("************")
    print(f1_tail)
    print("**************")
    f2_head, f2_tail = os.path.split(file2)
    print(f2_tail)
    print("********************************")
    for row in csv_reader:
        if f1_tail in row:
            file1_file_number = row[0]
            file1_category_number = row[2]
        if f2_tail in row:
            file2_file_number = row[0]
            file2_category_number = row[2]

    row_complete = [file1_file_number, file2_file_number, file1_category_number, file2_category_number]
    a_complete.writerow(row_complete)

csv_file_complete.close()
The print statements show different file names on every iteration!
This is the test_stat.csv file that the code uses as input:
1,1bmmoc.txt,1
2,2b3u1a.txt,1
3,2mf64u.txt,2
4,4x74k3.txt,5
5,lsspe.txt,3
6,qbimg.txt,4
7,w95fm.txt,2
And this is what the code outputs:
1 7,4,2,5
2 7,4,2,5
3 7,4,2,5
4 7,4,2,5
5 7,4,2,5
6 7,4,2,5
7 7,4,2,5
8 7,4,2,5
9 7,4,2,5
10 7,4,2,5
11 7,4,2,5
12 7,4,2,5
13 7,4,2,5
14 7,4,2,5
15 7,4,2,5
16 7,4,2,5
17 7,4,2,5
18 7,4,2,5
19 7,4,2,5
20 7,4,2,5
21 7,4,2,5
Please comment or suggest a fix.
Answer 0 (score: 1)
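You never rewind stat_csv_file, so eventually your loop over csv_reader (which is just a wrapper around stat_csv_file) doesn't loop at all, and you write whatever you found on the last pass that did.
Basically, the logic is: on the first pass of the outer loop, you scan through csv_reader, find the match (though you keep looking after you find it, which exhausts the file), and write the hit. On every later pass the reader is already exhausted, so the inner loop body never runs and the same stale values get written again.
The slow but direct way to fix this is to add stat_csv_file.seek(0) before you search: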
    print(f2_tail)
    print("********************************")
    stat_csv_file.seek(0)  # Rewind to rescan input from beginning
    for row in csv_reader:
        if f1_tail in row:
            file1_file_number = row[0]
            file1_category_number = row[2]
        if f2_tail in row:
            file2_file_number = row[0]
            file2_category_number = row[2]
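To see the exhaustion on its own, here is a minimal sketch that swaps the real file for an in-memory io.StringIO holding the first three rows of the sample input. The names first_pass, second_pass and third_pass are only there for illustration.

```python
import csv
import io

# Stand-in for test_stat.csv; io.StringIO behaves like an open text file.
stat_csv_file = io.StringIO("1,1bmmoc.txt,1\n2,2b3u1a.txt,1\n3,2mf64u.txt,2\n")
csv_reader = csv.reader(stat_csv_file)

first_pass = [row for row in csv_reader]   # consumes every row
second_pass = [row for row in csv_reader]  # nothing left to read

stat_csv_file.seek(0)                      # rewind the underlying file
third_pass = [row for row in csv_reader]   # rows are readable again

print(len(first_pass), len(second_pass), len(third_pass))
```

The second pass yields no rows, which is what happens to the inner loop in the question on every pair after the first one.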
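A possibly better approach is to load the input CSV into a dict once, then perform lookups there as needed, trading repeated (slow) file I/O for fast dict lookups. The cost is higher memory usage. If the input CSV is small enough, that isn't a problem; if it is huge, you may need a proper database to get fast lookups without running out of memory.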
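For the large-input case, one possible sketch uses the standard library's sqlite3 module. The table name stats, the column names, and the lookup helper are assumptions made for this example, and the in-memory database would be replaced by a file path if the data really doesn't fit in RAM.

```python
import csv
import io
import sqlite3

# Stand-in for a (possibly very large) test_stat.csv.
stat_csv_file = io.StringIO("1,1bmmoc.txt,1\n2,2b3u1a.txt,1\n3,2mf64u.txt,2\n")

# ":memory:" keeps the sketch self-contained; pass a file path for on-disk storage.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stats (file_number TEXT, file_tail TEXT PRIMARY KEY, category_number TEXT)"
)
# Each CSV row is a three-item list, matching the three placeholders.
conn.executemany("INSERT INTO stats VALUES (?, ?, ?)", csv.reader(stat_csv_file))
conn.commit()


def lookup(tail):
    """Return (file_number, category_number) for tail, or None if it is absent."""
    return conn.execute(
        "SELECT file_number, category_number FROM stats WHERE file_tail = ?",
        (tail,),
    ).fetchone()


print(lookup("2mf64u.txt"))
print(lookup("missing.txt"))
```

The primary key on file_tail gives indexed lookups, so each query avoids rescanning the whole table.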
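It's a little unclear what the logic is supposed to be here, since your input and output don't line up (your output should begin with a repeated number, but for some reason it doesn't?). But if the intent is that each input row contains file_number, file_tail, category_number, then you could begin your code (above the top-level loop) with: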
# Create mapping from second field to associated first and third fields
tail_to_numbers = {ftail: (fnum, cnum) for fnum, ftail, cnum in csv_reader}
Then replace:
    for row in csv_reader:
        if f1_tail in row:
            file1_file_number = row[0]
            file1_category_number = row[2]
        if f2_tail in row:
            file2_file_number = row[0]
            file2_category_number = row[2]
    row_complete = [file1_file_number, file2_file_number, file1_category_number, file2_category_number]
    a_complete.writerow(row_complete)
with the simpler and faster:
    try:
        file1_file_number, file1_category_number = tail_to_numbers[f1_tail]
        file2_file_number, file2_category_number = tail_to_numbers[f2_tail]
    except KeyError:
        # One of the tails wasn't found in the lookup dict, so don't output
        # (variables would be stale or unset); optionally emit some error to stderr
        continue
    else:
        # Found both tails, output associated values
        row_complete = [file1_file_number, file2_file_number, file1_category_number, file2_category_number]
        a_complete.writerow(row_complete)
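Putting the dict approach together, here is a self-contained sketch that can be run without any files on disk. It assumes the three-column layout described above; the paths under data/ (including the deliberately unknown one) and the in-memory output buffer are made up for the example, and the text-processing steps from the question are left out.

```python
import csv
import io
import itertools
import os

# Stand-in for test_stat.csv (first three rows of the sample input).
stat_csv_file = io.StringIO("1,1bmmoc.txt,1\n2,2b3u1a.txt,1\n3,2mf64u.txt,2\n")

# Read the CSV exactly once: file name -> (file number, category number).
tail_to_numbers = {ftail: (fnum, cnum) for fnum, ftail, cnum in csv.reader(stat_csv_file)}

# Hypothetical paths standing in for glob.glob(path); the last has no CSV entry.
files = ["data/1bmmoc.txt", "data/2b3u1a.txt", "data/2mf64u.txt", "data/unknown.txt"]

output = io.StringIO()  # stand-in for graph_complete_reddit.csv
a_complete = csv.writer(output, delimiter=",", lineterminator="\n")

for file1, file2 in itertools.combinations(files, 2):
    f1_tail = os.path.basename(file1)
    f2_tail = os.path.basename(file2)
    try:
        file1_file_number, file1_category_number = tail_to_numbers[f1_tail]
        file2_file_number, file2_category_number = tail_to_numbers[f2_tail]
    except KeyError:
        continue  # skip pairs where either file is missing from the CSV
    a_complete.writerow(
        [file1_file_number, file2_file_number, file1_category_number, file2_category_number]
    )

print(output.getvalue())
```

Each pair now gets its own row, and any pair involving the unknown file is skipped instead of reusing values from an earlier iteration.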