CSV writing does not work properly

Time: 2016-10-14 01:03:03

Tags: python file csv itertools

Any idea why this always writes the same row to the output CSV?

    files = glob.glob(path)
    csv_file_complete = open("graph_complete_reddit.csv", "wb")
    stat_csv_file = open("test_stat.csv", "r")
    csv_reader = csv.reader(stat_csv_file)
    lemmatizer = WordNetLemmatizer()
    for file1, file2 in itertools.combinations(files, 2):
        with open(file1) as f1:
            print(file1)
            f1_text = f1.read()
            f1_words = re.sub("[^a-zA-Z]", ' ', f1_text).lower().split()
            f1_words = [str(lemmatizer.lemmatize(w, wordnet.VERB)) for w in f1_words if w not in stopwords]
            print(f1_words)
        f1.close()
        with open(file2) as f2:
            print(file2)
            f2_text = f2.read()
            f2_words = re.sub("[^a-zA-Z]", ' ', f2_text).lower().split()
            f2_words = [str(lemmatizer.lemmatize(w, wordnet.VERB)) for w in f2_words if w not in stopwords]
            print(f2_words)
        f2.close()

        a_complete = csv.writer(csv_file_complete, delimiter=',')
        print("*****")
        print(file1)
        print(file2)
        print("************************************")

        f1_head, f1_tail = os.path.split(file1)
        print("************")
        print(f1_tail)
        print("**************")
        f2_head, f2_tail = os.path.split(file2)
        print(f2_tail)
        print("********************************")
        for row in csv_reader:
            if f1_tail in row:
                file1_file_number = row[0]
                file1_category_number = row[2]
            if f2_tail in row:
                file2_file_number = row[0]
                file2_category_number = row[2]

        row_complete = [file1_file_number, file2_file_number, file1_category_number, file2_category_number ]
        a_complete.writerow(row_complete)

    csv_file_complete.close()

The print statements show different file names!

This is the test_stat.csv file the code uses as input:

    1,1bmmoc.txt,1
    2,2b3u1a.txt,1
    3,2mf64u.txt,2
    4,4x74k3.txt,5
    5,lsspe.txt,3
    6,qbimg.txt,4
    7,w95fm.txt,2

Here is what the code outputs:

  1 7,4,2,5
  2 7,4,2,5
  3 7,4,2,5
  4 7,4,2,5
  5 7,4,2,5
  6 7,4,2,5
  7 7,4,2,5
  8 7,4,2,5
  9 7,4,2,5
 10 7,4,2,5
 11 7,4,2,5
 12 7,4,2,5
 13 7,4,2,5
 14 7,4,2,5
 15 7,4,2,5
 16 7,4,2,5
 17 7,4,2,5
 18 7,4,2,5
 19 7,4,2,5
 20 7,4,2,5
 21 7,4,2,5

Please comment or suggest a fix.

1 answer:

Answer 0: (score: 1)

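You never rewind stat_csv_file, so eventually your loop over csv_reader (which is a wrapper around stat_csv_file) does not loop at all, and you write whatever you found on the last loop that did run. Basically, the logic is:

  1. On the first loop, scan all of csv_reader and find the matches (you keep scanning after finding them, which exhausts the file), then write the hits
  2. On all subsequent loops, the file is already exhausted, so the inner search loop never executes, and you end up writing the same values as last time

The slow but straightforward way to fix this is to add stat_csv_file.seek(0) before you search: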

        print(f2_tail)
        print("********************************")
        stat_csv_file.seek(0)  # Rewind to rescan input from beginning
        for row in csv_reader:
            if f1_tail in row:
                file1_file_number = row[0]
                file1_category_number = row[2]
            if f2_tail in row:
                file2_file_number = row[0]
                file2_category_number = row[2]
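
The exhaustion behavior described above can be reproduced in isolation. The following is a minimal sketch, assuming Python 3 and using an in-memory io.StringIO as a stand-in for the real test_stat.csv file (only the first three rows from the question are included; the names first_pass, second_pass and third_pass are illustrative):

```python
import csv
import io

# Stand-in for test_stat.csv; rows copied from the question.
STAT_CSV = """1,1bmmoc.txt,1
2,2b3u1a.txt,1
3,2mf64u.txt,2
"""

stat_csv_file = io.StringIO(STAT_CSV)
csv_reader = csv.reader(stat_csv_file)

first_pass = [row for row in csv_reader]   # consumes the whole file
second_pass = [row for row in csv_reader]  # nothing left, so the loop body never runs
stat_csv_file.seek(0)                      # rewind the underlying file object
third_pass = [row for row in csv_reader]   # rows are available again

print(len(first_pass), len(second_pass), len(third_pass))
```

The second pass is empty because the reader only pulls lines from the underlying file, which is already at end-of-file; rewinding the file is what makes the rows visible again.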
    

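A potentially better approach is to load the input CSV into a dict once, then perform lookups there as needed, avoiding repeated (slow) I/O in favor of fast dict lookups. The cost is higher memory use; if the input CSV is small enough, that is not a problem, and if it is huge, you may need a proper database to get fast lookups without blowing memory.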

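It is a little unclear what the logic should be here, since your input and output don't align (your output should begin with duplicated numbers, but for some reason it doesn't?). But if the intent is that each input row contains file_number, file_tail, category_number, you could begin your code with the following (above the top-level loop):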

    # Create mapping from second field to associated first and third fields
    tail_to_numbers = {ftail: (fnum, cnum) for fnum, ftail, cnum in csv_reader}
    

Then replace:

        for row in csv_reader:
            if f1_tail in row:
                file1_file_number = row[0]
                file1_category_number = row[2]
            if f2_tail in row:
                file2_file_number = row[0]
                file2_category_number = row[2]
    
        row_complete = [file1_file_number, file2_file_number, file1_category_number, file2_category_number ]
        a_complete.writerow(row_complete)
    

with the simpler and faster:

    try:
        file1_file_number, file1_category_number = tail_to_numbers[f1_tail]
        file2_file_number, file2_category_number = tail_to_numbers[f2_tail]
    except KeyError:
        # One of the tails wasn't found in the lookup dict, so don't output
        # (variables would be stale or unset); optionally emit some error to stderr
        continue
    else:
        # Found both tails, output associated values
        row_complete = [file1_file_number, file2_file_number, file1_category_number, file2_category_number]
        a_complete.writerow(row_complete)
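
Putting the pieces together, the dict-based version might look like the sketch below. It assumes Python 3, uses in-memory io.StringIO objects in place of test_stat.csv and the output file, and the entries in files (including data/missing.txt) are hypothetical stand-ins for the result of glob.glob(path); the text processing from the question is omitted.

```python
import csv
import io
import itertools
import os

# Stand-in for test_stat.csv; first four rows copied from the question.
STAT_CSV = """1,1bmmoc.txt,1
2,2b3u1a.txt,1
3,2mf64u.txt,2
4,4x74k3.txt,5
"""

# Hypothetical paths standing in for glob.glob(path).
files = ["data/1bmmoc.txt", "data/2b3u1a.txt", "data/2mf64u.txt", "data/missing.txt"]

# Read the stats file once: file name -> (file number, category number).
tail_to_numbers = {
    ftail: (fnum, cnum) for fnum, ftail, cnum in csv.reader(io.StringIO(STAT_CSV))
}

output = io.StringIO()  # stand-in for graph_complete_reddit.csv
writer = csv.writer(output, delimiter=",", lineterminator="\n")

for file1, file2 in itertools.combinations(files, 2):
    f1_tail = os.path.basename(file1)
    f2_tail = os.path.basename(file2)
    try:
        file1_file_number, file1_category_number = tail_to_numbers[f1_tail]
        file2_file_number, file2_category_number = tail_to_numbers[f2_tail]
    except KeyError:
        # One of the file names is not in the stats file, so skip this pair.
        continue
    writer.writerow([file1_file_number, file2_file_number,
                     file1_category_number, file2_category_number])

rows = output.getvalue().splitlines()
print(rows)
```

Because the lookup table is built once before the loop, no rewinding is needed, and pairs involving a file that is missing from the stats file are skipped rather than written with stale values.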