Question

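I am trying to write a large amount of data from a dictionary to a CSV file, but the writing stops after roughly one million rows. Here is the code: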

import os
from nltk import ngrams

with open('four_grams.csv', 'w') as f:
    for i in os.listdir(r'C:\Users\rocki\Downloads\Compressed\train'):
        if i.endswith('.bytes'):
            with open(i) as file:
                content=file.read()
                new_content = ' '.join([w for w in content.split() if len(w)<3])
                four_grams=ngrams(new_content.split(), 4)
                grams_dict={}
                for grams in four_grams:
                    gram=' '.join(grams)
                    if gram not in grams_dict:
                        grams_dict[gram]=1
                    else:
                        grams_dict[gram]=grams_dict[gram]+1
                    for key in grams_dict.keys():
                        f.write("%s,%s\n"%(key,grams_dict[key]))

Any suggestions on how to achieve this?

Answer 1

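I think you will want to use pandas to write the CSV. This code assumes that every grams_dict has the same structure. I have not yet had pandas die on me when writing large CSV files. Hopefully it runs smoothly for you!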

import os

import pandas as pd
from nltk import ngrams

train_dir = r'C:\Users\rocki\Downloads\Compressed\train'
saved_dfs = [] # Create an empty list where we will save each new dataframe (grams_dict) created.

for i in os.listdir(train_dir):
    if i.endswith('.bytes'):
        # os.listdir returns bare file names, so join them with the directory
        with open(os.path.join(train_dir, i)) as file:
            content=file.read()
            new_content = ' '.join([w for w in content.split() if len(w)<3])
            four_grams=ngrams(new_content.split(), 4)
            grams_dict={}
            for grams in four_grams:
                gram=' '.join(grams)
                if gram not in grams_dict:
                    grams_dict[gram]=1
                else:
                    grams_dict[gram]=grams_dict[gram]+1
            # A dict of scalar values cannot be passed to pd.DataFrame directly,
            # so build a two-column frame from its (gram, count) pairs
            df = pd.DataFrame(list(grams_dict.items()), columns=['gram', 'count']) # create a new DataFrame for each file opened
            saved_dfs.append(df)

final_grams_dict = pd.concat(saved_dfs, ignore_index=True) # Combine all of the saved grams_dict's into one DataFrame Object

final_grams_dict.to_csv('path.csv', index=False)

Good luck!

Answer 2

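Are you sure you know where the code (or the file viewer) is choking? You are talking about millions of rows, so your code may well be choking on the list produced by .split(). Lists are known to get slow as they grow large. Without any hint about the actual data, there is no way to know.

Anyway, here is a version that limits the size of the lists. To make it a runnable example, your actual I/O is replaced by some fake lines.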

import os
from itertools import islice
from nltk import ngrams
from io import StringIO
from collections import defaultdict

string_file = """
1 2 3 a b c ab cd ef
4 5 6 g h i gh ij kl
abcde fghijkl
"""

read_lines = 2 # choose something that does not make too long lists for .split()
csvf = StringIO()
#with open('four_grams.csv', 'w') as csvf:
if True: # just for indentation, stands in for the with block
#    for i in os.listdir(r'C:\Users\rocki\Downloads\Compressed\train'):
    for i in range(1): # for the indentation
#        if i.endswith('.bytes'):
#            with open(i) as bfile:
                bfile = StringIO(string_file)
                memory_line = ''
                grams_dict = defaultdict(int)
                while True:
                    # take the next read_lines lines; readlines(n) would not do this,
                    # because its argument is a size hint, not a line count
                    tmp = list(islice(bfile, read_lines))
                    if not tmp:
                        break
                    # prepend the last line of the previous chunk so that grams
                    # spanning the chunk boundary are not lost
                    content = ' '.join([memory_line] + tmp)
                    memory_line = tmp[-1]
                    new_content = ' '.join([w for w in content.split() if len(w)<3])
                    four_grams = ngrams(new_content.split(), 4)
                    for grams in four_grams:
                        #print(grams, len(grams_dict))
                        gram=' '.join(grams)
                        grams_dict[gram] += 1
                for k, v in grams_dict.items():
                    # assuming that it's enough to write the dict
                    # when it's filled rather than duplicating info
                    # in the resulting csv
                    csvf.write("%s\t%s\n"%(k, v))
                csvf.flush() # writes buffer if anything there
#print(grams_dict)

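If it really is your dictionary that gets too large, you should split that up as well. One way to do this is to make a two-level dictionary: use the characters of string.ascii_letters as the first-level keys, and as the second level keep a grams_dict that only holds the keys starting with the corresponding single character.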
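A minimal sketch of that two-level idea, not taken from the code above. The helper name count_grams_by_first_char and the sample grams are hypothetical, and it buckets on whatever the first character happens to be rather than only on string.ascii_letters, on the assumption that tokens may also start with digits.

```python
from collections import defaultdict


def count_grams_by_first_char(grams):
    """Count grams in a two-level dict: first character -> gram -> count."""
    buckets = defaultdict(lambda: defaultdict(int))
    for gram in grams:
        if gram:  # skip empty strings, they have no first character
            buckets[gram[0]][gram] += 1
    return buckets


# Hypothetical sample data standing in for the real 4-grams
sample = ["ab cd ef gh", "ab cd ef gh", "1 2 3 a", "cd ef gh ij"]
buckets = count_grams_by_first_char(sample)

# Each inner dict can be handled on its own, one first character at a time
lines = []
for first_char in sorted(buckets):
    for gram, count in buckets[first_char].items():
        lines.append("%s\t%s" % (gram, count))
print("\n".join(lines))
```

Each inner dictionary can then be written out separately, for example into one file per first character, so no single structure has to hold every key.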

Finally, you can skip the use of memory_line. While it is there, anything in it gets counted twice, but if your read_lines is a fairly large number I would not bother about that.

Answer 3

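It turned out that the program had not failed to write; rather, Excel could not fully load that much data. An Excel worksheet holds at most 1,048,576 rows, which matches the point at which the data appeared to stop. Check the delimited file itself to verify that the data was written completely as required.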
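One way to run that check without a spreadsheet is to count the rows directly in Python. This is only a sketch: count_csv_rows is a hypothetical helper, and the demo file and its three rows are made up for illustration.

```python
import csv
import os
import tempfile


def count_csv_rows(path):
    """Count rows by streaming the file, so no viewer row limit applies."""
    with open(path, newline="") as f:
        return sum(1 for _ in csv.reader(f))


# Small self-check on a throwaway file with made-up rows
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.csv")
    with open(demo_path, "w", newline="") as f:
        csv.writer(f).writerows([("a b c d", 2), ("e f g h", 1), ("i j k l", 5)])
    demo_rows = count_csv_rows(demo_path)

print(demo_rows)
```

On the real output this would be called as count_csv_rows('four_grams.csv'), and the result compared with the number of entries the program was expected to write.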

Answer 4

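It looks like you are writing one line at a time. This may cause I/O problems.

Try writing several lines per call instead of one line at a time. Start by writing two lines per call, and add another line if it still stops.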
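A rough sketch of writing in batches, assuming the counts live in a plain dict. The function write_counts_in_batches and its batch_size parameter are hypothetical names, and an in-memory StringIO stands in for the real file. Note that Python file objects already buffer writes, so the gain from batching may be small.

```python
import csv
import io


def write_counts_in_batches(counts, out, batch_size=1000):
    """Write (key, count) rows to out, batch_size rows per writerows call."""
    writer = csv.writer(out)
    batch = []
    for key, value in counts.items():
        batch.append((key, value))
        if len(batch) >= batch_size:
            writer.writerows(batch)
            batch.clear()
    if batch:  # write whatever is left over after the last full batch
        writer.writerows(batch)


# Made-up counts, written two rows at a time into an in-memory buffer
demo_counts = {"a b c d": 2, "e f g h": 1, "i j k l": 5}
out = io.StringIO()
write_counts_in_batches(demo_counts, out, batch_size=2)
written = out.getvalue().splitlines()
print(written)
```

For a real file, out would be the object returned by open('four_grams.csv', 'w', newline='').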

How do I write a dictionary with a large amount of data to a CSV file using Python?

4 answers: