I have been using pandas on CSV files to extract some values from them. My data looks like this:
"A",23.495,41.995,"this is a sentence with some words"
"B",52.243,0.118,"More text but contains WORD1"
"A",119.142,-58.289,"Also contains WORD1"
"B",423.2535,292.3958,"Doesn't contain anything of interest"
"C",12.413,18.494,"This string contains WORD2"
I have a simple script that reads the CSV and computes, per group, how often each WORD occurs, so the output looks like this:
group freqW1 freqW2
A 1 0
B 1 0
C 0 1
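I then perform some further operations on the values. The problem is that I now have to handle very large CSV files (20+ GB) that do not fit in memory. I tried the chunksize=x option of pd.read_csv, but because a 'TextFileReader' object is not subscriptable, I could not perform the necessary operations on the chunks.
I suspect there is some simple way to iterate over the CSV and do what I want.
My code looks like this: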
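import pandas as pd
from collections import Counter

# Reads the whole file into memory, which is what fails for 20+ GB files.
df = pd.read_csv("csvfile.txt", sep=",", header=None,
                 names=["group", "val1", "val2", "text"])
freq = Counter(df["group"])  # number of rows per group
# Number of rows per group whose text contains each word.
word1 = df[df["text"].str.contains("WORD1")].groupby("group").size()
word2 = df[df["text"].str.contains("WORD2")].groupby("group").size()
df1 = pd.concat([pd.Series(freq), word1, word2], axis=1)
with open("csv_out.txt", "w", encoding="utf-8") as outfile:
    df1.to_csv(outfile, sep=",")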
Answer 0 (score: 1)
You can specify the chunksize option in the read_csv call. See here for details.
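As a minimal sketch of what that looks like in practice: with chunksize set, read_csv returns a TextFileReader, which is an iterator rather than a DataFrame. That is why indexing it fails, while looping over it yields one ordinary DataFrame per chunk. The in-memory sample and the tiny chunk size below are assumptions made only so the example runs on its own; a real run would pass the file path and a much larger chunksize.

```python
import io
from collections import Counter

import pandas as pd

# In-memory stand-in for the large file (same layout as the question's data).
sample = io.StringIO(
    '"A",23.495,41.995,"this is a sentence with some words"\n'
    '"B",52.243,0.118,"More text but contains WORD1"\n'
    '"A",119.142,-58.289,"Also contains WORD1"\n'
    '"B",423.2535,292.3958,"Doesn\'t contain anything of interest"\n'
    '"C",12.413,18.494,"This string contains WORD2"\n'
)

reader = pd.read_csv(
    sample,
    header=None,
    names=["group", "val1", "val2", "text"],
    chunksize=2,  # tiny on purpose; a real file would use a far larger value
)

rows_per_group = Counter()
chunk_sizes = []
for chunk in reader:  # each chunk is a regular DataFrame
    chunk_sizes.append(len(chunk))
    rows_per_group.update(chunk["group"])  # accumulate across chunks
```

Only one chunk is held in memory at a time, so any per-chunk result has to be merged into a running total, as rows_per_group is here.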
Alternatively, you can use Python's csv library and create your own csv reader or DictReader, then use it to read the data in chunks of whatever size you choose.
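A rough sketch of the csv-library route, using only the standard library. It streams one row at a time instead of explicit chunks, which keeps memory use flat. The function name count_words_by_group and the WORDS tuple are hypothetical names chosen for illustration, and the code assumes every row has exactly four fields, as in the question's sample.

```python
import csv
import io
from collections import defaultdict

WORDS = ("WORD1", "WORD2")  # hypothetical: the words to look for


def count_words_by_group(lines, words=WORDS):
    """Stream CSV rows and count, per group, the rows whose text contains each word."""
    counts = defaultdict(lambda: dict.fromkeys(words, 0))
    # Assumes four fields per row: group, val1, val2, text.
    for group, _val1, _val2, text in csv.reader(lines):
        row_counts = counts[group]  # also registers groups that never match
        for word in words:
            if word in text:
                row_counts[word] += 1
    return dict(counts)


# In-memory stand-in for the large file (same layout as the question's data).
sample = io.StringIO(
    '"A",23.495,41.995,"this is a sentence with some words"\n'
    '"B",52.243,0.118,"More text but contains WORD1"\n'
    '"A",119.142,-58.289,"Also contains WORD1"\n'
    '"B",423.2535,292.3958,"Doesn\'t contain anything of interest"\n'
    '"C",12.413,18.494,"This string contains WORD2"\n'
)

result = count_words_by_group(sample)
```

For a real file, the same function should accept an open file object, for example one opened with open("csvfile.txt", newline="", encoding="utf-8").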
Answer 1 (score: 0)
OK, I had misunderstood the chunksize parameter. I solved it like this:
import pandas as pd
from collections import Counter

frame = pd.DataFrame()  # running totals across all chunks
chunks = pd.read_csv("csvfile.txt", sep=",", header=None,
                     names=["group", "val1", "val2", "text"],
                     chunksize=1000000)
for df in chunks:  # each chunk is a regular DataFrame
    freq = Counter(df["group"])
    word1 = df[df["text"].str.contains("WORD1")].groupby("group").size()
    word2 = df[df["text"].str.contains("WORD2")].groupby("group").size()
    df1 = pd.concat([pd.Series(freq), word1, word2], axis=1)
    # Add this chunk's counts to the totals; a group missing on one side counts as 0.
    frame = frame.add(df1, fill_value=0)
with open("csv_out.txt", "w", encoding="utf-8") as outfile:
    frame.to_csv(outfile, sep=",")
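The step that makes this work is frame.add(df1, fill_value=0). The sketch below isolates it with two hand-made per-chunk results. The column names freqW1 and freqW2 come from the question's desired output, and the values are made up for illustration. Without fill_value, a group present in only one of the two frames would come out as NaN instead of keeping its count.

```python
import pandas as pd

total = pd.DataFrame()  # empty accumulator, as in the answer above

# Hypothetical per-chunk counts; group "B" appears only in the first chunk
# and group "C" only in the second.
chunk1 = pd.DataFrame({"freqW1": [1, 1], "freqW2": [0, 0]}, index=["A", "B"])
chunk2 = pd.DataFrame({"freqW1": [1, 0], "freqW2": [0, 1]}, index=["A", "C"])

for partial in (chunk1, chunk2):
    # Rows and columns are aligned by label; a cell missing on one side is
    # treated as 0 rather than turning the sum into NaN.
    total = total.add(partial, fill_value=0)

# The alignment step produces floats, so cast back to integer counts.
total = total.astype(int).sort_index()
```

Note that a cell that is NaN in both operands stays NaN, so a group that never matches a word in any chunk may still need a final fillna(0).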