I am analyzing the Google Ngram dataset, which can be downloaded here: http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
It is a tab-separated file with no header row. The data looks like this:
financial analysis 2000 3 3 1
financial analysis 2002 2 2 1
financial analysis 2004 1 1 1
financial analysis 2005 4 4 3
financial analysis 2006 10 10 7
financial analysis 2007 47 37 17
financial analysis 2008 63 54 31
financial capacity 1899 1 1 1
financial capacity 1997 2 2 2
financial capacity 1998 4 4 2
financial capacity 1999 3 3 3
financial capacity 2000 4 2 2
financial capacity 2003 1 1 1
financial capacity 2004 4 4 3
financial capacity 2005 2 2 2
financial capacity 2006 2 2 2
financial capacity 2007 26 24 17
financial capacity 2008 26 25 19
financial straits 1998 2 2 2
financial straits 1999 1 1 1
financial straits 2000 1 1 1
financial straits 2002 3 3 3
financial straits 2004 1 1 1
financial straits 2005 6 6 6
financial straits 2006 8 8 6
financial straits 2007 8 8 8
financial straits 2008 23 23 20
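What I want is a DataFrame that keeps only the total of the count column (the column right after the year) for each ngram, summed across all years. So I want something like this: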
financial analysis 130
financial capacity 75
financial straits 53
Here is my attempt. The raw data is spread across 100 files, so it starts with a for loop.
import pandas as pd
import zipfile
df_ngram = pd.DataFrame(columns=['ngram','count'])
### read files from 0 to 99, open csv in chunks to avoid memory error
for i in range(1):
    z = zipfile.ZipFile("googlebooks-eng-all-2gram-20090715-"+str(i)+".csv.zip")
    reader = pd.read_csv(z.open("googlebooks-eng-all-2gram-20090715-"+str(i)+".csv"), deli$
    ### iterate over chunks and aggregate data
    for chunk in reader:
        agg_chunk = chunk.groupby(['ngram'])['count'].sum()
        print agg_chunk.head(5)
        df_ngram.append(agg_chunk)
        print df_ngram.tail(5)
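I use groupby to aggregate each chunk and try to save the result in another DataFrame (df_ngram), but nothing seems to get appended. Below is the output when I run it. I'm not sure how to handle the groupby result. How can I accumulate the groupby results across chunks? Or can I get what I want without using groupby?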
ngram
"warmongers 5339
"warns 55904
"warplanes 4939
"warranlo 181
"warrantizabimus 107
Name: count, dtype: int64
Empty DataFrame
Columns: [ngram, count]
Index: []
ngram
"wildbores 65
"wildebeest 12003
"wildlooking 318
"wilfnlness 52
"wilrde 79
Name: count, dtype: int64
Empty DataFrame
Columns: [ngram, count]
Index: []
ngram
"Évora 155
"Österreichs 507
"Übers 159
"échappent 84
"égal 537
Name: count, dtype: int64
Empty DataFrame
Columns: [ngram, count]
Index: []
Answer 0 (score: 0)
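You need to append the intermediate results to a list and then concatenate them. DataFrame.append returns a new object instead of modifying df_ngram in place, which is why df_ngram stays empty.
Before the outer for loop, add df_agg = list().
Instead of df_ngram.append(agg_chunk), use df_agg.append(agg_chunk).
Finally, after the outermost loop has finished, add:
df_ngram = pd.concat(df_agg).groupby(level=0).sum().reset_index()
Each agg_chunk is a Series indexed by ngram, so the concatenated index is kept and grouped again to merge ngrams whose rows fall in more than one chunk or file.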
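The whole approach can be sketched end to end as follows. This is only a minimal sketch under some assumptions: the column names in COLUMNS are made up (the raw files have no header), aggregate_counts is a hypothetical helper name, the chunk size is arbitrary, and quoting=csv.QUOTE_NONE is assumed because the real data contains tokens that start with a double quote. The source argument can be any path or file-like object, for example the z.open(...) handle from the question; a small in-memory sample is used here so the sketch runs on its own.

```python
import csv
import io

import pandas as pd

# Assumed column names: the raw n-gram files have no header row.
COLUMNS = ["ngram", "year", "count", "pages", "books"]


def aggregate_counts(source, chunksize=100000):
    """Sum the 'count' column per ngram, reading the source in chunks."""
    reader = pd.read_csv(
        source,
        sep="\t",
        header=None,
        names=COLUMNS,
        usecols=["ngram", "count"],  # skip columns that are not needed
        quoting=csv.QUOTE_NONE,      # treat stray double quotes as plain text
        chunksize=chunksize,
    )
    # One partial Series (indexed by ngram) per chunk.
    partial_sums = [chunk.groupby("ngram")["count"].sum() for chunk in reader]
    # The same ngram can appear in several chunks, so group again by the index.
    totals = pd.concat(partial_sums).groupby(level=0).sum()
    return totals.reset_index()


# A few rows from the question, tab-separated.
sample = (
    "financial analysis\t2007\t47\t37\t17\n"
    "financial analysis\t2008\t63\t54\t31\n"
    "financial capacity\t2007\t26\t24\t17\n"
    "financial capacity\t2008\t26\t25\t19\n"
    "financial straits\t2008\t23\t23\t20\n"
)

# chunksize=3 forces "financial capacity" to straddle two chunks.
df_ngram = aggregate_counts(io.StringIO(sample), chunksize=3)
print(df_ngram)
```

To process all 100 files, one option is to collect the partial Series from every file in the same list and run the final concat and groupby once at the end, so that ngrams split across files are merged as well.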