Question

I am analyzing the Google Ngram dataset, which can be downloaded here: http://storage.googleapis.com/books/ngrams/books/datasetsv2.html

It is a tab-separated file with no header row, and the data looks like this.

financial analysis       2000    3       3       1
financial analysis       2002    2       2       1
financial analysis       2004    1       1       1
financial analysis       2005    4       4       3
financial analysis       2006    10      10      7
financial analysis       2007    47      37      17
financial analysis       2008    63      54      31
financial capacity       1899    1       1       1
financial capacity       1997    2       2       2
financial capacity       1998    4       4       2
financial capacity       1999    3       3       3
financial capacity       2000    4       2       2
financial capacity       2003    1       1       1
financial capacity       2004    4       4       3
financial capacity       2005    2       2       2
financial capacity       2006    2       2       2
financial capacity       2007    26      24      17
financial capacity       2008    26      25      19
financial straits        1998    2       2       2
financial straits        1999    1       1       1
financial straits        2000    1       1       1
financial straits        2002    3       3       3
financial straits        2004    1       1       1
financial straits        2005    6       6       6
financial straits        2006    8       8       6
financial straits        2007    8       8       8
financial straits        2008    23      23      20

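What I want is a data frame that keeps only the sum of the counts (the column right after the year), regardless of the year. So I want something like this: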

financial analysis    130
financial capacity    75
financial straits     53

Below is my attempt. The raw data is spread across 100 files, so it starts with a for loop.

import pandas as pd
import zipfile

df_ngram = pd.DataFrame(columns=['ngram','count'])
### read files from 0 to 99, open csv in chunks to avoid memory error
for i in range(1):
    z = zipfile.ZipFile("googlebooks-eng-all-2gram-20090715-"+str(i)+".csv.zip")
    reader = pd.read_csv(z.open("googlebooks-eng-all-2gram-20090715-"+str(i)+".csv"), deli$

    ### iterate over chunks and aggregate data
    for chunk in reader:
            agg_chunk = chunk.groupby(['ngram'])['count'].sum()
            print agg_chunk.head(5)
            df_ngram.append(agg_chunk)
            print df_ngram.tail(5)

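I use groupby to aggregate each chunk and try to save the result in another data frame (df_ngram), but nothing seems to get appended at all. Below is the output when I run it. I am not sure how to handle the groupby result. How can I accumulate the groupby results? Or can I get what I want without using groupby?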

ngram
"warmongers           5339
"warns               55904
"warplanes            4939
"warranlo              181
"warrantizabimus       107
Name: count, dtype: int64
Empty DataFrame
Columns: [ngram, count]
Index: []
ngram
"wildbores          65
"wildebeest      12003
"wildlooking       318
"wilfnlness         52
"wilrde             79
Name: count, dtype: int64
Empty DataFrame
Columns: [ngram, count]
Index: []
ngram
"Évora           155
"Österreichs     507
"Übers           159
"échappent        84
"égal            537
Name: count, dtype: int64
Empty DataFrame
Columns: [ngram, count]
Index: []

Answer 1

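You need to append the intermediate results to a list and then concatenate them.

Before the for loop, add df_agg = list().

Instead of df_ngram.append(agg_chunk), you need df_agg.append(agg_chunk).

Finally, at the very end and outside of the loop, you need: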

# keep the ngram index, and sum again because an ngram can span several chunks
df_ngram = pd.concat(df_agg).groupby(level=0).sum()
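
For reference, here is a minimal sketch of the whole flow on a few rows taken from the question, read from an in-memory string instead of the zipped files. The column names (ngram, year, count, pages, books) and the tiny chunksize are assumptions made for illustration only. The per-chunk sums go into a plain list, are concatenated once at the end, and are then summed again per ngram, since the same ngram may show up in more than one chunk.

```python
import io

import pandas as pd

# Stand-in for one tab-separated ngram file (no header row).
# The column names used below are assumed for this illustration.
sample = (
    "financial analysis\t2006\t10\t10\t7\n"
    "financial analysis\t2007\t47\t37\t17\n"
    "financial analysis\t2008\t63\t54\t31\n"
    "financial capacity\t2007\t26\t24\t17\n"
    "financial capacity\t2008\t26\t25\t19\n"
    "financial straits\t2008\t23\t23\t20\n"
)

reader = pd.read_csv(
    io.StringIO(sample),
    sep="\t",
    header=None,
    names=["ngram", "year", "count", "pages", "books"],
    chunksize=4,  # small on purpose: "financial capacity" is split across two chunks
)

df_agg = []  # one partial-sum Series per chunk
for chunk in reader:
    df_agg.append(chunk.groupby("ngram")["count"].sum())

# Concatenate the partial sums while keeping the ngram labels as the index,
# then group on that index and sum again to merge ngrams split across chunks.
df_ngram = (
    pd.concat(df_agg)
    .groupby(level=0)
    .sum()
    .rename_axis("ngram")
    .reset_index(name="count")
)
print(df_ngram)
```

With the real data, the same list can simply be shared across the outer loop over the 100 files, so there is a single concat and a single final groupby at the very end.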
