Question

With a dataset like this

df = pd.DataFrame(np.random.randint(0,5,size=(20, 3)), columns=['user_id','module_id','week'])

we often see this pattern:

df.groupby(['user_id'])['module_id'].count().to_frame().reset_index().rename({'module_id':'count'}, axis='columns')

But we get exactly the same result from

df.groupby(['user_id'])['module_id'].count().reset_index(name='count')

(Note that the former needs the additional rename because reset_index on a Series accepts a name parameter and returns a DataFrame, whereas reset_index on a DataFrame does not accept a name parameter.)
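The difference in the two signatures can be checked directly. The following is a minimal sketch, not part of the original post; the seeded generator and the variable names (`counts`, `via_series`, `dataframe_accepts_name`) are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Seeded generator so the sketch is reproducible (an assumption; the post uses np.random.randint).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 5, size=(20, 3)),
                  columns=['user_id', 'module_id', 'week'])

# groupby(...).count() on a single column yields a Series indexed by user_id.
counts = df.groupby(['user_id'])['module_id'].count()

# Series.reset_index accepts `name` and returns a DataFrame.
via_series = counts.reset_index(name='count')
print(list(via_series.columns))

# DataFrame.reset_index has no `name` parameter, so the call is rejected with TypeError.
try:
    counts.to_frame().reset_index(name='count')
    dataframe_accepts_name = True
except TypeError:
    dataframe_accepts_name = False
print(dataframe_accepts_name)
```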

Is there any advantage to using to_frame first?

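(I wondered whether this might be an artifact of earlier versions of pandas, but that seems unlikely:

Series.reset_index was added in this commit on 27 January 2012.
Series.to_frame was added in this commit on 13 October 2013.

So Series.reset_index was available for over a year before Series.to_frame.)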

Answer 1

There is no noticeable advantage to using to_frame(). Both approaches produce the same result, and in pandas it is common to have several ways of solving the same problem. The only advantage I can think of is that, for larger datasets, it may be more convenient to have a DataFrame view before resetting the index. Taking your DataFrame as an example, to_frame() displays a DataFrame view, which can be useful for reading the data as a tidy table rather than as the Series returned by count(). Using to_frame() also makes the intent clearer to someone reading your code for the first time.

Example DataFrame:

In [7]: df = pd.DataFrame(np.random.randint(0,5,size=(20, 3)), columns=['user_id','module_id','week'])

In [8]: df.head()
Out[8]:
   user_id  module_id  week
0        3          4     4
1        1          3     4
2        1          2     2
3        1          3     4
4        1          2     2

The count() function returns a Series:

In [18]: test1 = df.groupby(['user_id'])['module_id'].count()

In [19]: type(test1)
Out[19]: pandas.core.series.Series

In [20]: test1
Out[20]:
user_id
0    2
1    7
2    4
3    6
4    1
Name: module_id, dtype: int64

In [21]: test1.index
Out[21]: Int64Index([0, 1, 2, 3, 4], dtype='int64', name='user_id')

Using to_frame makes it explicit that you intend to convert the Series into a DataFrame. The index here is user_id:

In [22]: test1.to_frame()
Out[22]:
         module_id
user_id
0                2
1                7
2                4
3                6
4                1

Now we reset the index and rename the column with DataFrame.rename. As you correctly pointed out, DataFrame.reset_index() has no name parameter, so we have to rename the column explicitly.

In [24]: testdf1 = test1.to_frame().reset_index().rename({'module_id':'count'}, axis='columns')

In [25]: testdf1
Out[25]:
   user_id  count
0        0      2
1        1      7
2        2      4
3        3      6
4        4      1

Now let's look at the other case. We use the same count() Series as test1 but call it test2 to tell the two approaches apart. In other words, test1 is equal to test2.

In [26]: test2 = df.groupby(['user_id'])['module_id'].count()

In [27]: test2
Out[27]:
user_id
0    2
1    7
2    4
3    6
4    1
Name: module_id, dtype: int64

In [28]: test2.reset_index()
Out[28]:
   user_id  module_id
0        0          2
1        1          7
2        2          4
3        3          6
4        4          1

In [30]: testdf2 = test2.reset_index(name='count')

In [31]: testdf1 == testdf2
Out[31]:
   user_id  count
0     True   True
1     True   True
2     True   True
3     True   True
4     True   True

You can see that the two DataFrames are equivalent. In the second approach we only needed reset_index(name='count') to both reset the index and rename the column, because Series.reset_index() does have a name parameter.

The second approach takes less code but is less readable for newcomers. I prefer the first approach, using to_frame(), because it makes the intent explicit: "convert this count Series to a DataFrame and rename the column 'module_id' to 'count'".
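The equivalence of the two approaches can also be checked programmatically instead of by eye. This is a minimal sketch under stated assumptions: the data comes from a seeded generator (so the values differ from the session in this answer), and `frames_match` is a name introduced only for illustration.

```python
import numpy as np
import pandas as pd

# Seeded random data; an assumption so that the sketch is reproducible.
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.integers(0, 5, size=(20, 3)),
                  columns=['user_id', 'module_id', 'week'])

counts = df.groupby(['user_id'])['module_id'].count()

# Approach 1: convert to a DataFrame, reset the index, then rename the column.
testdf1 = counts.to_frame().reset_index().rename({'module_id': 'count'}, axis='columns')

# Approach 2: let Series.reset_index name the value column directly.
testdf2 = counts.reset_index(name='count')

# assert_frame_equal raises AssertionError if values, dtypes, index, or columns differ.
pd.testing.assert_frame_equal(testdf1, testdf2)

# DataFrame.equals gives a single boolean instead of an element-wise frame.
frames_match = testdf1.equals(testdf2)
print(frames_match)
```

A single boolean (or an assertion) is easier to use in a test than the element-wise result of `testdf1 == testdf2`.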

Why use to_frame before reset_index?
