Question

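I'm at a very basic level with Python and I'm stuck on a problem; can anyone help? I have a pandas DataFrame, and I want to find rows and take their mean when the first column of each row holds similar values (for example, some integer separated by '_' from another integer).

I tried using .split to match the first element of the resulting list. It works for a single row, but it throws an error when I iterate over the rows. My DataFrame looks like this: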

import pandas as pd

d = {'ID' : pd.Series(['1_1', '2_1', '1_2', '2_2' ], index=['0','1','2', '3']),
     'one' : pd.Series([2.5, 2, 3.5, 2.5], index=['0','1', '2', '3']),
     'two' : pd.Series([1, 2, 3, 4], index=['0', '1', '2', '3'])}
df2 = pd.DataFrame(d)

Requirement:

The mean of the rows whose IDs share the same first part after splitting, e.g. the mean of 1_1 and 1_2, and of 2_1 and 2_2.

Output:

 ID  one  two
0  1  3    2
1  2  2.25 3

Here is my code. Working version: ((df2.ix[0,0]).split('_'))[0]

Failing version:

 for i in df2.iterrows():
                   df2[df2.columns[((df2.ix[0,0]).split('_'))[0] == ((df2.ix[0,0]).split('_'))[0]]]

Looking forward to an early reply. Thanks in advance.

Answer 1

You can use the [str methods](http://pandas.pydata.org/pandas-docs/stable/text.html#splitting-and-replacing-strings) to create a new column from the first part of the `ID` column, and then use the `groupby` method:

df['groupedID'] = df.ID.str.split('_').str.get(0)

In [347]: df
Out[347]:
     ID  one  two groupedID
0  10_1  2.5    1        10
1   2_1  2.0    2         2
2  10_2  3.5    3        10
3   2_2  2.5    4         2
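To see what the `.str` chain does on its own, here is a minimal sketch on a standalone Series. The variable names (`ids`, `parts`, `prefix`, `prefix_alt`) are only illustrative and do not come from the original answer.

```python
import pandas as pd

ids = pd.Series(['10_1', '2_1', '10_2', '2_2'])

# Each element becomes a list of strings, e.g. '10_1' -> ['10', '1']
parts = ids.str.split('_')

# Take the first item of each list; the values are still strings
prefix = parts.str.get(0)

# Indexing through .str is an equivalent spelling of .str.get(0)
prefix_alt = ids.str.split('_').str[0]

print(prefix.tolist())  # ['10', '2', '10', '2']
```

Because the work is done column-wise, there is no need to loop with `iterrows()` as in the question.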

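# Select the numeric columns explicitly; recent pandas versions raise an
# error if mean() is asked to average the string ID column
df1 = df.groupby('groupedID')[['one', 'two']].mean()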

In [349]: df1
Out[349]:
            one  two
groupedID
10         3.00    2
2          2.25    3
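Note that the group keys here are strings, so `10` sorts before `2`. If numeric order matters, one possible variation (a sketch, not part of the original answer) is to convert the prefix to integers and group by that Series directly, without adding a helper column. This assumes every ID prefix is a valid integer.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['10_1', '2_1', '10_2', '2_2'],
    'one': [2.5, 2.0, 3.5, 2.5],
    'two': [1, 2, 3, 4],
})

# Build the key as integers so that 2 sorts before 10.
# The derived Series keeps the name 'ID', which becomes the index name.
key = df['ID'].str.split('_').str.get(0).astype(int)

# Group by the key Series itself and average only the numeric columns
means = df.groupby(key)[['one', 'two']].mean()
print(means)
```

The resulting index holds the integers 2 and 10, in that order.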

If you need to change the index name back to 'ID':

df1.index.name = 'ID'

In [351]: df1
Out[351]:
     one  two
ID
10   3.00    2
2   2.25    3
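For the layout asked for in the question, with `ID` as an ordinary column and a fresh 0..n index, the same idea can be applied to the question's own data and followed by `reset_index()`. This is a sketch under the assumption that only the `one` and `two` columns should be averaged; `prefix` and `result` are illustrative names.

```python
import pandas as pd

df2 = pd.DataFrame({
    'ID': ['1_1', '2_1', '1_2', '2_2'],
    'one': [2.5, 2, 3.5, 2.5],
    'two': [1, 2, 3, 4],
})

# Part of each ID before the underscore ('1' or '2'), kept as strings
prefix = df2['ID'].str.split('_').str.get(0)

# Average the numeric columns per prefix, then turn the group key
# back into a regular 'ID' column
result = df2.groupby(prefix)[['one', 'two']].mean().reset_index()
print(result)
```

Here `result` has two rows: ID `1` with means 3.0 and 2.0, and ID `2` with means 2.25 and 3.0, matching the expected output in the question.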

Finding the mean of grouped rows of a pandas DataFrame