I have a DataFrame like this:
A B C
a d '1.1'
a d ' 2 '
a e '1'
a e ' 3 '
c f '3.2 '
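What I need is to sum the values in column C while grouping by A and B. However, the values are strings rather than floats, and some of them have surrounding spaces while others do not.
I need the DataFrame to end up like this: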
A B C
a d 1.1+2
a e 1+3
c f 3.2
What I tried was:
df.groupby(['A','B']).sum()
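However, since the values are strings, this just concatenates them instead of actually summing them. After that I tried converting them to float, but the spaces would not let me. Finally, I tried stripping the strings, but it says it cannot operate on some elements because they are integers (??). I suspect those are the values without spaces.
Note: the values are shown with "+" for clarity, but the results I need are 3.1, 4 and 3.2.
The actual CSV I have looks like this: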
DL_INSTITUCION,PERIODO_QUE_SE_REPORTA, RESPONSABILIDAD_TOTAL
Santander,201412,"92,467"
Banca Mifel,201412," 39,089 "
Banca Mifel,201412," 28,286 "
Banca Mifel,201412," 310,902 "
CIBanco,201412," 10,106 "
CIBanco,201412," 46,872 "
Banorte/Ixe,201412," 3,127,120 "
CIBanco,201412," 10,163 "
Santander,201412," 545,027 "
Banca Mifel,201412," 10,291 "
Banca Mifel,201412," 80,738 "
Banca Mifel,201412," 46,329 "
HSBC,201412," 583,274 "
CIBanco,201412," 24,094 "
Although the real file has 28 million rows.
Answer 0 (score: 4)
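Simple pd.to_numeric
The advantage of this solution is the simplicity and efficiency of pd.to_numeric.
It works because pd.to_numeric, when passed a pd.Series, returns a pd.Series with the same index. That makes it easy to pass the result straight to groupby.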
pd.to_numeric(df.C).groupby([df.A, df.B]).sum()
A B
a d 3.1
e 4.0
c f 3.2
Name: C, dtype: float64
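As a minimal sketch of why the index matters here, the snippet below rebuilds the toy data from the question and checks that the converted Series still lines up with the original columns. The variable names numeric_c and totals are chosen for illustration only.

```python
import pandas as pd

# Toy data from the question: column C holds strings, some padded with spaces.
df = pd.DataFrame({
    'A': list('aaaac'),
    'B': list('ddeef'),
    'C': ['1.1', ' 2 ', '1', ' 3 ', '3.2 '],
})

# to_numeric returns a new Series that keeps df's index,
# so it can be grouped by the original A and B columns.
numeric_c = pd.to_numeric(df['C'])
totals = numeric_c.groupby([df['A'], df['B']]).sum()
print(totals)
```

Because the grouping keys are passed as Series rather than column names, nothing in df itself has to be modified.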
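errors='coerce'
An added benefit is that if we need to handle strings that cannot be parsed as float, we can pass the parameter errors='coerce'. This coerces unparseable strings to np.nan and still allows a useful aggregation.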
pd.to_numeric(df.C, errors='coerce').groupby([df.A, df.B]).sum()
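To illustrate the effect of errors='coerce', here is a small sketch. The entry 'abc' is an invented unparseable value and does not come from the question's data.

```python
import pandas as pd

# Hypothetical data: 'abc' stands in for any string that is not a number.
df = pd.DataFrame({
    'A': ['a', 'a', 'a'],
    'B': ['d', 'd', 'e'],
    'C': ['1.1', 'abc', ' 2 '],
})

# With errors='coerce', 'abc' becomes NaN instead of raising an error.
coerced = pd.to_numeric(df['C'], errors='coerce')

# sum() skips NaN by default, so group (a, d) still totals 1.1.
totals = coerced.groupby([df['A'], df['B']]).sum()
print(totals)
```

One thing to keep in mind is that coercion hides bad input silently, so it may be worth counting the NaN values with coerced.isna().sum() before aggregating.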
Handling commas
pd.to_numeric(df.C.str.replace(',', ''), errors='coerce').groupby([df.A, df.B]).sum()
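The comma handling can be sketched on three rows taken from the question's CSV. The short column names A, B and C are used here only to match the one-liner above, and regex=False is an explicit choice made for this sketch so the comma is treated as a literal character.

```python
import pandas as pd

# Three rows from the question's CSV, with shortened column names.
df = pd.DataFrame({
    'A': ['Santander', 'Banca Mifel', 'Banca Mifel'],
    'B': [201412, 201412, 201412],
    'C': ['92,467', ' 39,089 ', ' 28,286 '],
})

# Drop the thousands separators, then convert the strings to numbers.
no_commas = df['C'].str.replace(',', '', regex=False)
amounts = pd.to_numeric(no_commas, errors='coerce')

totals = amounts.groupby([df['A'], df['B']]).sum()
print(totals)
```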
Setup
df = pd.DataFrame(dict(
A=list('aaaac'),
B=list('ddeef'),
C='1.1| 2 |1| 3 |3.2 '.split('|')
))
You can also use pd.read_csv and let it handle the thousands separators:
from io import StringIO
import pandas as pd
txt = """DL_INSTITUCION,PERIODO_QUE_SE_REPORTA, RESPONSABILIDAD_TOTAL
Santander,201412,"92,467"
Banca Mifel,201412," 39,089 "
Banca Mifel,201412," 28,286 "
Banca Mifel,201412," 310,902 "
CIBanco,201412," 10,106 "
CIBanco,201412," 46,872 "
Banorte/Ixe,201412," 3,127,120 "
CIBanco,201412," 10,163 "
Santander,201412," 545,027 "
Banca Mifel,201412," 10,291 "
Banca Mifel,201412," 80,738 "
Banca Mifel,201412," 46,329 "
HSBC,201412," 583,274 "
CIBanco,201412," 24,094 "
"""
df = pd.read_csv(StringIO(txt), skipinitialspace=True, thousands=',')
Notice that the dtypes have now been inferred correctly:
df.dtypes
DL_INSTITUCION object
PERIODO_QUE_SE_REPORTA int64
RESPONSABILIDAD_TOTAL int64
dtype: object
We can now aggregate without any problem.
df.groupby(['DL_INSTITUCION', 'PERIODO_QUE_SE_REPORTA']).sum()
RESPONSABILIDAD_TOTAL
DL_INSTITUCION PERIODO_QUE_SE_REPORTA
Banca Mifel 201412 515635
Banorte/Ixe 201412 3127120
CIBanco 201412 91235
HSBC 201412 583274
Santander 201412 637494
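Since the question mentions roughly 28 million rows, one possible variation is to read the file in chunks and combine partial sums. This is only a sketch under that assumption, not part of the original answer: the chunksize of 2 is deliberately tiny for demonstration, and the names keys, partials and totals are illustrative.

```python
from io import StringIO
import pandas as pd

# A few rows from the question's CSV, standing in for the large file.
txt = """DL_INSTITUCION,PERIODO_QUE_SE_REPORTA, RESPONSABILIDAD_TOTAL
Santander,201412,"92,467"
Banca Mifel,201412," 39,089 "
CIBanco,201412," 10,106 "
Santander,201412," 545,027 "
Banca Mifel,201412," 28,286 "
"""

keys = ['DL_INSTITUCION', 'PERIODO_QUE_SE_REPORTA']
partials = []

# With chunksize set, read_csv yields DataFrames of at most that many rows.
reader = pd.read_csv(StringIO(txt), skipinitialspace=True,
                     thousands=',', chunksize=2)
for chunk in reader:
    partials.append(chunk.groupby(keys).sum())

# A group can appear in several chunks, so sum the partial sums again.
totals = pd.concat(partials).groupby(level=keys).sum()
print(totals)
```

For a real file the StringIO object would be replaced by the file path, and a much larger chunksize would be more appropriate.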
Answer 1 (score: 2)
It depends on your goal:
In [65]: x.groupby(['A','B'])['C'].apply(lambda c: c.str.strip().str.cat(sep='+')).reset_index()
Out[65]:
A B C
0 a d 1.1+2
1 a e 1+3
2 c f 3.2
Or evaluate the sum:
In [64]: x.groupby(['A','B'])['C'].apply(lambda c: pd.eval(c.str.cat(sep='+'))).reset_index()
Out[64]:
A B C
0 a d 3.1
1 a e 4.0
2 c f 3.2
Answer 2 (score: 2)
Edit: first replace the commas in column C
df.C = df.C.str.replace(',', '')
df.C = df.C.astype(float)  # np.float was removed from recent NumPy releases; the builtin float is equivalent
df.groupby(['A','B']).C.sum().reset_index()
I changed the value in the last row to '1,994,102'. You get:
A B C
0 a d 3.1
1 a e 4.0
2 c f 1994102.0
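For reference, the steps in this answer can be put together as a self-contained sketch. The extra .str.strip() call is a defensive addition made for this sketch and is not part of the original answer; the data reuses the question's toy values with the last row changed as described above.

```python
import pandas as pd

# Toy data from the question, with the last value changed to '1,994,102'.
df = pd.DataFrame({
    'A': list('aaaac'),
    'B': list('ddeef'),
    'C': ['1.1', ' 2 ', '1', ' 3 ', '1,994,102'],
})

# Remove thousands separators and surrounding spaces, then convert to float.
cleaned = df['C'].str.replace(',', '', regex=False).str.strip()
df['C'] = cleaned.astype(float)

result = df.groupby(['A', 'B'])['C'].sum().reset_index()
print(result)
```

Unlike pd.to_numeric with errors='coerce', astype(float) raises an error on any value it cannot parse, which can be preferable when bad input should not go unnoticed.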