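I want to create a column in a pandas DataFrame that adds up the values of the other columns (which are 0s or 1s). The column is called "sum".
The head of my DataFrame looks like: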
Application AnsSr sum Col1 Col2 Col3 .... Col(n-2) Col(n-1) Col(n)
date 28-12-11 0.0 0.0 28/12/11 .... ...Dates... 28/12/11
~00c 0 0.0 0.0 0 0 0 .... 0 0 0
~00pr 0 0.0 0.0 0 0 0 .... 0 0 0
~00te 0 0.0 0.0 0 0 1 .... 0 0 1
Expected result (assuming there are no more columns):
Application AnsSr sum Col1 Col2 Col3 .... Col(n-2) Col(n-1) Col(n)
date 28-12-11 0.0 nan 28/12/11 .... ...Dates... 28/12/11
~00c 0 0.0 0.0 0 0 0 .... 0 0 0
~00pr 0 0.0 0.0 0 0 0 .... 0 0 0
~00te 0 0.0 2 0 0 1 .... 0 0 1
As you can see, the value of "sum" stays at 0 even though some of the columns contain 1s. What am I doing wrong?
The basis of the code is:
theMatrix=pd.DataFrame([datetime.today().strftime('%Y-%m-%d')],['Date'],['Application'])
theMatrix['Ans'] = 0
theMatrix['sum'] = 0
So far so good. Then I add all the values with loc. After that I want to compute the sum with:
theMatrix.fillna(0, inplace=True)
# this being the key line:
theMatrix['sum'] = theMatrix.sum(axis=1)
theMatrix.sort_index(axis=0, ascending=True, inplace=True)
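As you can see in the result (attached image), the sum is still 0. I have looked here or here, as well as at the pandas documentation, to no avail. In fact, the expression: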
theMatrix['sum'] = theMatrix.sum(axis=1)
is something I took from there.
Changing the last line to:
theMatrix['sum'] = theMatrix[3:0].sum(axis=1)
in order to avoid summing the first three columns, gives this result:
Application AnsSr sum Col1 Col2 Col3 .... Col(n-2) Col(n-1) Col(n)
date 28-12-11 0.0 nan 28/12/11 .... ...Dates... 28/12/11
~00c 0 0.0 nan 1 1 0 .... 0 0 0
~00pr 0 0.0 1.0 0 0 0 .... 0 0 1
~00te 0 0.0 0 0 0 0 .... 0 0 0
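Please note two things: a) the sum for row '~00c' is nan, yet that row contains 1s. b) Before the sum is computed, the line theMatrix.fillna(0, inplace=True) should have changed every possible nan to 0, so the sum should never be nan, because in theory there are no nan values in any of the columns [3:].
That did not work.
Any ideas?
Thanks
PS: Later edit, in case you are wondering how the DataFrame is populated: an XML file is read and parsed, with lines like these: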
# myDocId being the name of the columns
# concept being the index.
theMatrix.loc[concept,myDocId]=1
Answer 0 (score: 1)
If I understand you correctly, this may help:
import pandas as pd
import datetime
#create dataframe following your example
theMatrix=pd.DataFrame([datetime.datetime.today().strftime('%Y-%m-%d')],['Date'],['Application'])
theMatrix['Ans'] = 0
theMatrix['col1'] = 1
theMatrix['col2'] = 1
# create 'sum' column with summed values from certain columns
theMatrix['sum'] = theMatrix['col1'] + theMatrix['col2']
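When there are many value columns, writing out each addition becomes unwieldy. The following is only a sketch, assuming the 0/1 columns all sit after the bookkeeping columns; the small frame and the position 2 are made up for illustration and are not the asker's real data. It also shows that an integer slice inside square brackets, such as the `theMatrix[3:0]` from the question, selects rows rather than columns, whereas `iloc[:, 2:]` selects columns by position.

```python
import pandas as pd

# Hypothetical stand-in for theMatrix: two bookkeeping columns, then 0/1 flags.
theMatrix = pd.DataFrame(
    {
        "Ans": [0, 0, 0],
        "sum": [0, 0, 0],
        "Col1": [0, 0, 0],
        "Col2": [0, 0, 1],
        "Col3": [0, 0, 1],
    },
    index=["~00c", "~00pr", "~00te"],
)

# An integer slice inside [] selects rows by position, and 3:0 is an empty range.
empty_rows = theMatrix[3:0]

# iloc[:, 2:] selects every column from position 2 onward, for all rows.
value_cols = theMatrix.iloc[:, 2:]
theMatrix["sum"] = value_cols.sum(axis=1)
print(theMatrix)
```

Selecting the columns first and only then calling `sum(axis=1)` keeps the existing 'sum' column, and any non-numeric columns, out of the total.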
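Answer 1 (score: 1)
Whatever data you choose to sum, just put those column names in a list and pass that selection to the sum function (axis=1). This will give you the result you want. Here is an example based on your data.
Sample file data: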
Date,a,b,c
bad, bad, bad, bad # Used to simulate your data better
2018-11-19,1,0,0
2018-11-20,1,0,0
2018-11-21,1,0,1
2018-11-23,1,nan,0 # Nan here is just to represent the missing data
2018-11-28,1,0,1
2018-11-30,1,nan,1 # Nan here is just to represent the missing data
2018-12-02,1,0,1
Code:
import pandas as pd
df = pd.read_csv(yourdata.filename) # Your method of loading the data
#rows_to_sum = ['a','b','c'] # The columns you wish to sum
rows_to_sum = df.columns[1:] # Alternate method: select every column after the first.
df = df.fillna(0) # Fills the NaN values you mentioned.
df['sum'] = df[rows_to_sum][1:].astype(int).sum(axis=1) # [1:] skips the leading bad row; adjust the count to your data.
# astype(int) is needed because of the bad data read at the top; casting here lets the values be summed properly.
print(df)
Output:
Date a b c sum
bad bad bad bad NaN
2018-11-19 1 0 0 1.0
2018-11-20 1 0 0 1.0
2018-11-21 1 0 1 2.0
2018-11-23 1 0 0 1.0
2018-11-28 1 0 1 2.0
2018-11-30 1 0 1 2.0
2018-12-02 1 0 1 2.0
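The same idea can be tried without a file on disk. The sketch below is a variation rather than the code above: it loads the sample data from an inline string (with the trailing comments removed) and swaps `fillna` plus `astype(int)` for `pd.to_numeric(..., errors="coerce")`, which turns non-numeric text such as 'bad' into NaN instead of requiring that row to be skipped. The name `cols_to_sum` is introduced here for illustration.

```python
import io
import pandas as pd

# Inline copy of the sample data (comments stripped) so the sketch is self-contained.
csv_text = """Date,a,b,c
bad,bad,bad,bad
2018-11-19,1,0,0
2018-11-20,1,0,0
2018-11-21,1,0,1
2018-11-23,1,nan,0
2018-11-28,1,0,1
2018-11-30,1,nan,1
2018-12-02,1,0,1
"""
df = pd.read_csv(io.StringIO(csv_text))

cols_to_sum = df.columns[1:]  # every column except Date

# Coerce each column to numbers; non-numeric text such as 'bad' becomes NaN.
numeric = df[cols_to_sum].apply(pd.to_numeric, errors="coerce")

# sum() skips NaN by default, so missing cells contribute nothing to the total.
df["sum"] = numeric.sum(axis=1)
print(df)
```

With this variation the 'bad' row gets a sum of 0 rather than NaN, because every cell in it is coerced to NaN and then skipped; drop that row first if it should not appear in the result.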