Pandas: how to sum specific rows

Time: 2017-06-14 12:24:26

Tags: python-3.x pandas

Suppose I have a Pandas DataFrame like the following:

category  sentences
Data1     String1
NaN       String2
NaN       String3
Data2     String1
NaN       String4
Data2     String1
NaN       String6
NaN       String7
Data3     String1
NaN       String8
NaN       String9

I want to convert it to:

category  sentences
Data1     String1 String2 String3
Data2     String1 String4
Data2     String1 String6 String7
Data3     String1 String8 String9

As the headers suggest, the right column holds the sentences of a complete conversation and the left column holds their categories. All I am trying to do is take the rows whose category is NaN and append their sentences to the preceding row, until the next non-NaN category is reached.

So far this has been a failure: I have tried several different things and still have no solution. How can I do this?

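Another problem: I take my DataFrame (let's call it df), select the first 3 rows, and sum them using df[0:3].sum(), which returns Series([], dtype: float64). If I add .sum(axis=1) at the end, I get zero for every row. I tried iloc and it also returns Series([], dtype: float64). I also tried adding .sum(axis=0), with the same result. So can anyone tell me what I am doing wrong and what I should do instead?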

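TL;DR: I want to concatenate the strings from each row with a non-NaN category through the NaN rows that follow it, stopping before the next non-NaN category. Is that possible, and if so, how?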

Just a small note: sorry for the formatting. I still can't get used to it...

3 Answers:

Answer 0: (score: 1)

Not optimal, not pythonic, and ugly! But it gets the job done:

import pandas as pd

old_table = pd.read_csv('your_table.csv')
new_table = pd.DataFrame([], columns=['category', 'sentences'])

for ID, row in old_table.iterrows():
    if not pd.isnull(row['category']):
        # A non-null category starts a new group holding a one-element list.
        new_table.loc[len(new_table)] = [row['category'], [row['sentences']]]
    else:
        # A NaN category continues the previous group: append to its list in place.
        # (Assigning through new_table.loc[...]['sentences'] is chained indexing
        # and may silently write to a copy.)
        new_table.at[len(new_table) - 1, 'sentences'].append(row['sentences'])

print(old_table, '\n====\n', new_table)

It gives:

  category sentences
0      One     hello
1      NaN        my
2      NaN    little
3      NaN    friend
4      Two     hello
5      NaN        to
6      NaN       you
7      NaN       too 
====
   category                    sentences
0      One  [hello, my, little, friend]
1      Two        [hello, to, you, too]
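The same grouping can be sketched without growing a DataFrame row by row. The version below is only an illustration: it builds the sample table inline instead of reading your_table.csv, collects each group in a plain Python list (the name rows is hypothetical), and assumes that the first row always has a non-NaN category.

```python
import numpy as np
import pandas as pd

# Sample data mirroring the table printed above (built inline, not read from a CSV).
old_table = pd.DataFrame({
    "category": ["One", np.nan, np.nan, np.nan, "Two", np.nan, np.nan, np.nan],
    "sentences": ["hello", "my", "little", "friend", "hello", "to", "you", "too"],
})

rows = []  # one [category, list_of_sentences] pair per group
for category, sentence in zip(old_table["category"], old_table["sentences"]):
    if pd.notna(category):
        # A non-NaN category opens a new group.
        rows.append([category, [sentence]])
    else:
        # A NaN category extends the most recent group
        # (assumes the first row is never NaN).
        rows[-1][1].append(sentence)

new_table = pd.DataFrame(rows, columns=["category", "sentences"])
print(new_table)
```

Building the result once at the end avoids repeated enlargement of new_table inside the loop, which tends to get slow on larger inputs.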

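Answer 1: (score: 1)

Create a temporary ID column to use as a group key together with the category column, then join the sentences of each group.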

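df = df.copy()
# ID holds the row index where category is present and NaN elsewhere
df['ID'] = df.index.to_series()[df.category.notnull()]
# forward-fill category and ID, then join the sentences of each (ID, category) group
df.ffill()\
  .groupby(['ID','category'])['sentences']\
  .apply(lambda x: ' '.join(x))\
  .reset_index()\
  .drop(columns='ID')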
Out[59]: 
  category                sentences
0    Data1  String1 String2 String3
1    Data2          String1 String4
2    Data2  String1 String6 String7
3    Data3  String1 String8 String9
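As a variation on the same idea, the group key can also be derived with a cumulative count of the non-null categories instead of a temporary column. This is a sketch rather than the answer's own method: the sample frame is rebuilt from the question, and the names group_id and result are assumptions introduced for illustration.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "category": ["Data1", np.nan, np.nan, "Data2", np.nan,
                 "Data2", np.nan, np.nan, "Data3", np.nan, np.nan],
    "sentences": ["String1", "String2", "String3", "String1", "String4",
                  "String1", "String6", "String7", "String1", "String8", "String9"],
})

# Each non-null category bumps the counter, so every row gets the
# number of the group it belongs to: 1, 1, 1, 2, 2, 3, ...
group_id = df["category"].notna().cumsum().rename("group")

result = (
    df.groupby(group_id)
      .agg(category=("category", "first"),      # first non-null category in the group
           sentences=("sentences", " ".join))   # sentences joined with spaces
      .reset_index(drop=True)
)
print(result)
```

Because the key counts group starts rather than comparing category values, the two consecutive Data2 groups stay separate.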

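Answer 2: (score: 0)

Create a Series of group labels with where, arange and ffill (fillna with method='ffill'): rows where category is not null get their position from arange, and ffill carries that label over the NaN rows that follow:

s = df['category'].where(df['category'].isnull(), np.arange(len(df.index))).ffill()

The resulting s:

0     0
1     0
2     0
3     3
4     3
5     5
6     5
7     5
8     8
9     8
10    8
Name: category, dtype: int64

Then groupby by s and aggregate with agg.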
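The groupby and agg step this answer describes might look like the sketch below. It is an assumption, not the answer's original code: the sample frame is rebuilt from the question, the label Series is built from the row positions (so the labels come out as floats rather than the integers shown), and the names positions and out are hypothetical.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "category": ["Data1", np.nan, np.nan, "Data2", np.nan,
                 "Data2", np.nan, np.nan, "Data3", np.nan, np.nan],
    "sentences": ["String1", "String2", "String3", "String1", "String4",
                  "String1", "String6", "String7", "String1", "String8", "String9"],
})

# Keep the row position where category is present, NaN elsewhere,
# then forward-fill so every row carries the position of its group start.
positions = pd.Series(np.arange(len(df.index)), index=df.index)
s = positions.where(df["category"].notnull()).ffill()

# Group on the labels: keep the first category, join the sentences.
out = (
    df.groupby(s)
      .agg({"category": "first", "sentences": " ".join})
      .reset_index(drop=True)
)
print(out)
```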