How to repeat the same operation for subsets of a dataset

Time: 2015-08-04 14:19:28

Tags: python pandas

I have this pandas DataFrame:

from numpy import random
from pandas import DataFrame

data = DataFrame({'A' : ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'], 'B' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'], 'C' : random.randn(8), 'D' : random.randn(8)})

Out[84]: 
     A      B         C         D
0  foo    one  0.007861 -0.451943
1  bar    one -1.341386 -0.799740
2  foo    two -0.290606 -0.445757
3  bar  three  0.519251 -0.404406
4  foo    two -0.627547 -0.784901
5  bar    two  0.309421  0.234292
6  foo    one -2.156879  0.898375
7  foo  three -1.669896  0.498978

What I did was apply this function to count how many times each element of B is repeated:

data['Counts'] = data.groupby(['B'])['B'].transform('count')

This gives me:

Out[87]: 
     A      B         C         D  Counts
0  foo    one  0.007861 -0.451943       3
1  bar    one -1.341386 -0.799740       3
2  foo    two -0.290606 -0.445757       3
3  bar  three  0.519251 -0.404406       2
4  foo    two -0.627547 -0.784901       3
5  bar    two  0.309421  0.234292       3
6  foo    one -2.156879  0.898375       3
7  foo  three -1.669896  0.498978       2

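Then I created a new column as a boolean classifier, where 1 marks rows whose B value occurs at least twice and 0 marks rows whose B value is not repeated (there are no 0s in this example):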

# .loc is used here because .ix was removed in pandas 1.0
data.loc[data.Counts >= 2, 'Repeat'] = 1
data.loc[data.Counts <= 1, 'Repeat'] = 0

Out[89]: 
     A      B         C         D  Counts  Repeat
0  foo    one  0.007861 -0.451943       3       1
1  bar    one -1.341386 -0.799740       3       1
2  foo    two -0.290606 -0.445757       3       1
3  bar  three  0.519251 -0.404406       2       1
4  foo    two -0.627547 -0.784901       3       1
5  bar    two  0.309421  0.234292       3       1
6  foo    one -2.156879  0.898375       3       1
7  foo  three -1.669896  0.498978       2       1
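
The two steps above can be reproduced on a deterministic frame. This is only a sketch: it keeps the A and B columns from the question and drops the random C and D columns, which play no role in the counting.

```python
import pandas as pd

# Same A and B values as in the question; C and D are omitted because they are random.
data = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
})

# transform('count') broadcasts each group's count back onto every row of that group.
data['Counts'] = data.groupby('B')['B'].transform('count')

# Start from 0 and flag the rows whose B value occurs at least twice.
data['Repeat'] = 0
data.loc[data['Counts'] >= 2, 'Repeat'] = 1

print(data)
```

Initialising the column with 0 first keeps it an integer column, whereas assigning through two separate masks fills it in two passes.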

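What I would like to obtain is another count column that counts how many times an element of B is repeated among the rows that share the same value in A, and then, based on that count, classifies the rows with a boolean flag. The result would be: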

Out[89]: 
     A      B         C         D  Counts  Repeat CountsInsideA  RepeatInsideA
0  foo    one  0.007861 -0.451943       3       1             2              1
1  bar    one -1.341386 -0.799740       3       1             1              0
2  foo    two -0.290606 -0.445757       3       1             2              1
3  bar  three  0.519251 -0.404406       2       1             1              0
4  foo    two -0.627547 -0.784901       3       1             2              1
5  bar    two  0.309421  0.234292       3       1             1              0
6  foo    one -2.156879  0.898375       3       1             2              1
7  foo  three -1.669896  0.498978       2       1             1              0

3 Answers:

Answer 0 (score: 1)

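Check this out. First, you can build the Repeat column with np.where, which is more concise. Second, to count the repetitions of a specific A/B combination, we can use groupby and merge the result back into the original DataFrame: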

In [19]:

import numpy as np
import pandas as pd

data = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'], 
                     'B' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'], 
                     'C' : np.random.randn(8), 'D' : np.random.randn(8)})
In [20]:

data['Counts'] = data.groupby(['B'])['B'].transform('count')
print(data)
     A      B         C         D  Counts
0  foo    one -0.973299 -0.248367       3
1  bar    one  0.518526  0.987810       3
2  foo    two -0.031224  0.340774       3
3  bar  three -0.146824 -0.751124       2
4  foo    two -0.748681 -0.128536       3
5  bar    two  0.744051  0.604505       3
6  foo    one -0.513386  1.262674       3
7  foo  three  0.044814  0.810772       2
In [21]:

data['Repeat'] = np.where(data.Counts>1, 1, 0)
print(data)
     A      B         C         D  Counts  Repeat
0  foo    one -0.973299 -0.248367       3       1
1  bar    one  0.518526  0.987810       3       1
2  foo    two -0.031224  0.340774       3       1
3  bar  three -0.146824 -0.751124       2       1
4  foo    two -0.748681 -0.128536       3       1
5  bar    two  0.744051  0.604505       3       1
6  foo    one -0.513386  1.262674       3       1
7  foo  three  0.044814  0.810772       2       1
In [23]:

# Count rows per (A, B) pair, then attach the count to every matching row.
data = pd.merge(left=data,
                right=data.groupby(['A', 'B']).size().reset_index(name='CountsInsideA'),
                on=['A', 'B'],
                how='left')
print(data)
     A      B         C         D  Counts  Repeat  CountsInsideA
0  foo    one -0.973299 -0.248367       3       1              2
1  bar    one  0.518526  0.987810       3       1              1
2  foo    two -0.031224  0.340774       3       1              2
3  bar  three -0.146824 -0.751124       2       1              1
4  foo    two -0.748681 -0.128536       3       1              2
5  bar    two  0.744051  0.604505       3       1              1
6  foo    one -0.513386  1.262674       3       1              2
7  foo  three  0.044814  0.810772       2       1              1
In [25]:

data['RepeatInsideA'] = np.where(data.CountsInsideA>1, 1, 0)
print(data)
     A      B         C         D  Counts  Repeat  CountsInsideA  RepeatInsideA
0  foo    one -0.973299 -0.248367       3       1              2              1 
1  bar    one  0.518526  0.987810       3       1              1              0
2  foo    two -0.031224  0.340774       3       1              2              1
3  bar  three -0.146824 -0.751124       2       1              1              0
4  foo    two -0.748681 -0.128536       3       1              2              1
5  bar    two  0.744051  0.604505       3       1              1              0
6  foo    one -0.513386  1.262674       3       1              2              1
7  foo  three  0.044814  0.810772       2       1              1              0
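
The groupby-and-merge step can also be written with a safety check. The sketch below assumes a pandas version that supports the `validate` argument of `merge`; the names `pair_counts` and `merged` are illustrative and not part of the answer, and the random C and D columns are left out.

```python
import pandas as pd

data = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
})

# One row per (A, B) pair together with its number of occurrences.
pair_counts = data.groupby(['A', 'B']).size().reset_index(name='CountsInsideA')

# validate='many_to_one' makes pandas raise if (A, B) is not unique on the right,
# which guards against the merge silently duplicating rows.
merged = data.merge(pair_counts, on=['A', 'B'], how='left', validate='many_to_one')

print(merged)
```

A left merge keeps the rows of `data` in their original order, so the result lines up with the tables above.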

Answer 1 (score: 1)

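For the Repeat column, you can check whether data['Counts'] is greater than 1. The comparison returns True/False values, which you can convert to int so that they become 1 and 0 respectively. Example: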

In [20]: data['Repeat'] = (data['Counts'] > 1).astype(int)

In [21]: data
Out[21]:
     A      B         C         D  Counts  Repeat
0  foo    one -0.976018 -1.887011       3       1
1  bar    one -0.481183  2.937111       3       1
2  foo    two -0.702470 -0.328288       3       1
3  bar  three  0.579954 -2.742163       2       1
4  foo    two  2.125964 -0.689301       3       1
5  bar    two  0.699109 -0.380017       3       1
6  foo    one -1.667972  0.990599       3       1
7  foo  three -1.937627 -0.834636       2       1

For the CountsInsideA column, you can use the same logic as for Counts, except that you group by A as well as B. Example:

In [22]: data['CountsInsideA'] = data.groupby(['A','B'])['B'].transform('count')

In [23]: data
Out[23]:
     A      B         C         D  Counts  Repeat  CountsInsideA
0  foo    one -0.976018 -1.887011       3       1              2
1  bar    one -0.481183  2.937111       3       1              1
2  foo    two -0.702470 -0.328288       3       1              2
3  bar  three  0.579954 -2.742163       2       1              1
4  foo    two  2.125964 -0.689301       3       1              2
5  bar    two  0.699109 -0.380017       3       1              1
6  foo    one -1.667972  0.990599       3       1              2
7  foo  three -1.937627 -0.834636       2       1              1

For RepeatInsideA, again use logic similar to that for Repeat. Example:

In [24]: data['RepeatInsideA'] = (data['CountsInsideA'] > 1).astype(int)

In [25]: data
Out[25]:
     A      B         C         D  Counts  Repeat  CountsInsideA  \
0  foo    one -0.976018 -1.887011       3       1              2
1  bar    one -0.481183  2.937111       3       1              1
2  foo    two -0.702470 -0.328288       3       1              2
3  bar  three  0.579954 -2.742163       2       1              1
4  foo    two  2.125964 -0.689301       3       1              2
5  bar    two  0.699109 -0.380017       3       1              1
6  foo    one -1.667972  0.990599       3       1              2
7  foo  three -1.937627 -0.834636       2       1              1

   RepeatInsideA
0              1
1              0
2              1
3              0
4              1
5              0
6              1
7              0
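
Since both pairs of columns follow the same pattern, the steps above could be wrapped in a small function. This is a minimal sketch: `add_repeat_columns` and its parameters are hypothetical names introduced here for illustration, not part of pandas or of the answer.

```python
import pandas as pd

def add_repeat_columns(df, keys, count_col, flag_col):
    """Hypothetical helper: add a per-group count column and a 0/1 repeat flag."""
    out = df.copy()
    # Count the rows in each group defined by `keys` and broadcast the count to each row.
    out[count_col] = out.groupby(keys)[keys[-1]].transform('count')
    # A row is a repeat when its group holds more than one row.
    out[flag_col] = (out[count_col] > 1).astype(int)
    return out

data = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
})

data = add_repeat_columns(data, ['B'], 'Counts', 'Repeat')
data = add_repeat_columns(data, ['A', 'B'], 'CountsInsideA', 'RepeatInsideA')
print(data)
```

The only thing that changes between the two calls is the list of grouping keys, which is what lets the same operation be repeated for any subset definition.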

Answer 2 (score: 0)

You can do this easily:

In [57]: data['CountsInsideA'] = data.groupby(['A', 'B'])['C'].transform('count')

In [58]: data['RepeatInsideA'] = np.where(data['CountsInsideA'] > 1, 1, 0)
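
One caveat with counting through column C: `count` only counts non-null values, so a missing value in C would lower the result. The sketch below uses a made-up three-row frame (not the data from the question) to contrast it with `size`, which counts rows regardless of missing values.

```python
import numpy as np
import pandas as pd

# Illustrative data: the second row has a missing value in C.
df = pd.DataFrame({
    'A': ['foo', 'foo', 'bar'],
    'B': ['one', 'one', 'one'],
    'C': [1.0, np.nan, 2.0],
})

grouped = df.groupby(['A', 'B'])['C']

# 'count' ignores the NaN, so the (foo, one) group reports 1 instead of 2.
df['NonNullC'] = grouped.transform('count')

# 'size' counts every row in the group, including the one with NaN.
df['Rows'] = grouped.transform('size')

print(df)
```

With the data in the question C has no missing values, so both give the same counts there.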