I have this pandas DataFrame:
data = DataFrame({'A' : ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'], 'B' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'], 'C' : random.randn(8), 'D' : random.randn(8)})
Out[84]:
A B C D
0 foo one 0.007861 -0.451943
1 bar one -1.341386 -0.799740
2 foo two -0.290606 -0.445757
3 bar three 0.519251 -0.404406
4 foo two -0.627547 -0.784901
5 bar two 0.309421 0.234292
6 foo one -2.156879 0.898375
7 foo three -1.669896 0.498978
What I did was apply this function to get the number of repeated elements in B:
data['Counts'] = data.groupby(['B'])['B'].transform('count')
This gives me:
Out[87]:
A B C D Counts
0 foo one 0.007861 -0.451943 3
1 bar one -1.341386 -0.799740 3
2 foo two -0.290606 -0.445757 3
3 bar three 0.519251 -0.404406 2
4 foo two -0.627547 -0.784901 3
5 bar two 0.309421 0.234292 3
6 foo one -2.156879 0.898375 3
7 foo three -1.669896 0.498978 2
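I then created a new column as a boolean classifier, where 1 marks rows whose B value is repeated at least once and 0 marks rows that are not repeated (there are no 0s in this example):
data.loc[data.Counts >= 2, 'Repeat'] = 1
data.loc[data.Counts <= 1, 'Repeat'] = 0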
Out[89]:
A B C D Counts Repeat
0 foo one 0.007861 -0.451943 3 1
1 bar one -1.341386 -0.799740 3 1
2 foo two -0.290606 -0.445757 3 1
3 bar three 0.519251 -0.404406 2 1
4 foo two -0.627547 -0.784901 3 1
5 bar two 0.309421 0.234292 3 1
6 foo one -2.156879 0.898375 3 1
7 foo three -1.669896 0.498978 2 1
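What I want is another count column that counts how many times each element of B is repeated among rows with the same value of A, plus a boolean classifier based on that count. The result would be: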
Out[89]:
A B C D Counts Repeat CountsInsideA RepeatInsideA
0 foo one 0.007861 -0.451943 3 1 2 1
1 bar one -1.341386 -0.799740 3 1 1 0
2 foo two -0.290606 -0.445757 3 1 2 1
3 bar three 0.519251 -0.404406 2 1 1 0
4 foo two -0.627547 -0.784901 3 1 2 1
5 bar two 0.309421 0.234292 3 1 1 0
6 foo one -2.156879 0.898375 3 1 2 1
7 foo three -1.669896 0.498978 2 1 1 0
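Answer 0: (score: 1)
Check this out. First, you can create the Repeat column with np.where, which is more concise. Second, to count the repetitions of a specific A-B combination, we probably need to use groupby and merge the result with the original DataFrame: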
In [19]:
data = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
'B' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
'C' : np.random.randn(8), 'D' : np.random.randn(8)})
In [20]:
data['Counts'] = data.groupby(['B'])['B'].transform('count')
print(data)
A B C D Counts
0 foo one -0.973299 -0.248367 3
1 bar one 0.518526 0.987810 3
2 foo two -0.031224 0.340774 3
3 bar three -0.146824 -0.751124 2
4 foo two -0.748681 -0.128536 3
5 bar two 0.744051 0.604505 3
6 foo one -0.513386 1.262674 3
7 foo three 0.044814 0.810772 2
In [21]:
data['Repeat'] = np.where(data.Counts>1, 1, 0)
print(data)
A B C D Counts Repeat
0 foo one -0.973299 -0.248367 3 1
1 bar one 0.518526 0.987810 3 1
2 foo two -0.031224 0.340774 3 1
3 bar three -0.146824 -0.751124 2 1
4 foo two -0.748681 -0.128536 3 1
5 bar two 0.744051 0.604505 3 1
6 foo one -0.513386 1.262674 3 1
7 foo three 0.044814 0.810772 2 1
In [23]:
data = pd.merge(left=data,
                right=data.groupby(['A', 'B']).size().reset_index(name='CountsInsideA'),
                on=['A', 'B'],
                how='left')
print(data)
A B C D Counts Repeat CountsInsideA
0 foo one -0.973299 -0.248367 3 1 2
1 bar one 0.518526 0.987810 3 1 1
2 foo two -0.031224 0.340774 3 1 2
3 bar three -0.146824 -0.751124 2 1 1
4 foo two -0.748681 -0.128536 3 1 2
5 bar two 0.744051 0.604505 3 1 1
6 foo one -0.513386 1.262674 3 1 2
7 foo three 0.044814 0.810772 2 1 1
In [25]:
data['RepeatInsideA'] = np.where(data.CountsInsideA>1, 1, 0)
print(data)
A B C D Counts Repeat CountsInsideA RepeatInsideA
0 foo one -0.973299 -0.248367 3 1 2 1
1 bar one 0.518526 0.987810 3 1 1 0
2 foo two -0.031224 0.340774 3 1 2 1
3 bar three -0.146824 -0.751124 2 1 1 0
4 foo two -0.748681 -0.128536 3 1 2 1
5 bar two 0.744051 0.604505 3 1 1 0
6 foo one -0.513386 1.262674 3 1 2 1
7 foo three 0.044814 0.810772 2 1 1 0
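For reference, the groupby-and-merge steps above can be condensed into one self-contained sketch. It uses only the A and B columns from the question (C and D are left out because they do not affect the counts), and the names pair_counts and merged are illustrative, not from the original answer:

```python
import numpy as np
import pandas as pd

# Same A/B layout as the question; C and D are omitted on purpose.
data = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
})

# One row per (A, B) pair, holding the number of rows in that pair.
pair_counts = data.groupby(['A', 'B']).size().reset_index(name='CountsInsideA')

# A left merge keeps the original row order and attaches the count to each row.
merged = data.merge(pair_counts, on=['A', 'B'], how='left')

# 1 if the (A, B) pair occurs more than once, otherwise 0.
merged['RepeatInsideA'] = np.where(merged['CountsInsideA'] > 1, 1, 0)
print(merged)
```

Note that the merge returns a new frame with a fresh 0..n-1 index, so if the original index matters it would need to be restored separately.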
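Answer 1: (score: 1)
For the Repeat column, you can check whether data['Counts'] is greater than 1. This returns True/False values, which you can convert to int so that they become 1 or 0 respectively. Example: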
In [20]: data['Repeat'] = (data['Counts'] > 1).astype(int)
In [21]: data
Out[21]:
A B C D Counts Repeat
0 foo one -0.976018 -1.887011 3 1
1 bar one -0.481183 2.937111 3 1
2 foo two -0.702470 -0.328288 3 1
3 bar three 0.579954 -2.742163 2 1
4 foo two 2.125964 -0.689301 3 1
5 bar two 0.699109 -0.380017 3 1
6 foo one -1.667972 0.990599 3 1
7 foo three -1.937627 -0.834636 2 1
For the CountsInsideA column, you can use the same logic as for Counts, except that the groupby uses A as well as B. Example:
In [22]: data['CountsInsideA'] = data.groupby(['A','B'])['B'].transform('count')
In [23]: data
Out[23]:
A B C D Counts Repeat CountsInsideA
0 foo one -0.976018 -1.887011 3 1 2
1 bar one -0.481183 2.937111 3 1 1
2 foo two -0.702470 -0.328288 3 1 2
3 bar three 0.579954 -2.742163 2 1 1
4 foo two 2.125964 -0.689301 3 1 2
5 bar two 0.699109 -0.380017 3 1 1
6 foo one -1.667972 0.990599 3 1 2
7 foo three -1.937627 -0.834636 2 1 1
For RepeatInsideA, again use logic similar to that for Repeat. Example:
In [24]: data['RepeatInsideA'] = (data['CountsInsideA'] > 1).astype(int)
In [25]: data
Out[25]:
A B C D Counts Repeat CountsInsideA \
0 foo one -0.976018 -1.887011 3 1 2
1 bar one -0.481183 2.937111 3 1 1
2 foo two -0.702470 -0.328288 3 1 2
3 bar three 0.579954 -2.742163 2 1 1
4 foo two 2.125964 -0.689301 3 1 2
5 bar two 0.699109 -0.380017 3 1 1
6 foo one -1.667972 0.990599 3 1 2
7 foo three -1.937627 -0.834636 2 1 1
RepeatInsideA
0 1
1 0
2 1
3 0
4 1
5 0
6 1
7 0
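Since both pairs of columns follow the same pattern (count within a group, then flag counts above 1), the steps above could be wrapped in a small helper. This is only a sketch: the function add_count_and_flag and its parameter names are hypothetical, and only the A and B columns from the question are used.

```python
import pandas as pd

def add_count_and_flag(df, keys, count_col, flag_col):
    """Return a copy of df with a per-group count and a 0/1 repeat flag.

    Hypothetical helper: keys is a list of column names to group by.
    """
    out = df.copy()
    # transform('count') broadcasts each group's count back onto its rows.
    out[count_col] = out.groupby(keys)[keys[-1]].transform('count')
    # A boolean comparison cast to int gives 1 for repeated, 0 otherwise.
    out[flag_col] = (out[count_col] > 1).astype(int)
    return out

data = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
})

result = add_count_and_flag(data, ['B'], 'Counts', 'Repeat')
result = add_count_and_flag(result, ['A', 'B'], 'CountsInsideA', 'RepeatInsideA')
print(result)
```

Unlike the merge approach, transform keeps the original index, so the new columns align with the existing rows without any join.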
Answer 2: (score: 0)
You can do this easily:
In [57]:
data['CountsInsideA'] = data[['A' , 'B' , 'C']].groupby(['A' , 'B']).transform('count')
In [58]:
data['RepeatInsideA'] = np.where(data['CountsInsideA'] > 1 , 1 , 0)
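One detail worth checking with this approach: count tallies non-null values of the counted column, so if C contained NaN the result would differ from the number of rows in the group. The sketch below uses a made-up three-row frame (not the question's data) to compare transform('count') with transform('size'), which counts rows regardless of missing values:

```python
import numpy as np
import pandas as pd

# Made-up data: the (foo, one) group has two rows, one with a missing C.
df = pd.DataFrame({
    'A': ['foo', 'foo', 'bar'],
    'B': ['one', 'one', 'one'],
    'C': [1.0, np.nan, 2.0],
})

grouped = df.groupby(['A', 'B'])['C']

# count ignores NaN, so the (foo, one) rows get 1 rather than 2.
by_count = grouped.transform('count')

# size counts rows, so the (foo, one) rows get 2.
by_size = grouped.transform('size')

print(pd.DataFrame({'by_count': by_count, 'by_size': by_size}))
```

With the question's data, C has no missing values, so both variants would give the same CountsInsideA.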