How do I group by a range of values in pandas?

Time: 2016-03-20 18:58:57

Tags: python pandas

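I have a DataFrame that I want to group by categorical variables and by a range of values. You can think of it as grouping rows with similar values (clusters?). E.g.: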

df = pd.DataFrame({'symbol' : ['IP', 'IP', 'IP', 'IP', 'IP', 'IP', 'IP'],
                   'serie' : ['A', 'B', 'A', 'B', 'A', 'B', 'B'],
                   'strike' : [10, 10, 12, 13, 12, 13, 14],
                   'last' : [1, 2, 2.5, 3, 4.5, 5, 6],
                   'price' : [11, 11, 11, 11, 11, 11, 11],
                   'type' : ['call', 'put', 'put', 'put', 'call', 'put', 'call']})

If I use

grouped = df.groupby(['symbol', 'serie', 'strike'])

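I have solved part of the problem, but I would like to group strike values that are close to each other, e.g. 10 and 11, 12 and 13, and so on, preferably within a % range.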

2 answers:

Answer 0 (score: 2)


Use pd.cut to create a categorical from the strike data, then do the groupby() on that column:
import pandas as pd

# Create DataFrame
df = pd.DataFrame({
    'symbol' : ['IP', 'IP', 'IP', 'IP', 'IP', 'IP', 'IP'],
    'serie' : ['A', 'B', 'A', 'B', 'A', 'B', 'B'],
    'strike' : [10, 10, 12, 13, 12, 13, 14],
    'last' : [1, 2, 2.5, 3, 4.5, 5, 6],
    'price' : [11, 11, 11, 11, 11, 11, 11],
    'type' : ['call', 'put', 'put', 'put', 'call', 'put', 'call']
})

# Create bins (for example, three equal-width bins across the data)
df['strikebins'] = pd.cut(df['strike'], bins=3)

print('Binned DataFrame:')
print(df)
print()

# Group the DataFrame by the categorical columns plus the bin column;
# observed=False keeps bins that are empty for a given serie
grouped = df.groupby(['symbol', 'serie', 'strikebins'], observed=False)

# Do something with the groups, for example sum the numeric columns
gp_sum = grouped[['last', 'price', 'strike']].sum()

print('Grouped Sum (for example):')
print(gp_sum)
print()

Output:

Binned DataFrame:
   last  price serie  strike symbol  type        strikebins
0   1.0     11     A      10     IP  call   (9.996, 11.333]
1   2.0     11     B      10     IP   put   (9.996, 11.333]
2   2.5     11     A      12     IP   put  (11.333, 12.667]
3   3.0     11     B      13     IP   put      (12.667, 14]
4   4.5     11     A      12     IP  call  (11.333, 12.667]
5   5.0     11     B      13     IP   put      (12.667, 14]
6   6.0     11     B      14     IP  call      (12.667, 14]

Grouped Sum (for example):
                               last  price  strike
symbol serie strikebins
IP     A     (9.996, 11.333]      1     11      10
             (11.333, 12.667]     7     22      24
             (12.667, 14]       NaN    NaN     NaN
       B     (9.996, 11.333]      2     11      10
             (11.333, 12.667]   NaN    NaN     NaN
             (12.667, 14]        14     33      40

If you like, you can drop() what you do not need, or replace strike with the mean of its range ...
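The last suggestion, replacing strike with a representative value for its range, could look roughly like the sketch below. It is not part of the original answer: the column name strike_mid is made up for illustration, and the midpoint is computed from the edges that pd.cut returns via retbins=True.

```python
import pandas as pd

df = pd.DataFrame({
    'symbol': ['IP'] * 7,
    'serie': ['A', 'B', 'A', 'B', 'A', 'B', 'B'],
    'strike': [10, 10, 12, 13, 12, 13, 14],
    'last': [1, 2, 2.5, 3, 4.5, 5, 6],
})

# labels=False returns integer bin codes; retbins=True also returns the edges.
codes, edges = pd.cut(df['strike'], bins=3, labels=False, retbins=True)

# Midpoint of each bin, computed from consecutive edges.
mids = (edges[:-1] + edges[1:]) / 2

# Hypothetical column: each row's strike replaced by its bin midpoint.
df['strike_mid'] = mids[codes.to_numpy()]

# Grouping on a plain numeric column only yields groups that actually occur.
summary = df.groupby(['symbol', 'serie', 'strike_mid'])['last'].sum().reset_index()
print(summary)
```

Because strike_mid is an ordinary float column rather than a categorical, the empty (serie, bin) combinations do not appear in the result.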

Answer 1 (score: 1)

I guess the OP wants to group by the categorical variables and then bin a numeric variable into intervals. In that case you can use np.digitize():

import numpy as np

smallest = np.min(df['strike'])
largest = np.max(df['strike'])
num_edges = 3
# np.digitize(input_array, bin_edges)
ind = np.digitize(df['strike'], np.linspace(smallest, largest, num_edges))

Then ind is

array([1, 1, 2, 2, 2, 2, 3], dtype=int64)

which corresponds to the binning of

 [10, 10, 12, 13, 12, 13, 14]
with the bin edges

array([ 10.,  12.,  14.]) # == np.linspace(smallest, largest, num_edges)
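Note that 14 gets an index of its own (3) because np.digitize treats bins as closed on the left by default. A small sketch of the two edge conventions, using the same strikes and edges as above (the variable names are only for illustration):

```python
import numpy as np

strike = np.array([10, 10, 12, 13, 12, 13, 14])
edges = np.linspace(strike.min(), strike.max(), 3)  # 10, 12, 14

# Default (right=False): edges[i-1] <= x < edges[i],
# so the maximum value falls past the last edge and gets its own index.
left_closed = np.digitize(strike, edges)

# right=True: edges[i-1] < x <= edges[i],
# so 14 joins the (12, 14] bin and the minimum value gets index 0.
right_closed = np.digitize(strike, edges, right=True)

print(left_closed)
print(right_closed)
```

Which convention is preferable depends on whether a strike sitting exactly on an edge should count toward the lower or the upper bin.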

Finally, group by all the columns you want, but use this additional bin column:

df['binned_strike'] = ind
# Each item yielded by groupby is a (key, sub-DataFrame) pair
for key, group in df.groupby(['symbol', 'serie', 'binned_strike']):
    print("group key")
    print(key)
    print("group content")
    print(group)
    print("=============")

This should print

group key
('IP', 'A', 1)
group content
   last  price serie  strike symbol  type  binned_strike
0   1.0     11     A      10     IP  call              1
=============
group key
('IP', 'A', 2)
group content
   last  price serie  strike symbol  type  binned_strike
2   2.5     11     A      12     IP   put              2
4   4.5     11     A      12     IP  call              2
=============
group key
('IP', 'B', 1)
group content
   last  price serie  strike symbol type  binned_strike
1   2.0     11     B      10     IP  put              1
=============
group key
('IP', 'B', 2)
group content
   last  price serie  strike symbol type  binned_strike
3   3.0     11     B      13     IP  put              2
5   5.0     11     B      13     IP  put              2
=============
group key
('IP', 'B', 3)
group content
   last  price serie  strike symbol  type  binned_strike
6   6.0     11     B      14     IP  call              3
=============
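The question also asks for bins defined by a % range, which equal-width edges do not give directly. One possible sketch, not taken from the original answer, uses geometric edges where each bin spans a fixed percentage of its lower edge. The 15% width, the choice of the smallest strike as the starting edge, and the column name pct_bin are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'symbol': ['IP'] * 7,
    'serie': ['A', 'B', 'A', 'B', 'A', 'B', 'B'],
    'strike': [10, 10, 12, 13, 12, 13, 14],
    'last': [1, 2, 2.5, 3, 4.5, 5, 6],
})

pct = 0.15                 # assumed bin width: 15% of the bin's lower edge
low = df['strike'].min()   # assumed starting edge: the smallest strike

# Bin k covers [low * (1 + pct)**k, low * (1 + pct)**(k + 1)),
# so k is the floor of log(strike / low) in base (1 + pct).
df['pct_bin'] = np.floor(np.log(df['strike'] / low) / np.log1p(pct)).astype(int)

sizes = df.groupby(['symbol', 'serie', 'pct_bin']).size()
print(df[['strike', 'pct_bin']])
print(sizes)
```

With these assumed settings the edges are roughly 10, 11.5, 13.2 and 15.2, so 12 and 13 share a bin while 14 falls into the next one. Strikes that sit very close to an edge may need a small tolerance because of floating-point rounding.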