PANDAS - correctly performing a nested groupby over more than one column (several columns make up the unique identifier)

Date: 2019-02-14 11:51:13

Tags: python sql pandas pandas-groupby

Please ignore the large number of columns; it was much easier to just copy and paste my current example.

The immediate problem: the four columns below, taken together, form the unique identifier of a row. These columns are param01, param02, param03, param04. I want to be able to observe how all the other columns change with param04 for a chosen unique combination of param01, param02, param03. In other words, if a combination of param01, param02, param03 corresponds to multiple entries of param04, I want to keep that result.

Ideally, at the end of this, I would like the table/dataframe reduced to the unique combinations of param01, param02, param03 that have more than one entry of param04. Finally, I want to plot any of the other columns as a function of a changing param04 for a specific combination of the other parameters.

I am looking for ideas on how to do this in pandas in an SQL-ish way.

    <100>_poisson  avg wall time (s)  bulk_hill  c_{11}  c_{12}  c_{44}  homo_poisson  param01  param02  param03  param04  shear_hill              time_generated  young_hill
 0          0.264                  0       91.6   160.0    57.4    75.8         0.214     50.0     50.0     11.0      4.0        64.8  2019-02-14 11:11:39.254305       157.3
 1          0.268                  0       89.5   154.9    56.8    76.8         0.211     70.0     50.0     11.0      4.0        64.2  2019-02-14 11:11:43.696335       155.4
 2          0.268                  0       89.3   154.7    56.6    76.8         0.210     90.0     50.0     11.0      4.0        64.2  2019-02-14 11:11:47.814102       155.3
 3          0.268                  0       89.3   154.7    56.6    76.7         0.210    110.0     50.0     11.0      4.0        64.1  2019-02-14 11:11:52.052636       155.2
 4          0.268                  0       89.5   154.9    56.8    76.8         0.211    130.0     50.0     11.0      4.0        64.1  2019-02-14 11:11:55.752065       155.3
 5          0.268                  0       89.3   154.7    56.6    76.7         0.210    150.0     50.0     11.0      4.0        64.1  2019-02-14 11:11:59.631407       155.2
 6          0.268                  0       89.3   154.7    56.6    76.7         0.210    110.0     30.0     11.0      4.0        64.1  2019-02-14 11:12:03.275825       155.2
 7          0.268                  0       89.3   154.7    56.6    76.7         0.210    110.0     40.0     11.0      4.0        64.1  2019-02-14 11:12:07.057999       155.2
 8          0.268                  0       89.3   154.7    56.6    76.7         0.210    110.0     60.0     11.0      4.0        64.1  2019-02-14 11:12:11.655756       155.2
 9          0.268                  0       89.3   154.7    56.6    76.3         0.211    110.0     50.0      7.0      4.0        63.9  2019-02-14 11:12:15.474917       154.8
10          0.268                  0       89.3   154.7    56.6    76.4         0.211    110.0     50.0      9.0      4.0        63.9  2019-02-14 11:12:19.727918       154.9
11          0.268                  0       89.3   154.7    56.6    76.9         0.210    110.0     50.0     13.0      4.0        64.2  2019-02-14 11:12:24.841238       155.3
12          0.268                  0       89.3   154.7    56.6    76.7         0.210    110.0     50.0     11.0      2.0        64.1  2019-02-14 11:12:29.916590       155.2
13          0.268                  0       89.3   154.7    56.6    76.7         0.210    110.0     50.0     11.0      3.0        64.1  2019-02-14 11:12:35.019309       155.2
14          0.268                  0       89.3   154.7    56.6    76.7         0.210    110.0     50.0     11.0      5.0        64.1  2019-02-14 11:12:39.904661       155.2
15          0.268                  0       89.3   154.7    56.6    76.7         0.210    110.0     50.0     11.0      6.0        64.1  2019-02-14 11:12:44.982282       155.2
16          0.017                  0      287.3   799.5    47.7   120.4         0.243     30.0     30.0      5.0      4.0       177.9  2019-02-14 11:12:50.124683       442.3
17          0.264                  0       91.6   159.9    57.5    76.2         0.213     40.0     30.0      5.0      4.0        65.0  2019-02-14 11:12:54.744038       157.7
18          0.264                  0       91.7   160.1    57.5    76.2         0.213     50.0     30.0      5.0      4.0        65.0  2019-02-14 11:12:58.547615       157.8
19          0.268                  0       89.4   154.8    56.6    76.4         0.210     60.0     30.0      5.0      4.0        64.1  2019-02-14 11:13:03.234323       155.3
20          4.923                  0       -5.8     0.0     0.0    46.3        -1.138     30.0     10.0      5.0      4.0       208.5  2019-02-14 11:13:08.527995       -57.4
21          0.015                  0      728.8  2305.4    96.4    75.6         0.334     30.0     20.0      5.0      4.0       272.0  2019-02-14 11:13:15.060308       725.7

1 answer:

Answer 0 (score: 0)

I hope I understood you correctly:

"I would like to reduce the table/dataframe to the unique combinations of param01, param02, param03 for which param04 has multiple entries."

So you need something like the SQL SELECT param01, param02, param03 GROUP BY param04 HAVING COUNT(*) > 1

If so:

import pandas as pd

html=r'<table><tbody><tr><th> </th><th>&lt;100&gt;_poisson </th><th>avg wall time (s) </th><th>bulk_hill </th><th>c_{11} </th><th>c_{12} </th><th>c_{44} </th><th>homo_poisson </th><th>param01 </th><th>param02 </th><th>param03 </th><th>param04 </th><th>shear_hill </th><th>time_generated </th><th>young_hill</th></tr><tr><td>0 </td><td>0.264 </td><td>0 </td><td>91.6 </td><td>160.0 </td><td>57.4 </td><td>75.8 </td><td>0.214 </td><td>50.0 </td><td>50.0 </td><td>11.0 </td><td>4.0 </td><td>64.8 </td><td>2019-02-14 11:11:39.254305 </td><td>157.3</td></tr><tr><td>1 </td><td>0.268 </td><td>0 </td><td>89.5 </td><td>154.9 </td><td>56.8 </td><td>76.8 </td><td>0.211 </td><td>70.0 </td><td>50.0 </td><td>11.0 </td><td>4.0 </td><td>64.2 </td><td>2019-02-14 11:11:43.696335 </td><td>155.4</td></tr><tr><td>2 </td><td>0.268 </td><td>0 </td><td>89.3 </td><td>154.7 </td><td>56.6 </td><td>76.8 </td><td>0.210 </td><td>90.0 </td><td>50.0 </td><td>11.0 </td><td>4.0 </td><td>64.2 </td><td>2019-02-14 11:11:47.814102 </td><td>155.3</td></tr><tr><td>3 </td><td>0.268 </td><td>0 </td><td>89.3 </td><td>154.7 </td><td>56.6 </td><td>76.7 </td><td>0.210 </td><td>110.0 </td><td>50.0 </td><td>11.0 </td><td>4.0 </td><td>64.1 </td><td>2019-02-14 11:11:52.052636 </td><td>155.2</td></tr><tr><td>4 </td><td>0.268 </td><td>0 </td><td>89.5 </td><td>154.9 </td><td>56.8 </td><td>76.8 </td><td>0.211 </td><td>130.0 </td><td>50.0 </td><td>11.0 </td><td>4.0 </td><td>64.1 </td><td>2019-02-14 11:11:55.752065 </td><td>155.3</td></tr><tr><td>5 </td><td>0.268 </td><td>0 </td><td>89.3 </td><td>154.7 </td><td>56.6 </td><td>76.7 </td><td>0.210 </td><td>150.0 </td><td>50.0 </td><td>11.0 </td><td>4.0 </td><td>64.1 </td><td>2019-02-14 11:11:59.631407 </td><td>155.2</td></tr><tr><td>6 </td><td>0.268 </td><td>0 </td><td>89.3 </td><td>154.7 </td><td>56.6 </td><td>76.7 </td><td>0.210 </td><td>110.0 </td><td>30.0 </td><td>11.0 </td><td>4.0 </td><td>64.1 </td><td>2019-02-14 11:12:03.275825 </td><td>155.2</td></tr><tr><td>7 </td><td>0.268 </td><td>0 </td><td>89.3 </td><td>154.7 </td><td>56.6 </td><td>76.7 </td><td>0.210 </td><td>110.0 </td><td>40.0 </td><td>11.0 </td><td>4.0 </td><td>64.1 </td><td>2019-02-14 11:12:07.057999 </td><td>155.2</td></tr><tr><td>8 </td><td>0.268 </td><td>0 </td><td>89.3 </td><td>154.7 </td><td>56.6 </td><td>76.7 </td><td>0.210 </td><td>110.0 </td><td>60.0 </td><td>11.0 </td><td>4.0 </td><td>64.1 </td><td>2019-02-14 11:12:11.655756 </td><td>155.2</td></tr></tbody></table>'
df = pd.read_html(html, header=0)[0]                         # parse the pasted HTML table into a DataFrame
df_params = df[['param01', 'param02', 'param03', 'param04']]
df_params.groupby('param04').filter(lambda x: len(x) > 1)    # keep only param04 groups with more than one row

Output:

     param01  param02  param03  param04
 0     50.0     50.0     11.0      4.0
 1     70.0     50.0     11.0      4.0
 2     90.0     50.0     11.0      4.0
 3    110.0     50.0     11.0      4.0
 4    130.0     50.0     11.0      4.0
 5    150.0     50.0     11.0      4.0
 6    110.0     30.0     11.0      4.0
 7    110.0     40.0     11.0      4.0
 8    110.0     60.0     11.0      4.0
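If the grouping is instead meant to go the other way around, i.e. keep a unique combination of param01, param02, param03 that corresponds to several entries of param04 (which is how I read the question's wording), a minimal sketch building on the df above could be the following. This is just a hedged sketch, not part of the original answer:

# Sketch: keep rows whose (param01, param02, param03) combination occurs
# with more than one distinct param04 value.
key = ['param01', 'param02', 'param03']
multi = df.groupby(key)['param04'].transform('nunique') > 1
df[multi]

Each group of df[multi].groupby(key) then holds the rows for one fixed (param01, param02, param03) combination, which can be plotted against param04.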

Edit: something EXISTS-like, to return all columns.

An analog of:

SELECT * FROM
    source_data T
    JOIN (SELECT param01, param02, param03
          FROM source_data
          GROUP BY param04
          HAVING COUNT(*) > 1) FLT
      ON T.param01 = FLT.param01
         AND T.param02 = FLT.param02
         AND T.param03 = FLT.param03

is:

pd.merge(df, df_params.groupby('param04').filter(lambda x: len(x) > 1), on=['param01','param02','param03'])

Although I think this could be written more concisely, it should be correct.
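For what it is worth, a possibly more concise variant (again only a sketch, under the same reading: keep every column for the rows whose param04 value occurs more than once) skips df_params and the merge and filters df directly:

# Sketch: apply the same group-size filter to the full frame so all columns survive.
df.groupby('param04').filter(lambda x: len(x) > 1)

# Or with a boolean mask on the group sizes:
df[df.groupby('param04')['param04'].transform('size') > 1]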