SQL equivalent of a pandas DataFrame query

Asked: 2018-03-26 14:45:41

Tags: sql pandas google-bigquery

I am trying to write a SQL query that is equivalent to this pandas query:

df.groupby([df.ST_STATE, 'INVOICE'])['VALUE'].sum()

This returns every invoice and its summed value, grouped by state, as follows:

WV        114763      28.00
          116443      16.50
          116490      24.00
          116550      46.00
WY        100099       9.00
          100148       9.00
          100881      32.00
          101119      28.00

I have written a SQL query to get the same kind of result set from Google Cloud (BigQuery):

query = "SELECT State, Invoice, sum(Value) FROM ["+self.table+"] group by Invoice, State"

But it returns a standard flat result set:

    State   Invoice f0_
0   NY  100008  86.00
1   None    100335  64.00
2   NY  100685  60.00

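How can I change the SQL query to get the same result as my DataFrame example?

2 Answers:

Answer 0: (score: 2)

It looks like all you need to do is change the order of the grouping columns: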

query = "SELECT State, Invoice, sum(Value) FROM ["+self.table+"] group by State, Invoice"

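That way the grouping columns are listed in the same order as in your pandas example. Note that the order of columns in GROUP BY does not by itself guarantee the order of the returned rows; if you also need the rows sorted the way pandas displays them, append ORDER BY State, Invoice to the query.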
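To illustrate the point about ordering, here is a minimal sketch that runs the same kind of query locally. It assumes SQLite (via Python's built-in sqlite3 module) as a stand-in for BigQuery, a hypothetical table named invoices, and a few made-up rows loosely modelled on the question's output, so treat it as an illustration rather than the exact BigQuery call:

```python
import sqlite3

# SQLite stands in for BigQuery here; the table name "invoices" is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoices (State TEXT, Invoice INTEGER, Value REAL)")

# Made-up rows, inserted out of order on purpose.
conn.executemany(
    "INSERT INTO invoices VALUES (?, ?, ?)",
    [
        ("WY", 100099, 4.00),
        ("WV", 116443, 16.50),
        ("WY", 100099, 5.00),
        ("WV", 114763, 28.00),
        ("WY", 100148, 9.00),
    ],
)

# GROUP BY defines the groups; ORDER BY makes the row order deterministic.
rows = conn.execute(
    "SELECT State, Invoice, SUM(Value) AS total "
    "FROM invoices "
    "GROUP BY State, Invoice "
    "ORDER BY State, Invoice"
).fetchall()

for row in rows:
    print(row)
```

The rows come back sorted by state and then by invoice, which matches the order pandas shows for the grouped result.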

Answer 1: (score: 1)

Below is an example in pure SQL (BigQuery Standard SQL). Hopefully you will be able to "translate" it to pandas accordingly.

   
#standardSQL
WITH t AS (
  SELECT 'WV' state, [STRUCT<invoice INT64, value FLOAT64>
    (114763, 28.00),
    (114763, 16.50),
    (116490, 24.00),
    (116490, 46.00)
  ] info UNION ALL
  SELECT 'WY', [STRUCT<invoice INT64, value FLOAT64>
    (100099, 9.00),
    (100148, 9.00),
    (100099, 32.00),
    (100148, 28.00)
  ]
)
SELECT state, 
  ARRAY(
    SELECT AS STRUCT invoice, SUM(value) AS value 
    FROM UNNEST(info) i GROUP BY invoice
  ) info
FROM t   

The result has the same shape as your original data, as shown below:

Row state   info.invoice    info.value   
1   WV      114763          44.5     
            116490          70.0     
2   WY      100099          41.0     
            100148          37.0     

Note: I slightly modified your data sample so that there is actually something to group.
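For readers who want to check the numbers outside BigQuery, the nested grouping above can be sketched in plain Python. This is only an illustration of the same aggregation, not a pandas or BigQuery API: the names rows and nested are made up for the example, and the data is the sample from the WITH clause flattened into (state, invoice, value) tuples.

```python
from collections import defaultdict

# Same sample rows as the WITH clause above, flattened to (state, invoice, value).
rows = [
    ("WV", 114763, 28.00), ("WV", 114763, 16.50),
    ("WV", 116490, 24.00), ("WV", 116490, 46.00),
    ("WY", 100099, 9.00), ("WY", 100148, 9.00),
    ("WY", 100099, 32.00), ("WY", 100148, 28.00),
]

# nested[state][invoice] accumulates the per-invoice sum within each state.
nested = defaultdict(lambda: defaultdict(float))
for state, invoice, value in rows:
    nested[state][invoice] += value

# Print one block per state, showing the state label only on its first row.
for state in sorted(nested):
    for i, invoice in enumerate(sorted(nested[state])):
        label = state if i == 0 else ""
        print(f"{label:<4}{invoice:<8}{nested[state][invoice]}")
```

In pandas itself, the groupby from the question applied to a DataFrame holding these rows should yield the same per-invoice sums, with the state shown once per group because it is the outer level of the resulting index.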