I am trying to write a SQL query that is equivalent to this pandas expression:
df.groupby([df.ST_STATE, 'INVOICE'])['VALUE'].sum()
This returns every invoice and its summed value, grouped by state, like so:
WV 114763 28.00
116443 16.50
116490 24.00
116550 46.00
WY 100099 9.00
100148 9.00
100881 32.00
101119 28.00
I have written a SQL query to fetch the same kind of result set from Google Cloud:
query = "SELECT State, Invoice, sum(Value) FROM ["+self.table+"] group by Invoice, State"
But it returns a standard flat result set:
State Invoice f0_
0 NY 100008 86.00
1 None 100335 64.00
2 NY 100685 60.00
How can I change the SQL query to get the same result as in my DataFrame example?
Answer 0 (score: 2)
It looks like all you need to do is change the order of the columns in the group by clause:
query = "SELECT State, Invoice, sum(Value) FROM ["+self.table+"] group by State, Invoice"
That way the grouping is applied in the same order as in your pandas example.
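One caveat worth keeping in mind: in SQL, row order is generally only guaranteed by an explicit ORDER BY, so it may be safer to add one if you want the rows sorted by state and then invoice the way pandas shows them. Below is a minimal sketch of that idea. It uses Python's built-in sqlite3 module as a stand-in for BigQuery, and the table name `invoices`, the alias `total`, and the sample rows are all assumptions made for illustration, not part of the original question.

```python
import sqlite3

# Hypothetical sample rows (State, Invoice, Value); SQLite stands in for BigQuery.
rows = [
    ("WY", 100099, 9.00),
    ("WV", 114763, 28.00),
    ("WY", 100099, 32.00),
    ("WV", 116443, 16.50),
    ("WV", 114763, 16.50),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoices (State TEXT, Invoice INTEGER, Value REAL)")
conn.executemany("INSERT INTO invoices VALUES (?, ?, ?)", rows)

# GROUP BY defines the groups; ORDER BY is what fixes the row order.
query = (
    "SELECT State, Invoice, SUM(Value) AS total "
    "FROM invoices "
    "GROUP BY State, Invoice "
    "ORDER BY State, Invoice"
)
result = conn.execute(query).fetchall()

# Print the state only on the first row of each group,
# roughly mimicking how pandas displays a MultiIndex.
previous_state = None
for state, invoice, total in result:
    label = state if state != previous_state else ""
    print(f"{label:<4}{invoice:<8}{total:.2f}")
    previous_state = state
```

The blank state label on repeated rows is purely a display choice; the underlying result is still one flat row per (State, Invoice) pair.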
Answer 1 (score: 1)
Below is an example in pure SQL (BigQuery Standard SQL). Hopefully you will be able to "translate" it into pandas accordingly.
#standardSQL
WITH t AS (
SELECT 'WV' state, [STRUCT<invoice INT64, value FLOAT64>
(114763, 28.00),
(114763, 16.50),
(116490, 24.00),
(116490, 46.00)
] info UNION ALL
SELECT 'WY', [STRUCT<invoice INT64, value FLOAT64>
(100099, 9.00),
(100148, 9.00),
(100099, 32.00),
(100148, 28.00)
]
)
SELECT state,
ARRAY(
SELECT AS STRUCT invoice, SUM(value) AS value
FROM UNNEST(info) i GROUP BY invoice
) info
FROM t
The result has the same shape as your original data, as shown below:
Row state info.invoice info.value
1 WV 114763 44.5
116490 70.0
2 WY 100099 41.0
100148 37.0
Note: I slightly modified your sample data so that there is actually something to group.
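For comparison, the same nested shape (one entry per state, each holding its invoices and summed values) can be sketched in plain Python. This is only an illustration under stated assumptions: the helper name `nest_by_state` is hypothetical, and the rows are the modified sample data from the query above rather than anything read from BigQuery.

```python
from collections import defaultdict

# Modified sample data from the query above: (state, invoice, value).
rows = [
    ("WV", 114763, 28.00),
    ("WV", 114763, 16.50),
    ("WV", 116490, 24.00),
    ("WV", 116490, 46.00),
    ("WY", 100099, 9.00),
    ("WY", 100148, 9.00),
    ("WY", 100099, 32.00),
    ("WY", 100148, 28.00),
]


def nest_by_state(rows):
    """Return {state: {invoice: summed value}} (hypothetical helper)."""
    nested = defaultdict(lambda: defaultdict(float))
    for state, invoice, value in rows:
        nested[state][invoice] += value
    # Convert to plain dicts so the result prints and compares cleanly.
    return {state: dict(invoices) for state, invoices in nested.items()}


info = nest_by_state(rows)
for state, invoices in info.items():
    for invoice, value in invoices.items():
        print(state, invoice, value)
```

The outer dictionary plays the role of the `state` column and the inner one the repeated `info` field in the query result.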