SQL analog of the Pandas diff() function (first discrete difference) [LAG function]

Asked: 2018-01-09 09:14:00

Tags: python sql oracle pandas difference

I am looking for a way to write a SQL query that applies the first discrete difference to an original series. This is very easy in Python with the Pandas .diff() method:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(0,100,size=(10, 2)), columns=list('AB'))

df["diff_A"]=df["A"].diff()
df["diff_B"]=df["B"].diff()

print(df)

The output I want is shown in the "diff_A" and "diff_B" columns:

    A   B  diff_A  diff_B
0  36  14     NaN     NaN
1  32  13    -4.0    -1.0
2  31  87    -1.0    74.0
3  58  88    27.0     1.0
4  44  34   -14.0   -54.0
5   2  43   -42.0     9.0
6  15  94    13.0    51.0
7  46  74    31.0   -20.0
8  60   9    14.0   -65.0
9  43  57   -17.0    48.0

I am using Oracle, but I would definitely prefer a clean ANSI solution.

2 answers:

Answer 0: (score: 2)

IIUC, you can use the analytic LAG function:

with v as (
  select rowid as rn, a, b from tab
)
select
  a, b,
  a - lag(a, 1) over(order by rn) as diff_a,
  b - lag(b, 1) over(order by rn) as diff_b
from v
order by rn;

PS: It is better to order by a real column (such as a date), because rowid can be changed.

For example:

select
  a, b,
  a - lag(a, 1) over(order by inserted) as diff_a,
  b - lag(b, 1) over(order by inserted) as diff_b
from tab;
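
The query above can be tried without an Oracle instance. The following is only a sketch: it uses Python's built-in sqlite3 module (window functions need SQLite 3.25 or later), the values from the sample output in the question, and a hypothetical column seq standing in for inserted as the explicit ordering column.

```python
import sqlite3

# Values of A and B copied from the sample output in the question.
rows = [(36, 14), (32, 13), (31, 87), (58, 88), (44, 34),
        (2, 43), (15, 94), (46, 74), (60, 9), (43, 57)]

conn = sqlite3.connect(":memory:")
# "seq" is a hypothetical explicit ordering column (a stand-in for "inserted").
conn.execute("CREATE TABLE tab (seq INTEGER PRIMARY KEY, a INTEGER, b INTEGER)")
conn.executemany(
    "INSERT INTO tab (seq, a, b) VALUES (?, ?, ?)",
    [(i, a, b) for i, (a, b) in enumerate(rows)],
)

query = """
SELECT a, b,
       a - LAG(a, 1) OVER (ORDER BY seq) AS diff_a,
       b - LAG(b, 1) OVER (ORDER BY seq) AS diff_b
FROM tab
ORDER BY seq
"""
result = conn.execute(query).fetchall()

# The first row has no predecessor, so both differences are NULL (None),
# which corresponds to the NaN that pandas shows.
for row in result:
    print(row)
```

The differences come back as integers rather than the floats pandas prints, because SQL has NULL for the missing first value and does not need to convert the column to float.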

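@MatBailie has posted a very good explanation:

Data sets in SQL are unordered. For deterministic results from LAG(), always use a sufficient ORDER BY clause. (If no such field exists, one should be created as or before the data is inserted into the SQL data set. The unordered nature of SQL data sets is what allows a great number of scalability and optimization options.)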
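
To illustrate why the ORDER BY matters, here is a small sketch (again with Python's sqlite3 and SQLite 3.25 or later, not Oracle). The table, the seq column, and the helper diffs are made up for the illustration. The same four values produce different differences depending on which column defines the order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "seq" is a hypothetical column recording insertion order.
conn.execute("CREATE TABLE tab (seq INTEGER, a INTEGER)")
conn.executemany(
    "INSERT INTO tab VALUES (?, ?)",
    [(1, 36), (2, 32), (3, 31), (4, 58)],
)

def diffs(order_col):
    """Return (a, diff_a) pairs with LAG evaluated in the given order."""
    # Only fixed, known column names are interpolated into the SQL text.
    if order_col not in ("seq", "a"):
        raise ValueError("unexpected column name")
    sql = (
        f"SELECT a, a - LAG(a) OVER (ORDER BY {order_col}) "
        f"FROM tab ORDER BY {order_col}"
    )
    return conn.execute(sql).fetchall()

by_seq = diffs("seq")  # differences between consecutive inserted rows
by_a = diffs("a")      # differences between consecutive sorted values

print(by_seq)  # [(36, None), (32, -4), (31, -1), (58, 27)]
print(by_a)    # [(31, None), (32, 1), (36, 4), (58, 22)]
```

Only the first ordering reproduces what .diff() returns for a series stored in insertion order, which is why the ordering column has to be chosen explicitly.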

SQL Fiddle test

PS Windowing functions were added to the ANSI/ISO Standard SQL:2003 and then extended in ANSI/ISO Standard SQL:2008. Microsoft was late to this game. DB2, Oracle, Sybase, PostgreSQL and other products have had full implementations for years. SQL Server did not catch up until SQL 2012.

Answer 1: (score: 2)

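I am posting this answer only because I was able to reproduce the results in SQLFiddle after following the comments on the accepted answer. Apart from rowid changing after the fact, is there a valid argument why this simple answer would not work?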

select
  a, b,
  a - lag(a, 1) over(order by rowid) as diff_a,
  b - lag(b, 1) over(order by rowid) as diff_b
from tab;