Question

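I have a dataframe containing climate observations for 8 different cities in the US.

I am trying to find the number of heat waves (3 consecutive days with a maximum temperature of 90 degrees or higher) per year for each location in the dataset.

I am defining a heat wave as 3 consecutive days, but the 3-day windows must not overlap. For example: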

If Aug. 8 hit 87
   Aug. 9 hit 90
   Aug. 10 hit 92
   Aug. 11 hit 94
   Aug. 12 hit 93
   Aug. 13 hit 101
   Aug. 14 hit 94
   Aug. 15 hit 77

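In the "HeatWave" column, Aug. 9 and Aug. 12 would have a value of "1", reflecting two separate 3-day periods with maximums of 90 or higher.

My current strategy has not solved this problem.

I have been trying to use np.where. First, I check whether the temperature on the current day reached 90 or higher. Next, I check whether the maximums on the following two days reached 90 or higher. Finally, I check the two previous days to make sure there is no '1' in the HeatWave column. If all of these conditions are met, a 1 is placed in the "HeatWave" column for that row.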

summer['Next90'] = summer.Max.shift(-1)
summer['Following90'] = summer.Max.shift(-2)
summer['HeatWave'] = 0
summer['HeatWave'] = np.where(
    (summer['Max'] >= 90)
    & (summer['Next90'] >= 90)
    & (summer['Following90'] >= 90)
    & (summer.shift(1)['HeatWave'] != 1)
    & (summer.shift(2)['HeatWave'] != 1),
    1, np.nan)

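The problem with this approach is that, as far as I can tell, np.where does not re-check the "HeatWave" column after a 1 (or np.nan) has been placed in the previous row. As a result, I get a lot of "1"s in the HeatWave column, and sequences end up being counted more than once. I also tried this with iteration in a for loop, but ran into the same difficulty. Can anyone suggest a better approach?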
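To make the double counting concrete, here is a minimal sketch using the example temperatures from Aug. 8 to Aug. 15. The frame name `demo` and the column `NaiveFlag` are illustrative assumptions, not part of the original data. Because the whole condition is evaluated on the column as it exists before the assignment, every day that begins a run of three hot days gets flagged, including overlapping ones.

```python
import numpy as np
import pandas as pd

# Example temperatures from the question (illustrative frame, not the real dataset)
demo = pd.DataFrame({
    "date": pd.date_range("2018-08-08", periods=8, freq="D"),
    "Max": [87, 90, 92, 94, 93, 101, 94, 77],
})

hot = demo["Max"] >= 90
# True where this day and the next two days are all at or above 90
starts = hot & hot.shift(-1, fill_value=False) & hot.shift(-2, fill_value=False)
demo["NaiveFlag"] = np.where(starts, 1, 0)

print(demo)
```

Aug. 9, 10, 11 and 12 are all flagged here, whereas the desired result flags only Aug. 9 and Aug. 12.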

Answer 1

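Here is one approach you can try (the sample data is shown at the end of this post).

Load the data, then set the number of consecutive days to 3: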

import numpy as np
import pandas as pd

df = pd.read_csv('/path/to/file', sep=r'\s\s+', engine='python', parse_dates=['date'])
# N-day streak
N = 3

Drop potential duplicates, fill in missing dates, and set missing 'temp' values to 0:

# if the same date appears more than once, keep only the row with the highest temp
df = df.sort_values(['date', 'temp'], ascending=[True, False]).drop_duplicates(subset=['date'])

# insert rows for missing dates and fill the missing 'temp' values with 0
df = df.set_index('date').asfreq('D').reset_index().fillna(0)
print(df)
#         date  temp
#0  2018-08-01    83
#1  2018-08-02    99
#2  2018-08-03    99
#3  2018-08-04    87
#4  2018-08-05    90
#5  2018-08-06    92
#6  2018-08-07     0
#7  2018-08-08    92
#8  2018-08-09    90
#9  2018-08-10    92
#10 2018-08-11    94
#11 2018-08-12    93
#12 2018-08-13   101
#13 2018-08-14    94
#14 2018-08-15    77
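As a rough illustration of why the missing dates are filled in, the sketch below uses three made-up rows (the names `raw` and `filled` are assumptions for this example). Without the inserted row for Aug. 7, the rows for Aug. 6 and Aug. 8 would sit next to each other and could be mistaken for consecutive hot days.

```python
import pandas as pd

# Illustrative data with a gap on 2018-08-07
raw = pd.DataFrame({
    "date": pd.to_datetime(["2018-08-05", "2018-08-06", "2018-08-08"]),
    "temp": [90, 92, 92],
})

# asfreq('D') inserts a row (with NaN temp) for every missing calendar day;
# fillna(0) then turns that NaN into 0, which is below the 90-degree threshold
filled = raw.set_index("date").asfreq("D").reset_index().fillna(0)
print(filled)
```

Note that the 'temp' column becomes floating point once a NaN has been introduced.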

Set up the condition that qualifies a day for a heat wave:

# condition-1: df.temp >= 90
c1 = df.temp.ge(90)

Group consecutive rows by condition-1 and label the groups with g:

# group label (each group forms a streak)
g = (c1 != c1.shift()).cumsum()
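The labeling idiom can be checked on a short series. This is only a sketch; `temps`, `is_hot` and `labels` are names chosen for the illustration. Comparing each value with its predecessor marks the rows where a run begins, and the cumulative sum of those marks gives every run its own label.

```python
import pandas as pd

# First ten temperatures from the cleaned sample data
temps = pd.Series([83, 99, 99, 87, 90, 92, 0, 92, 90, 92])
is_hot = temps.ge(90)

# True on every row whose is_hot value differs from the previous row
# (the first row always counts as a change because shift() yields NaN there)
run_start = is_hot != is_hot.shift()

# Running total of the changes: one label per run of equal values
labels = run_start.cumsum()

print(pd.DataFrame({"temp": temps, "is_hot": is_hot, "label": labels}))
```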

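Define a new dataframe df1. For each group g, compute the following:

cnt: the total number of rows in the group
n: cumcount(), the sequence number of the row within the group
g: the group label, added here for reference only and not used in any further calculation

df1 = df.assign(
    cnt=df.groupby(g).date.transform('count'),
    n=df.groupby(g).cumcount(),
    g=g,
)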
print(df1)
#         date  temp  cnt  g  n
#0  2018-08-01    83    1  1  0
#1  2018-08-02    99    2  2  0
#2  2018-08-03    99    2  2  1
#3  2018-08-04    87    1  3  0
#4  2018-08-05    90    2  4  0
#5  2018-08-06    92    2  4  1
#6  2018-08-07     0    1  5  0
#7  2018-08-08    92    7  6  0
#8  2018-08-09    90    7  6  1
#9  2018-08-10    92    7  6  2
#10 2018-08-11    94    7  6  3
#11 2018-08-12    93    7  6  4
#12 2018-08-13   101    7  6  5
#13 2018-08-14    94    7  6  6
#14 2018-08-15    77    1  7  0

Define two more conditions:

# condition-2: cnt >= N, a streak must have at least N rows
c2 = df1.cnt.ge(N)

# condition-3: (n%N)==0 and (n+N) <= cnt
# the last row with n%N==0 might not have enough dates left for an N-day streak
c3 = df1.n.mod(N).eq(0) & df1.n.le(df1.cnt-N)
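The arithmetic behind condition-3 can be worked through in plain Python for the 7-day streak from Aug. 8 to Aug. 14. This is a sketch only; `starts` is a name introduced for the illustration.

```python
N = 3
cnt = 7  # length of the hot streak (Aug. 8 through Aug. 14 in the sample data)

# Offsets within the streak that begin a full, non-overlapping N-day window:
# n must be a multiple of N, and there must be N days left from n onward
starts = [n for n in range(cnt) if n % N == 0 and n + N <= cnt]

print(starts)       # offsets 0 and 3 qualify; offset 6 has only one day left
print(len(starts))  # same as cnt // N
```

Offsets 0 and 3 correspond to Aug. 8 and Aug. 11, which are exactly the rows that receive the flag.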

The final flag in df is then:

df['flag'] = np.where(c1 & c2 & c3, 1, 0)
print(df)
#         date  temp  flag
#0  2018-08-01    83     0
#1  2018-08-02    99     0
#2  2018-08-03    99     0
#3  2018-08-04    87     0
#4  2018-08-05    90     0
#5  2018-08-06    92     0
#6  2018-08-07     0     0
#7  2018-08-08    92     1
#8  2018-08-09    90     0
#9  2018-08-10    92     0
#10 2018-08-11    94     1
#11 2018-08-12    93     0
#12 2018-08-13   101     0
#13 2018-08-14    94     0
#14 2018-08-15    77     0

Drop the temporary df1:
```
del df1
```
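Since the question asks for counts per location and per year, the steps above could be wrapped in a function and applied per city. The following is a sketch under stated assumptions: the function name `flag_heat_waves`, the column names `city`, `date`, `temp` and `flag`, and the two-city data are all hypothetical, and each city is assumed to already have exactly one row per calendar day (as produced by the de-duplication and asfreq steps).

```python
import pandas as pd

def flag_heat_waves(temp: pd.Series, threshold: float = 90, n_days: int = 3) -> pd.Series:
    """Return 1 on the first day of each non-overlapping n_days hot window, else 0."""
    hot = temp.ge(threshold)
    streak = (hot != hot.shift()).cumsum()        # label each run of equal values
    pos = hot.groupby(streak).cumcount()          # position of the row within its run
    size = hot.groupby(streak).transform("count") # length of the run
    # a window starts on every n_days-th row that still has n_days rows left in the run
    starts = hot & pos.mod(n_days).eq(0) & pos.le(size - n_days)
    return starts.astype(int)

# Hypothetical observations for two cities (city "A" reuses the question's example)
obs = pd.DataFrame({
    "city": ["A"] * 8 + ["B"] * 8,
    "date": list(pd.date_range("2018-08-08", periods=8, freq="D")) * 2,
    "temp": [87, 90, 92, 94, 93, 101, 94, 77,
             91, 92, 89, 90, 90, 90, 85, 95],
})

obs = obs.sort_values(["city", "date"])
# compute the flag separately for each city so streaks never span two cities
obs["flag"] = obs.groupby("city")["temp"].transform(flag_heat_waves)

# number of heat waves per city and year
counts = obs.groupby(["city", obs["date"].dt.year.rename("year")])["flag"].sum()
print(counts)
```

For city "A" the flags land on Aug. 9 and Aug. 12, matching the expected result in the question. Condition-2 is not written out separately in this sketch because a row can only satisfy `pos <= size - n_days` when the run has at least `n_days` rows.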

Sample data

date           temp
Aug 1, 2018    83
Aug 2, 2018    99
Aug 2, 2018    65
Aug 3, 2018    99
Aug 2, 2018    70
Aug 4, 2018    87
Aug 5, 2018    90
Aug 6, 2018    92
Aug 8, 2018    92
Aug 9, 2018    90
Aug 10, 2018    92
Aug 11, 2018    94
Aug 12, 2018    93
Aug 13, 2018    101
Aug 14, 2018    94
Aug 15, 2018    77

Tracking 3-day streaks when a condition is met in a dataframe column, without double-counting streaks
