
Question

I want to add supplementary information from df2 to df1 by matching on the date.

df1 is the main dataframe:

            x0      x1      x2      x3      x4      x5      ...  x10000  Date       
1           40      31.05   25.5    25.5    25.5    25      ...    33    2013-11-13
2           35      35.75   36.5    36.5    36.5    36.5    ...    29    2013-09-05
⋮           ⋮       ⋮        ⋮       ⋮       ⋮        ⋮               ⋮

df2 holds the supplementary weather information I want to add to df1:

year month day  maxtemp mintemp rainfall    wind 
2013    1   1   26.2    20.2     0          32.4
2013    1   2   22.9    20.3     0          10
2013    1   3   24.8    18.4     0          28.8
2013    1   4   26.6    18.3     0          33.5
2013    1   5   28.3    20.9     0          33.4
2013    1   6   28      21.6     0          32.8
2013    1   7   27.5    21.4     0          26.8
2013    1   8   42.3    20.9     0          25.5
2013    1   9   25      21.1     0          20.9
2013    1   10  25.4    20.2     0          14
⋮       ⋮    ⋮   ⋮        ⋮        ⋮           ⋮

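I need to add the previous 100 days of maxtemp, mintemp, rainfall and wind data from df2 to the end of every row of df1, horizontally, by matching year, month and day in df2 against Date in df1. Date counts as the 100th day, and the other 99 days are the 99 days before Date.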

Expected output:

     x0  x1    x2   x3   x4   x5   ... x10000 Date       max_t1...max_t100 min_t1...min_t100 rf1... rf100 w1 ... w100
1    40  31.05 25.5 25.5 25.5 25   ...  33    2013-01-01 26.2  ...         20.2  ...          0 ...       32.4...  
2    35  35.75 36.5 36.5 36.5 36.5 ...  29    2013-01-03 24.8  ...         18.4  ...          0 ...       28.8
⋮     ⋮   ⋮      ⋮    ⋮    ⋮     ⋮          ⋮

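Here max_t1 ... max_t100, min_t1 ... min_t100, rf1 ... rf100 and w1 ... w100 are the newly added column names (so there will be 400 new columns in total).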

Answer 1

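I suggest creating the 400 new columns in df2 first, then using pandas.DataFrame.merge to merge them into df1.

That splits the task into two problems:

Problem 1: compute aggregated values over the last x days

This has already been answered here.

Applied to your case: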

In[1]: import pandas as pd
       df2 = pd.DataFrame({"year": [2013, 2013, 2013, 2013, 2013],
                           "month": [1, 1, 1, 1, 1],
                           "day": [1, 2, 3, 4, 5],
                           "mintemp": [26.2, 22.9, 24.8, 11.2, 10],
                           "maxtemp": [28.2, 23.9, 25.8, 22.1, 12]})
       # Create date column (type datetime64[ns])
       df2["date"] = pd.to_datetime((df2[["year", "month", "day"]]))
       # Add the 400 columns needed (I am only adding 2 as an example)
       # If you change 2 to 100 you will get your 100
       colnumber = 2
       # Maxtemp
       for i in range(1, colnumber + 1):
           col_name = "max_t" + str(i)
           df2[col_name] = df2.set_index("date").rolling(i).max()["maxtemp"].values
       # Mintemp
       for i in range(1, colnumber + 1):
           col_name = "min_t" + str(i)
           df2[col_name] = df2.set_index("date").rolling(i).min()["mintemp"].values
       # TODO: Add rainfall and wind

In[2]:df2
Out[2]: 
   year  month  day  mintemp  maxtemp       date  max_t1  max_t2  min_t1  min_t2
0  2013  1      1    26.2     28.2    2013-01-01  28.2   NaN      26.2   NaN    
1  2013  1      2    22.9     23.9    2013-01-02  23.9    28.2    22.9    22.9  
2  2013  1      3    24.8     25.8    2013-01-03  25.8    25.8    24.8    22.9  
3  2013  1      4    11.2     22.1    2013-01-04  22.1    25.8    11.2    11.2  
4  2013  1      5    10.0     12.0    2013-01-05  12.0    22.1    10.0    10.0
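
Note that rolling(i).max() and rolling(i).min() give a running maximum or minimum over the last i rows. If what is wanted is the raw value from each of the preceding days, as the expected output in the question suggests, shift may be a closer fit. A minimal sketch, assuming df2 has exactly one row per calendar day with no gaps; the names n_days, prefixes and df2_lagged are made up for this illustration:

```python
import pandas as pd

# Sample data mirroring the df2 used in this answer.
df2 = pd.DataFrame({
    "year": [2013] * 5,
    "month": [1] * 5,
    "day": [1, 2, 3, 4, 5],
    "mintemp": [26.2, 22.9, 24.8, 11.2, 10.0],
    "maxtemp": [28.2, 23.9, 25.8, 22.1, 12.0],
})
df2["date"] = pd.to_datetime(df2[["year", "month", "day"]])
df2 = df2.sort_values("date").reset_index(drop=True)

n_days = 2  # assumption: set to 100 for the full problem
# Assumption: rainfall -> "rf" and wind -> "w" would be added to this mapping.
prefixes = {"maxtemp": "max_t", "mintemp": "min_t"}

lagged = {}
for col, prefix in prefixes.items():
    for i in range(1, n_days + 1):
        # <prefix>1 is the value on the row's own date, <prefix>2 the day before, etc.
        lagged[f"{prefix}{i}"] = df2[col].shift(i - 1)

# Build all new columns at once instead of inserting them one by one.
df2_lagged = pd.concat([df2, pd.DataFrame(lagged)], axis=1)
```

Collecting the shifted columns in a dict and concatenating once avoids repeatedly inserting columns into the frame, which matters more when there are 400 of them.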

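Problem 2: merge the two dataframes horizontally using a date column as the common key

You will first have to convert the column to datetime (similar answer here), then merge the dataframes on the common key.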

In[3]:df1 = pd.DataFrame({"x0": [40, 35, 33, 38],
                          "x1": [31.05, 35.75, 22, 28],
                          "x1000": [33, 29, 20, 18],
                          "Date": ["2013-1-1", "2013-1-2", "2013-1-3", "2013-1-4"]})
    # Creating common key with type datetime64[ns]
    df1["date"] = pd.to_datetime(df1["Date"])

Out[3]:
   x0     x1  x1000      Date       date
0  40  31.05  33     2013-1-1 2013-01-01
1  35  35.75  29     2013-1-2 2013-01-02
2  33  22.00  20     2013-1-3 2013-01-03
3  38  28.00  18     2013-1-4 2013-01-04

In[4]: # Merging
       df1.merge(df2, how="left", left_on=["date"], right_on=["date"])

Out[4]:
   x0     x1  x1000      Date       date  year  month  day  mintemp  maxtemp  max_t1  max_t2  min_t1  min_t2
0  40  31.05  33     2013-1-1 2013-01-01  2013  1      1    26.2     28.2     28.2   NaN      26.2   NaN    
1  35  35.75  29     2013-1-2 2013-01-02  2013  1      2    22.9     23.9     23.9    28.2    22.9    22.9  
2  33  22.00  20     2013-1-3 2013-01-03  2013  1      3    24.8     25.8     25.8    25.8    24.8    22.9  
3  38  28.00  18     2013-1-4 2013-01-04  2013  1      4    11.2     22.1     22.1    25.8    11.2    11.2
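
If you want the merge itself to confirm that it did not duplicate or miss rows, merge accepts the validate and indicator parameters. A small sketch with made-up data (the frames below are illustrative, not the ones from the question):

```python
import pandas as pd

df1 = pd.DataFrame({"x0": [40, 35], "Date": ["2013-01-01", "2013-01-02"]})
df1["date"] = pd.to_datetime(df1["Date"])

df2 = pd.DataFrame({
    "date": pd.to_datetime(["2013-01-01", "2013-01-02", "2013-01-03"]),
    "max_t1": [28.2, 23.9, 25.8],
})

merged = df1.merge(
    df2,
    how="left",
    on="date",
    validate="many_to_one",  # raises MergeError if a date occurs twice in df2
    indicator=True,          # adds a _merge column: "both" or "left_only"
)
```

Rows whose _merge value is "left_only" are dates in df1 that had no weather record in df2.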

Edit: added the output

Answer 2

I assume the Date column in df1 is of datetime type. If it is not, convert it.

Start with the following preparation steps:

Convert the year / month / day columns in df2 into an index (of datetime type):

df2 = df2.set_index(pd.to_datetime(df2.year * 10000 + df2.month * 100
    + df2.day, format='%Y%m%d')).drop(columns=['year', 'month', 'day'])
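
As a side note, pd.to_datetime can also assemble dates directly from columns named year, month and day, which avoids the integer encoding. A short sketch on a made-up two-row frame (raw, idx_a, idx_b and indexed are names invented here) showing that both routes give the same index:

```python
import pandas as pd

raw = pd.DataFrame({"year": [2013, 2013], "month": [1, 1], "day": [9, 10],
                    "maxtemp": [25.0, 25.4]})

# Route used above: encode the date as an integer such as 20130109 and parse it.
idx_a = pd.to_datetime(raw.year * 10000 + raw.month * 100 + raw.day,
                       format="%Y%m%d")
# Alternative: let pandas build the date from the year/month/day columns.
idx_b = pd.to_datetime(raw[["year", "month", "day"]])

indexed = raw.set_index(idx_b).drop(columns=["year", "month", "day"])
```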

Set the number of days for which columns should be added:
```
nDays = 3
```
For demonstration purposes I set it to just 3, but you can change it to 100 or any other value you like.

Define the names of the new columns (import itertools first):

cols = [ x + str(y) for x, y in itertools.product(
    ['max_t', 'min_t', 'rf', 'w'], range(1, nDays + 1)) ]

Define a function that generates the additional columns for the current row:

def fn(row):
    d1 = row.Date
    d2 = d1 + pd.Timedelta(nDays - 1, 'D')
    return pd.Series(df2.loc[d1:d2].values.reshape((1, -1),
        order='F').squeeze(), index=cols)

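Now the whole processing can be done in a single instruction, which applies the function above to each row and joins the result to the original DataFrame: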

df1 = df1.join(df1.apply(fn, axis=1))

Quite concise and, to a large extent, a pandasonic solution.

To demonstrate how this solution works, I made a few changes to your data:

df1:

   x0     x1    x2    x3       Date
0  40  31.05  25.5  25.5 2013-01-03
1  35  35.75  36.5  36.5 2013-01-07

df2 (initial content):

   year  month  day  maxtemp  mintemp  rainfall  wind
0  2013      1    1     26.2     20.2         0  32.4
1  2013      1    2     22.9     20.3         0  10.0
2  2013      1    3     24.8     18.4         1  28.8
3  2013      1    4     26.6     18.3         2  33.5
4  2013      1    5     28.3     20.9         3  33.4
5  2013      1    6     28.0     21.6         4  32.8
6  2013      1    7     27.5     21.4         5  26.8
7  2013      1    8     42.3     20.9         6  25.5
8  2013      1    9     25.0     21.1         7  20.9
9  2013      1   10     25.4     20.2         8  14.0

df2 (after conversion):

            maxtemp  mintemp  rainfall  wind
2013-01-01     26.2     20.2         0  32.4
2013-01-02     22.9     20.3         0  10.0
2013-01-03     24.8     18.4         1  28.8
2013-01-04     26.6     18.3         2  33.5
2013-01-05     28.3     20.9         3  33.4
2013-01-06     28.0     21.6         4  32.8
2013-01-07     27.5     21.4         5  26.8
2013-01-08     42.3     20.9         6  25.5
2013-01-09     25.0     21.1         7  20.9
2013-01-10     25.4     20.2         8  14.0

After adding the new columns, df1 contains:

   x0     x1    x2    x3       Date  max_t1  max_t2  max_t3  min_t1  min_t2  \
0  40  31.05  25.5  25.5 2013-01-03    24.8    26.6    28.3    18.4    18.3   
1  35  35.75  36.5  36.5 2013-01-07    27.5    42.3    25.0    21.4    20.9   

   min_t3  rf1  rf2  rf3    w1    w2    w3  
0    20.9  1.0  2.0  3.0  28.8  33.5  33.4  
1    21.1  5.0  6.0  7.0  26.8  25.5  20.9
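
For convenience, the steps above can be put together into one script that should reproduce this output. It is a sketch built from the sample data shown in this answer (df1 is trimmed to two x columns, and the index is built with pd.to_datetime on the year/month/day columns rather than the integer encoding); the variable result is a name introduced here:

```python
import itertools
import pandas as pd

df1 = pd.DataFrame({"x0": [40, 35], "x1": [31.05, 35.75],
                    "Date": pd.to_datetime(["2013-01-03", "2013-01-07"])})
df2 = pd.DataFrame({
    "year": [2013] * 10, "month": [1] * 10, "day": list(range(1, 11)),
    "maxtemp": [26.2, 22.9, 24.8, 26.6, 28.3, 28.0, 27.5, 42.3, 25.0, 25.4],
    "mintemp": [20.2, 20.3, 18.4, 18.3, 20.9, 21.6, 21.4, 20.9, 21.1, 20.2],
    "rainfall": [0, 0, 1, 2, 3, 4, 5, 6, 7, 8],
    "wind": [32.4, 10.0, 28.8, 33.5, 33.4, 32.8, 26.8, 25.5, 20.9, 14.0],
})
# Date index instead of the year / month / day columns.
df2 = df2.set_index(pd.to_datetime(df2[["year", "month", "day"]])) \
         .drop(columns=["year", "month", "day"])

nDays = 3
cols = [x + str(y) for x, y in itertools.product(
    ["max_t", "min_t", "rf", "w"], range(1, nDays + 1))]

def fn(row):
    d1 = row.Date
    d2 = d1 + pd.Timedelta(nDays - 1, "D")
    # order="F" flattens column by column: all maxtemp values, then mintemp, ...
    return pd.Series(df2.loc[d1:d2].values.reshape((1, -1), order="F").squeeze(),
                     index=cols)

result = df1.join(df1.apply(fn, axis=1))
```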

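Edit following the "previous 100 days" comment

If you want to take the rows for the 100 days preceding the current date, change how the two "boundary dates" are set in fn, like this: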

def fn(row):
    d1 = row.Date - pd.Timedelta(nDays, 'D')
    d2 = row.Date - pd.Timedelta(1, 'D')
    return pd.Series(df2.loc[d1:d2].values.reshape((1, -1), order='F')
        .squeeze(), index=cols)
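
One caveat: fn relies on df2.loc[d1:d2] returning exactly nDays rows. If a date is missing from df2, or the window reaches before its first date, the flattened array is shorter than cols and pd.Series raises a ValueError. A possible guard, sketched here with made-up data and a hypothetical fn_safe, is to reindex onto the full date range so that missing days become NaN:

```python
import itertools
import pandas as pd

nDays = 3
cols = [p + str(i) for p, i in itertools.product(
    ["max_t", "min_t"], range(1, nDays + 1))]

# Illustrative weather frame with a gap: 2013-01-03 is missing.
df2 = pd.DataFrame(
    {"maxtemp": [26.2, 22.9, 26.6, 28.3], "mintemp": [20.2, 20.3, 18.3, 20.9]},
    index=pd.to_datetime(["2013-01-01", "2013-01-02", "2013-01-04", "2013-01-05"]),
)
df1 = pd.DataFrame({"x0": [40], "Date": pd.to_datetime(["2013-01-05"])})

def fn_safe(row):
    # nDays calendar days ending the day before row.Date.
    window = pd.date_range(end=row.Date - pd.Timedelta(1, "D"),
                           periods=nDays, freq="D")
    # reindex adds NaN rows for dates absent from df2, so the block
    # always has exactly nDays rows.
    block = df2.reindex(window)
    return pd.Series(block.to_numpy().reshape(-1, order="F"), index=cols)

out = df1.join(df1.apply(fn_safe, axis=1))
```

As in fn, the suffix 1 refers to the oldest day in the window. reindex requires the index of df2 to be free of duplicate dates.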

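How to avoid the row count growing

If your df2 contains multiple rows for some dates, joining df1 with df2 makes the number of output rows grow.

If df2 has, say, 3 rows for some date, then a single row in df1 with that date turns into 3 rows in the result (all with the same date).

To avoid this, you have to "suppress" the duplication.

Initially I thought of df2 = df2.drop_duplicates(...), but you wrote that one row can contain one set of values and another row a different set, so we cannot arbitrarily keep one row and drop the other for the same date.

One possible solution is that, right after creating the date index, you should:

group df2 by the index (each group will hold the rows for one particular date),
compute the mean of each column (possible NaN values are ignored),
save the result back under df2.

The code to do this is: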

df2 = df2.groupby(level=0).mean()

Then you can join (as above) and the number of output rows should not grow.
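
To see what the grouping does, here is a tiny illustration on made-up data (the values and the name dedup are not from the question). The two rows for 2013-01-01 collapse into one, and the NaN is skipped when the mean is taken:

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame(
    {"maxtemp": [26.0, 28.0, 22.9], "rainfall": [0.0, np.nan, 1.0]},
    index=pd.to_datetime(["2013-01-01", "2013-01-01", "2013-01-02"]),
)

# One row per date; each column is averaged over the rows sharing that date.
dedup = df2.groupby(level=0).mean()
```

Whether the mean is the right way to combine conflicting readings depends on the data; for rainfall, for instance, sum or max might be more appropriate.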

Fill in values in another df based on date data in python

2 answers:
