Question

I have a DataFrame like the following:

import pandas as pd
import numpy as np
date = pd.date_range(start='2020-01-01', freq='H', periods=4) 
locations = ["AA3", "AB1", "AD1", "AC0"] 
x = [5.5, 10.2, np.nan, 2.3, 11.2, np.nan, 2.1, 4.0, 6.1, np.nan, 20.3, 11.3, 4.9, 15.2, 21.3, np.nan] 

df = pd.DataFrame({'x': x}) 
df.index = pd.MultiIndex.from_product([locations, date], names=['location', 'date']) 
df = df.sort_index() 
df

                                 x
location date                     
AA3      2020-01-01 00:00:00   5.5
         2020-01-01 01:00:00  10.2
         2020-01-01 02:00:00   NaN
         2020-01-01 03:00:00   2.3
AB1      2020-01-01 00:00:00  11.2
         2020-01-01 01:00:00   NaN
         2020-01-01 02:00:00   2.1
         2020-01-01 03:00:00   4.0
AC0      2020-01-01 00:00:00   4.9
         2020-01-01 01:00:00  15.2
         2020-01-01 02:00:00  21.3
         2020-01-01 03:00:00   NaN
AD1      2020-01-01 00:00:00   6.1
         2020-01-01 01:00:00   NaN
         2020-01-01 02:00:00  20.3
         2020-01-01 03:00:00  11.3

The index values are location codes and times of day. I want to fill the missing values in column x with valid values of the same column from the nearest location on the same day and at the same hour, where the proximity of each location to the others is defined as

nearest = pd.DataFrame({"AA3": ["AA3", "AB1", "AD1", "AC0"],
                        "AB1": ["AB1", "AA3", "AC0", "AD1"],
                        "AD1": ["AD1", "AC0", "AB1", "AA3"],
                        "AC0": ["AC0", "AD1", "AA3", "AB1"]})
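nearest

   AA3  AB1  AD1  AC0
0  AA3  AB1  AD1  AC0
1  AB1  AA3  AC0  AD1
2  AD1  AC0  AB1  AA3
3  AC0  AD1  AA3  AB1

In this DataFrame, the column names are location codes, and the rows under each column list the locations in order of proximity to the location named by the column, nearest first.

If the nearest location is also missing a value on the same day and at the same hour, I take the value of the second nearest location on the same day and at the same hour. If the second nearest is missing too, I take the third nearest on the same day and at the same hour, and so on.

Desired output:

                                 x
location date                     
AA3      2020-01-01 00:00:00   5.5
         2020-01-01 01:00:00  10.2
         2020-01-01 02:00:00   2.1
         2020-01-01 03:00:00   2.3
AB1      2020-01-01 00:00:00  11.2
         2020-01-01 01:00:00  10.2
         2020-01-01 02:00:00   2.1
         2020-01-01 03:00:00   4.0
AC0      2020-01-01 00:00:00   4.9
         2020-01-01 01:00:00  15.2
         2020-01-01 02:00:00  21.3
         2020-01-01 03:00:00  11.3
AD1      2020-01-01 00:00:00   6.1
         2020-01-01 01:00:00  15.2
         2020-01-01 02:00:00  20.3
         2020-01-01 03:00:00  11.3

Update: an approach based on @kiona1018's suggestion works as expected, but it is slow.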

Answer 1

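I agree with Serial Lazer that there is no neater way to do this with pandas/NumPy. The requirement is a bit complicated, so in a case like this you should write your own function. An example is below.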

nearest = pd.DataFrame({"AA3": ["AA3", "AB1", "AD1", "AC0"],
                        "AB1": ["AB1", "AA3", "AC0", "AD1"],
                        "AD1": ["AD1", "AC0", "AB1", "AA3"],
                        "AC0": ["AC0", "AD1", "AA3", "AB1"]})


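def fill_by_nearest(sr: pd.Series):
    # Rows that already have a value are returned unchanged.
    if not np.isnan(sr['x']):
        return sr

    # With axis=1, sr.name is the row's (location, date) index tuple.
    location, date = sr.name
    # Try the candidate locations from nearest to farthest.
    for near_location in nearest[location]:
        value = df.loc[(near_location, date), 'x']
        if not np.isnan(value):
            sr['x'] = value
            return sr
    # No location has a value at this timestamp; leave the NaN in place.
    return sr

df = df.apply(fill_by_nearest, axis=1)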
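
Since the row-wise apply above is reported to be slow, here is a sketch of a column-wise alternative, not part of the original answer. It assumes every location in `nearest` also appears in `df` and that each (location, date) pair is unique. The names `wide`, `filled`, `candidate` and `result` are made up for illustration. The idea is to pivot to one column per location and then, for each proximity rank, fill the remaining gaps from that rank's neighbour column with a single `fillna` call.

```python
import numpy as np
import pandas as pd

# Rebuild the question's data so the sketch runs on its own.
date = pd.date_range(start="2020-01-01", periods=4, freq=pd.Timedelta(hours=1))
locations = ["AA3", "AB1", "AD1", "AC0"]
x = [5.5, 10.2, np.nan, 2.3, 11.2, np.nan, 2.1, 4.0,
     6.1, np.nan, 20.3, 11.3, 4.9, 15.2, 21.3, np.nan]
df = pd.DataFrame({"x": x})
df.index = pd.MultiIndex.from_product([locations, date], names=["location", "date"])
df = df.sort_index()

nearest = pd.DataFrame({"AA3": ["AA3", "AB1", "AD1", "AC0"],
                        "AB1": ["AB1", "AA3", "AC0", "AD1"],
                        "AD1": ["AD1", "AC0", "AB1", "AA3"],
                        "AC0": ["AC0", "AD1", "AA3", "AB1"]})

# Wide layout: one row per timestamp, one column per location.
wide = df["x"].unstack("location")
filled = wide.copy()

# Row 0 of `nearest` is the location itself, so start at rank 1.
for rank in range(1, len(nearest)):
    # For each location column, look up its rank-th nearest neighbour.
    neighbours = nearest.loc[rank, filled.columns].tolist()
    # Take the neighbours' original values and relabel them with the target columns.
    candidate = wide[neighbours].set_axis(filled.columns, axis=1)
    # Only cells that are still NaN are filled; nearer ranks win.
    filled = filled.fillna(candidate)

# Back to the original (location, date) layout and row order.
result = filled.unstack().reindex(df.index).to_frame("x")
print(result)
```

The loop runs once per proximity rank rather than once per row, so the number of Python-level iterations depends on the number of locations, not on the number of timestamps. Candidates are always read from the original `wide` values, so a gap is never filled with a value that was itself filled in.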

Answer 2

You can use the apply function:

def find_nearest(row):
    # Walk the candidate locations from nearest to farthest
    # (the first candidate is the row's own location).
    for item in nearest[row['location']]:
        match = df[(df['location'] == item) & (df['date'] == row['date']) & df['x'].notna()]
        if len(match):
            return match['x'].values[0]
    # No location has a value at this timestamp.
    return np.nan

df = df.reset_index()
df = df.assign(x=lambda d: d.apply(find_nearest, axis=1))

Output:

   location                date     x
0       AA3 2020-01-01 00:00:00   5.5
1       AA3 2020-01-01 01:00:00  10.2
2       AA3 2020-01-01 02:00:00   2.1
3       AA3 2020-01-01 03:00:00   2.3
4       AB1 2020-01-01 00:00:00  11.2
5       AB1 2020-01-01 01:00:00  10.2
6       AB1 2020-01-01 02:00:00   2.1
7       AB1 2020-01-01 03:00:00   4.0
8       AC0 2020-01-01 00:00:00   4.9
9       AC0 2020-01-01 01:00:00  15.2
10      AC0 2020-01-01 02:00:00  21.3
11      AC0 2020-01-01 03:00:00  11.3
12      AD1 2020-01-01 00:00:00   6.1
13      AD1 2020-01-01 01:00:00  15.2
14      AD1 2020-01-01 02:00:00  20.3
15      AD1 2020-01-01 03:00:00  11.3
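
The same lookup can also be written as a merge on the long-format data, which avoids filtering the whole frame once per row. This is a sketch, not part of the original answer, and the names `ranks`, `values`, `candidates` and `filled` are made up for illustration. It relies on `GroupBy.first()` skipping missing values by default, so for each (location, date) it picks the value from the nearest location that has one.

```python
import numpy as np
import pandas as pd

# Rebuild the question's data so the sketch runs on its own.
date = pd.date_range(start="2020-01-01", periods=4, freq=pd.Timedelta(hours=1))
locations = ["AA3", "AB1", "AD1", "AC0"]
x = [5.5, 10.2, np.nan, 2.3, 11.2, np.nan, 2.1, 4.0,
     6.1, np.nan, 20.3, 11.3, 4.9, 15.2, 21.3, np.nan]
df = pd.DataFrame({"x": x})
df.index = pd.MultiIndex.from_product([locations, date], names=["location", "date"])
df = df.sort_index()

nearest = pd.DataFrame({"AA3": ["AA3", "AB1", "AD1", "AC0"],
                        "AB1": ["AB1", "AA3", "AC0", "AD1"],
                        "AD1": ["AD1", "AC0", "AB1", "AA3"],
                        "AC0": ["AC0", "AD1", "AA3", "AB1"]})

# Long table of (rank, location, neighbour); rank 0 is the location itself.
ranks = (nearest.rename_axis("rank")
                .reset_index()
                .melt(id_vars="rank", var_name="location", value_name="neighbour"))

# The observed values, keyed by the location that supplies them.
values = df.reset_index().rename(columns={"location": "neighbour"})

# Every (location, date) paired with every candidate value, nearest first.
candidates = ranks.merge(values, on="neighbour").sort_values("rank")

# first() skips NaN, so this keeps the nearest non-missing value per group.
filled = candidates.groupby(["location", "date"])["x"].first()
print(filled.reset_index())
```

The merge produces one row per location, candidate and timestamp, so memory grows with the square of the number of locations. That is fine for a handful of locations but worth checking before using it on a large network.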

Fill missing values in a pandas DataFrame based on the nearest location in another DataFrame
