I have a DataFrame similar to the following:
import pandas as pd
import numpy as np
date = pd.date_range(start='2020-01-01', freq='H', periods=4)
locations = ["AA3", "AB1", "AD1", "AC0"]
x = [5.5, 10.2, np.nan, 2.3, 11.2, np.nan, 2.1, 4.0, 6.1, np.nan, 20.3, 11.3, 4.9, 15.2, 21.3, np.nan]
df = pd.DataFrame({'x': x})
df.index = pd.MultiIndex.from_product([locations, date], names=['location', 'date'])
df = df.sort_index()
df
                                 x
location date
AA3      2020-01-01 00:00:00   5.5
         2020-01-01 01:00:00  10.2
         2020-01-01 02:00:00   NaN
         2020-01-01 03:00:00   2.3
AB1      2020-01-01 00:00:00  11.2
         2020-01-01 01:00:00   NaN
         2020-01-01 02:00:00   2.1
         2020-01-01 03:00:00   4.0
AC0      2020-01-01 00:00:00   4.9
         2020-01-01 01:00:00  15.2
         2020-01-01 02:00:00  21.3
         2020-01-01 03:00:00   NaN
AD1      2020-01-01 00:00:00   6.1
         2020-01-01 01:00:00   NaN
         2020-01-01 02:00:00  20.3
         2020-01-01 03:00:00  11.3
The index values are location codes and timestamps (date and hour). I want to fill the missing values of column x with valid values of the same column from the nearest location at the same date and hour, where the proximity of each location to the others is defined as:
nearest = pd.DataFrame({"AA3": ["AA3", "AB1", "AD1", "AC0"],
"AB1": ["AB1", "AA3", "AC0", "AD1"],
"AD1": ["AD1", "AC0", "AB1", "AA3"],
"AC0": ["AC0", "AD1", "AA3", "AB1"]})
nearest
   AA3  AB1  AD1  AC0
0  AA3  AB1  AD1  AC0
1  AB1  AA3  AC0  AD1
2  AD1  AC0  AB1  AA3
3  AC0  AD1  AA3  AB1
In this dataset, the column names are location codes, and the values under each column list the other locations in order of proximity to the location named by that column.
If the nearest location is also missing a value at the same date and hour, I take the value from the second-nearest location at that date and hour. If the second-nearest location is missing too, I take the third-nearest, and so on.
Desired output:
                                 x
location date
AA3      2020-01-01 00:00:00   5.5
         2020-01-01 01:00:00  10.2
         2020-01-01 02:00:00   2.1
         2020-01-01 03:00:00   2.3
AB1      2020-01-01 00:00:00  11.2
         2020-01-01 01:00:00  10.2
         2020-01-01 02:00:00   2.1
         2020-01-01 03:00:00   4.0
AC0      2020-01-01 00:00:00   4.9
         2020-01-01 01:00:00  15.2
         2020-01-01 02:00:00  21.3
         2020-01-01 03:00:00  11.3
AD1      2020-01-01 00:00:00   6.1
         2020-01-01 01:00:00  15.2
         2020-01-01 02:00:00  20.3
         2020-01-01 03:00:00  11.3
An approach based on @kiona1018's suggestion works as expected, but it is slow.
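Answer 0 (score: 1)
I agree with Serial Lazer that there is no neater pandas/NumPy way to do this. The requirement is a bit complicated, so in a case like this you should write your own function. Here is an example.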
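nearest = pd.DataFrame({"AA3": ["AA3", "AB1", "AD1", "AC0"],
                        "AB1": ["AB1", "AA3", "AC0", "AD1"],
                        "AD1": ["AD1", "AC0", "AB1", "AA3"],
                        "AC0": ["AC0", "AD1", "AA3", "AB1"]})

def fill_by_nearest(sr: pd.Series) -> pd.Series:
    # Rows that already have a value are returned unchanged.
    if not np.isnan(sr['x']):
        return sr
    # sr.name is the (location, date) index tuple of the current row.
    location, date = sr.name
    # Walk the candidates from nearest to farthest and take the first valid value.
    for near_location in nearest[location]:
        value = df.loc[(near_location, date), 'x']
        if not np.isnan(value):
            sr['x'] = value
            return sr
    # No location has a value at this timestamp, so the NaN stays.
    return sr

df = df.apply(fill_by_nearest, axis=1)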
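Since the question notes that this is slow, one possible reason is that apply with axis=1 calls the function for every row, including the rows that need no filling. Below is a minimal sketch of the same idea that only visits the missing rows. The helper name fill_missing_rows is hypothetical, the sample data is copied from the question (with the hourly step written as pd.Timedelta(hours=1) rather than freq='H'), and it assumes every candidate location has a row for every timestamp.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
date = pd.date_range(start="2020-01-01", freq=pd.Timedelta(hours=1), periods=4)
locations = ["AA3", "AB1", "AD1", "AC0"]
x = [5.5, 10.2, np.nan, 2.3, 11.2, np.nan, 2.1, 4.0,
     6.1, np.nan, 20.3, 11.3, 4.9, 15.2, 21.3, np.nan]
df = pd.DataFrame({"x": x})
df.index = pd.MultiIndex.from_product([locations, date], names=["location", "date"])
df = df.sort_index()

nearest = pd.DataFrame({"AA3": ["AA3", "AB1", "AD1", "AC0"],
                        "AB1": ["AB1", "AA3", "AC0", "AD1"],
                        "AD1": ["AD1", "AC0", "AB1", "AA3"],
                        "AC0": ["AC0", "AD1", "AA3", "AB1"]})


def fill_missing_rows(frame: pd.DataFrame, ranking: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: fill NaNs in 'x' from the nearest location that has a value."""
    source = frame["x"]      # original values, only read from
    filled = source.copy()   # values written to
    # Only the rows that are actually missing need a lookup.
    for location, when in source.index[source.isna()]:
        # Candidates are ordered from nearest to farthest.
        for candidate in ranking[location]:
            # Assumes (candidate, when) exists in the index.
            value = source.loc[(candidate, when)]
            if not np.isnan(value):
                filled.loc[(location, when)] = value
                break
    return frame.assign(x=filled)


result = fill_missing_rows(df, nearest)
print(result)
```

The lookups always read from the original column, so a value that was itself filled in is never propagated to another location.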
Answer 1 (score: 1)
You can use the apply function:
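def find_nearest(row):
    # Try the candidate locations from nearest to farthest.
    for item in nearest[row['location']]:
        # Non-null values at the candidate location for the same timestamp.
        match = df[(df['location'] == item) & (df['date'] == row['date']) & df['x'].notna()]
        if len(match):
            return match['x'].values[0]
    # No location has a value at this timestamp.
    return np.nan

# Turn the index levels into ordinary columns so they can be compared directly.
df = df.reset_index()
df = df.assign(x=lambda d: d.apply(find_nearest, axis=1))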
   location                date     x
0       AA3 2020-01-01 00:00:00   5.5
1       AA3 2020-01-01 01:00:00  10.2
2       AA3 2020-01-01 02:00:00   2.1
3       AA3 2020-01-01 03:00:00   2.3
4       AB1 2020-01-01 00:00:00  11.2
5       AB1 2020-01-01 01:00:00  10.2
6       AB1 2020-01-01 02:00:00   2.1
7       AB1 2020-01-01 03:00:00   4.0
8       AC0 2020-01-01 00:00:00   4.9
9       AC0 2020-01-01 01:00:00  15.2
10      AC0 2020-01-01 02:00:00  21.3
11      AC0 2020-01-01 03:00:00  11.3
12      AD1 2020-01-01 00:00:00   6.1
13      AD1 2020-01-01 01:00:00  15.2
14      AD1 2020-01-01 02:00:00  20.3
15      AD1 2020-01-01 03:00:00  11.3
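Both answers do a Python-level lookup per row. As a possible alternative, not taken from either answer, the data can be reshaped so that each timestamp is a row and each location is a column. For every location, the columns are then reordered from nearest to farthest and back-filled along the row, which leaves the first valid value in the first column. This is only a sketch on the sample data from the question. The variable names wide, filled and stacked are made up for illustration, and the hourly step is written as pd.Timedelta(hours=1) rather than freq='H'.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
date = pd.date_range(start="2020-01-01", freq=pd.Timedelta(hours=1), periods=4)
locations = ["AA3", "AB1", "AD1", "AC0"]
x = [5.5, 10.2, np.nan, 2.3, 11.2, np.nan, 2.1, 4.0,
     6.1, np.nan, 20.3, 11.3, 4.9, 15.2, 21.3, np.nan]
df = pd.DataFrame({"x": x})
df.index = pd.MultiIndex.from_product([locations, date], names=["location", "date"])
df = df.sort_index()

nearest = pd.DataFrame({"AA3": ["AA3", "AB1", "AD1", "AC0"],
                        "AB1": ["AB1", "AA3", "AC0", "AD1"],
                        "AD1": ["AD1", "AC0", "AB1", "AA3"],
                        "AC0": ["AC0", "AD1", "AA3", "AB1"]})

# Wide layout: one row per timestamp, one column per location.
wide = df["x"].unstack("location")

filled = {}
for location in wide.columns:
    # Columns reordered from nearest to farthest for this location
    # (the location itself comes first in its own ranking).
    ranked = wide[nearest[location].tolist()]
    # Back-filling along each row moves the first valid value into column 0.
    filled[location] = ranked.bfill(axis=1).iloc[:, 0]

# Stack the per-location columns back into the (location, date) layout.
stacked = pd.concat(filled).rename_axis(["location", "date"])
result = df.assign(x=stacked.reindex(df.index))
print(result)
```

The loop runs once per location rather than once per row, so the number of Python-level iterations no longer grows with the number of timestamps. A timestamp where every location is missing simply stays NaN.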