Back-adding values in a pandas DataFrame based on the index of another DataFrame

Time: 2016-05-14 22:05:46

Tags: python pandas dataframe

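I have two pandas DataFrames and want to produce a result based on the following:

DataFrame 1 holds floats; the values in the second DataFrame do not matter. Both DataFrames have the same number of columns, but df1 has some extra rows because its index has more entries, scattered throughout the index.

Question

How do I get a DataFrame with the shape of df2 but the values of df1, under the constraint that if a df1 value sits at an index label that is not in df2, it must be added to the value at the previous valid (i.e. non-NaN) index of that column in df2? res_df shows the result derived from df1 and df2.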
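To make the rule concrete, here is a minimal sketch for a single label, using column 1 of the sample data shown further down. The label 4.0 exists in df1 but not in df2, so its value should be added to the value at the previous valid df2 label of that column. The names col_df1, col_df2, missing, target and back_added are only for illustration.

```python
import pandas as pd

# Column 1 of df1 (non-NaN entries only) and column 1 of df2.
col_df1 = pd.Series([2.0, 2.0, 2.0, 1.0, 1.0], index=[0.0, 2.0, 4.0, 6.0, 7.0])
col_df2 = pd.Series(['val'] * 4, index=[0.0, 2.0, 6.0, 7.0])

missing = 4.0  # present in df1, absent from df2

# Previous valid (non-NaN) label of this df2 column at or before the missing label.
target = col_df2.loc[:missing].last_valid_index()

# The df1 value at the missing label is added to the df1 value at that target label.
back_added = col_df1.loc[target] + col_df1.loc[missing]
print(target, back_added)
```

Here the target label is 2.0, and the back-added value 2 + 2 = 4 matches the entry at 2.0 in column 1 of the expected result.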

DataFrame 1

import pandas as pd

df1_col1 = pd.Series([2.5, .5, 1, 1, .5, .5, 2], index=[0.0, 2.5, 3.0, 4.0, 5.0, 5.5, 6])
df1_col2 = pd.Series([2, 2, 2, 1, 1], index=[0.0, 2.0, 4.0, 6.0, 7.0])
df1 = pd.concat([df1_col1, df1_col2], axis=1)

>>> df1
       0   1
0.0  2.5   2
2.0  NaN   2
2.5  0.5 NaN
3.0  1.0 NaN
4.0  1.0   2
5.0  0.5 NaN
5.5  0.5 NaN
6.0  2.0   1
7.0  NaN   1

DataFrame 2

df2_col1 = pd.Series(['val', 'val', 'val', 'val', 'val', 'val'], index=[0.0, 2.5, 3.0, 5.0, 5.5, 6])
df2_col2 = pd.Series(['val', 'val', 'val', 'val'], index=[0.0, 2.0, 6.0, 7.0])
df2 = pd.concat([df2_col1, df2_col2], axis=1)

>>> df2
       0    1
0.0  val  val
2.0  NaN  val
2.5  val  NaN
3.0  val  NaN
5.0  val  NaN
5.5  val  NaN
6.0  val  val
7.0  NaN  val

Expected result

res_col1 = pd.Series([2.5, .5, 2, .5, .5, 2], index=[0.0, 2.5, 3.0, 5.0, 5.5, 6])
res_col2 = pd.Series([2, 4, 1, 1], index=[0.0, 2.0, 6.0, 7.0])
res_df = pd.concat([res_col1, res_col2], axis=1)

>>> res_df
       0   1
0.0  2.5   2
2.0  NaN   4
2.5  0.5 NaN
3.0  2.0 NaN
5.0  0.5 NaN
5.5  0.5 NaN
6.0  2.0   1
7.0  NaN   1

I am using pandas 0.18.0 on Ubuntu Linux, and the solution needs to work with both Python 2.7.6 and Python 3.5.1. Thanks.

6 answers:

Answer 0 (score: 1)

# Track what's missing, we'll loop over these
isin = df1.index.isin(df2.index)
missidx = df1.index[~isin]

# Base case in preparation for back-add
res_df = df1.reindex_like(df2)

# For each missing index
for i in missidx:
    # iterate over df2 columns
    # because we need to capture
    # its last valid index prior
    # the missing index we've found
    for j, col in df2.iteritems():
        # look for last valid index prior to i
        lvi = col.loc[:i].last_valid_index()
        # take value in df1 (now in res_df)
        # at last valid index from df2
        # and add to it the value in df1
        # at the missing index i
        res_df.at[lvi, j] += df1.at[i, j]


As a tidy function

def pir_back_add(df1, df2):
    isin = df1.index.isin(df2.index)
    missidx = df1.index[~isin]

    res_df = df1.reindex_like(df2)

    for i in missidx:
        for j, col in df2.iteritems():
            lvi = col.loc[:i].last_valid_index()
            res_df.at[lvi, j] += df1.at[i, j]

    return res_df
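For readers on a recent pandas, here is a hedged variant of the same loop together with a check against the expected result from the question. It assumes a pandas version where DataFrame.items() is available (iteritems was removed in pandas 2.0), and it additionally skips NaN values in df1 so that they cannot overwrite a valid entry. The name back_add_loop and the directly constructed sample frames are mine, not part of the answer.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Sample data from the question, written out with explicit NaNs.
df1 = pd.DataFrame(
    {0: [2.5, nan, 0.5, 1.0, 1.0, 0.5, 0.5, 2.0, nan],
     1: [2.0, 2.0, nan, nan, 2.0, nan, nan, 1.0, 1.0]},
    index=[0.0, 2.0, 2.5, 3.0, 4.0, 5.0, 5.5, 6.0, 7.0])
df2 = pd.DataFrame(
    {0: ['val', nan, 'val', 'val', 'val', 'val', 'val', nan],
     1: ['val', 'val', nan, nan, nan, nan, 'val', 'val']},
    index=[0.0, 2.0, 2.5, 3.0, 5.0, 5.5, 6.0, 7.0])


def back_add_loop(df1, df2):
    """Loop-based back-add (hypothetical name), following the answer above."""
    res = df1.reindex_like(df2).copy()
    # Labels that exist in df1 but not in df2.
    for i in df1.index.difference(df2.index):
        for j, col in df2.items():
            value = df1.at[i, j]
            if pd.isna(value):
                continue  # nothing to add for this column
            # Last valid df2 label in this column at or before i.
            target = col.loc[:i].last_valid_index()
            if target is not None:
                res.at[target, j] += value
    return res


result = back_add_loop(df1, df2)
print(result)
```

On the sample data this gives 2.0 at label 3.0 in column 0 and 4.0 at label 2.0 in column 1, as in res_df.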

Timing

On the sample data, my solution is by far the fastest, roughly 3 to 7 times faster than the others.

piRSquared
1000 loops, best of 3: 677 µs per loop
Kartik
100 loops, best of 3: 3.06 ms per loop
ptrj
100 loops, best of 3: 4.55 ms per loop
Alberto Garcia-Raboso
100 loops, best of 3: 2.81 ms per loop
Alex
100 loops, best of 3: 2.28 ms per loop

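Answer 1 (score: 1)

Thinking about my other answer, I realized there is a better way to handle this. You still want to use pd.cut() to build categories out of df1.index, but you want to create the bins separately for each column, using only those labels of df2.index at which that column of df1 is not NaN. Here is the code.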

from __future__ import print_function
import pandas as pd

df1_col1 = pd.Series([2.5, .5, 1, 1, .5, .5, 2],
                     index=[0.0, 2.5, 3.0, 4.0, 5.0, 5.5, 6])
df1_col2 = pd.Series([2, 2, 2, 1, 1],
                     index=[0.0, 2.0, 4.0, 6.0, 7.0])
df1 = pd.concat([df1_col1, df1_col2], axis=1)

df2_col1 = pd.Series(['val', 'val', 'val', 'val', 'val', 'val'],
                     index=[0.0, 2.5, 3.0, 5.0, 5.5, 6])
df2_col2 = pd.Series(['val', 'val', 'val', 'val'],
                     index=[0.0, 2.0, 6.0, 7.0])
df2 = pd.concat([df2_col1, df2_col2], axis=1)

res_df = pd.DataFrame(index=df2.index)
for col, values in df1.iteritems():
    bin_bdrys = list(df1[col].dropna().index.intersection(df2.index))
    bin_bdrys.append(df2.index[-1] + 1)
    bins = pd.cut(df1.index, bin_bdrys, right=False, labels=bin_bdrys[:-1])
    res_df[col] = df1[col].groupby(bins).sum().reindex_like(df2)
print(res_df)

Output:

       0    1
0.0  2.5  2.0
2.0  NaN  4.0
2.5  0.5  NaN
3.0  2.0  NaN
5.0  0.5  NaN
5.5  0.5  NaN
6.0  2.0  1.0
7.0  NaN  1.0
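As a hedged illustration of what the per-column binning does, here is a sketch for column 1 only. It differs from the code above in one respect: it asks pd.cut for integer bin positions (labels=False) and maps them back to the bin start labels, so the grouping key is a plain float array rather than a categorical. The names col, df2_index, edges, codes and bin_start are only for illustration.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Column 1 of df1 from the question, and the index of df2.
col = pd.Series([2.0, 2.0, nan, nan, 2.0, nan, nan, 1.0, 1.0],
                index=[0.0, 2.0, 2.5, 3.0, 4.0, 5.0, 5.5, 6.0, 7.0])
df2_index = pd.Index([0.0, 2.0, 2.5, 3.0, 5.0, 5.5, 6.0, 7.0])

# Bin edges: df2 labels at which this df1 column holds a value,
# plus one closing edge beyond the last label.
edges = sorted(col.dropna().index.intersection(df2_index))
edges.append(df2_index[-1] + 1)

# With labels=False, pd.cut returns the integer position of each row's bin.
codes = pd.cut(col.index.to_numpy(), edges, right=False, labels=False)
bin_start = np.asarray(edges[:-1])[codes]

# Sum within each half-open bin [start, next start), then align to df2's index.
summed = col.groupby(bin_start).sum().reindex(df2_index)
print(summed)
```

The edges come out as 0, 2, 6, 7 and 8, so the value at label 4.0 lands in the bin that starts at 2.0 and is summed into it.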

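Answer 2 (score: 1)

I don't see an elegant way to do this in the present situation. An idiomatic and elegant solution is more likely to be found at some stage before df1 is created.

Here, it seems the only way is to iterate over the columns of df1. If df1.index contains relatively few extra elements, your solution will be quite fast. If df1.index.difference(df2.index) is large, the following groupby trick may be useful:

Say s1 and s2 are columns of df1 and df2, respectively: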

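import numpy as np
import pandas as pd

s1 = pd.Series(list(range(7)), index=[1.0, 1.5, 1.6, 2.0, 3.0, 3.5, 4.0])
# A scalar is broadcast over the index; a one-element list would not match its length
s2 = pd.Series(1, index=[1.0, 2.0, 3.0, 4.0])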
s1
Out[198]: 
1.0    0
1.5    1
1.6    2
2.0    3
3.0    4
3.5    5
4.0    6
dtype: int64

Create a temporary Series s to group by. For every entry of s1, s holds the most recent valid s2 index label.

s = pd.Series(np.nan, index=s1.index)
s[s2.index] = s2.index
s = s.ffill()

s
Out[202]: 
1.0    1.0
1.5    1.0
1.6    1.0
2.0    2.0
3.0    3.0
3.5    3.0
4.0    4.0
dtype: float64

The trick then works as follows (note that the index of the result is s2.index):

s1.groupby(s).sum()
Out[1203]: 
1.0    3
2.0    3
3.0    9
4.0    6
dtype: int64

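The whole code

NaNs must be handled with care. I infer from the description and from your solution that the NaNs in df1 and df2 are in essentially the same positions. If not, the code may need some modification.

I also assume that the indexes of df1 and df2 are monotonic and contain unique values.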

# Filling nan's that may interfere with the results
extra_idx = df1.index.difference(df2.index)
df1.loc[extra_idx] = df1.loc[extra_idx].fillna(0)
# If nan's in df1 and df2 coincide, the following would also work:
# df1 = df1.fillna(0)

result_cols = []
s = pd.Series(index=df1.index, dtype=float)

for col in df1.columns:
    c1 = df1[col]
    c2 = df2[col].dropna()

    s[:] = np.nan
    s[c2.index] = c2.index
    s = s.ffill()

    out_col = c1.groupby(s).sum()
    result_cols.append(out_col)

result = pd.concat(result_cols, axis=1)

With df1 and df2 of shapes (10000, 10) and (7000, 10), this is almost 100 times faster than the solution discussed above.
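The approach can also be wrapped in a function and checked against the question's sample data. This is only a sketch under a few assumptions of mine: the name back_add_groupby is hypothetical, the grouping keys are built with np.where instead of label assignment, and sum(min_count=1) is used so that a group containing only NaN stays NaN instead of becoming 0 on recent pandas versions.

```python
import numpy as np
import pandas as pd

nan = np.nan
df1 = pd.DataFrame(
    {0: [2.5, nan, 0.5, 1.0, 1.0, 0.5, 0.5, 2.0, nan],
     1: [2.0, 2.0, nan, nan, 2.0, nan, nan, 1.0, 1.0]},
    index=[0.0, 2.0, 2.5, 3.0, 4.0, 5.0, 5.5, 6.0, 7.0])
df2 = pd.DataFrame(
    {0: ['val', nan, 'val', 'val', 'val', 'val', 'val', nan],
     1: ['val', 'val', nan, nan, nan, nan, 'val', 'val']},
    index=[0.0, 2.0, 2.5, 3.0, 5.0, 5.5, 6.0, 7.0])


def back_add_groupby(df1, df2):
    """Per-column forward-fill and groupby (hypothetical helper name)."""
    out = {}
    for col in df1.columns:
        valid = df2[col].dropna().index
        # Keep a df1 label where df2 has a valid entry, NaN elsewhere, then
        # forward-fill so each row carries the latest valid df2 label.
        keys = pd.Series(np.where(df1.index.isin(valid), df1.index, np.nan),
                         index=df1.index).ffill()
        # min_count=1 keeps all-NaN groups as NaN.
        out[col] = df1[col].groupby(keys).sum(min_count=1)
    return pd.DataFrame(out).reindex(df2.index)


result = back_add_groupby(df1, df2)
print(result)
```

As in the answer, this relies on both indexes being monotonic and unique.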

Answer 3 (score: 0)

This is a workaround, but I solved the problem with the following function:

def back_add(df1, df2):
    cols1 = [df1.iloc[:, x].dropna() for x in range(len(df1.columns))]
    cols2 = [df2.iloc[:, x].dropna() for x in range(len(df2.columns))]

    for i, ser in enumerate(cols1):
        for j, val in enumerate(ser):
            if ser.index[j] not in cols2[i].index:
                ser.at[ser.iloc[:j].last_valid_index()] += val
                ser.iat[j] = float('nan')
        # Store the cleaned column back; rebinding ser alone has no effect on cols1
        cols1[i] = ser.dropna()
    return pd.concat(cols1, axis=1).dropna(how='all')

It seems like there should be a more elegant way to do this.

Answer 4 (score: 0)

pd.cut() lets you divide df1.index into intervals whose edges come from df2.index. You can then group by those intervals and sum.

from __future__ import print_function
import numpy as np
import pandas as pd

df1_col1 = pd.Series([2.5, .5, 1, 1, .5, .5, 2],
                     index=[0.0, 2.5, 3.0, 4.0, 5.0, 5.5, 6])
df1_col2 = pd.Series([2, 2, 2, 1, 1],
                     index=[0.0, 2.0, 4.0, 6.0, 7.0])
df1 = pd.concat([df1_col1, df1_col2], axis=1)

df2_col1 = pd.Series(['val', 'val', 'val', 'val', 'val', 'val'],
                     index=[0.0, 2.5, 3.0, 5.0, 5.5, 6])
df2_col2 = pd.Series(['val', 'val', 'val', 'val'],
                     index=[0.0, 2.0, 6.0, 7.0])
df2 = pd.concat([df2_col1, df2_col2], axis=1)

bin_bdrys = list(df2.index)
bin_bdrys.append(df2.index[-1] + 1)
bins = pd.cut(df1.index, bin_bdrys, right=False, labels=df2.index)

res_df = df1.groupby(bins).sum()

This gets you almost what you want:

print(res_df)

#        0    1
# 0.0  2.5  2.0
# 2.0  NaN  2.0
# 2.5  0.5  NaN
# 3.0  2.0  2.0
# 5.0  0.5  NaN
# 5.5  0.5  NaN
# 6.0  2.0  1.0
# 7.0  NaN  1.0

The problem is that df1.loc[4.0, 1] has been added to res_df.loc[3.0, 1], even though df1.loc[3.0, 1] is NaN. You can easily identify where this happens:

incorrect = (res_df.notnull() & df1.isnull()).dropna()
print(incorrect)

#          0      1
# 0.0  False  False
# 2.0  False  False
# 2.5  False  False
# 3.0  False   True
# 5.0  False  False
# 5.5  False  False
# 6.0  False  False
# 7.0  False  False

Now let's correct it:

# Iterate over columns
for col, values in incorrect.iteritems():
    # Get the indices of the entries that are wrong
    old_idx = values.nonzero()[0]
    # Get the valid indices
    valid_idx = df1[col].notnull().nonzero()[0]
    # Get the previous valid index for each wrong entry
    new_idx = np.searchsorted(valid_idx, old_idx) - 1
    # Add the wrong entry to the correct position, and `NaN` the former
    for i, j in zip(old_idx, new_idx):
        res_df.iloc[j, col] += res_df.iloc[i, col]
        res_df.iloc[i, col] = np.nan

print(res_df)

#        0    1
# 0.0  2.5  2.0
# 2.0  NaN  4.0
# 2.5  0.5  NaN
# 3.0  2.0  NaN
# 5.0  0.5  NaN
# 5.5  0.5  NaN
# 6.0  2.0  1.0
# 7.0  NaN  1.0

It is instructive to look at the values of the different variables in the for loop above. The first column has no errors, so

old_idx = []
valid_idx = [0, 2, 3, 4, 5, 6, 7]
new_idx = []

(You could add an if statement to skip this column.) For the second column, we get

old_idx = [3]
valid_idx = [0, 1, 4, 7, 8]
new_idx = [1]

So res_df.iloc[3, 1] is added to res_df.iloc[1, 1], and res_df.iloc[3, 1] is reset to NaN.
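The np.searchsorted step can be looked at in isolation. The sketch below uses the second-column values listed above plus one extra, made-up query position (5) that is not part of the answer. Note that searchsorted returns a position within valid_idx; indexing valid_idx with it gives the row position itself. For the value used in the answer the two happen to coincide.

```python
import numpy as np

valid_idx = np.array([0, 1, 4, 7, 8])  # positions holding a valid value
old_idx = np.array([3, 5])             # 3 is from the answer, 5 is a made-up extra

# Position, within valid_idx, of the last valid entry before each query.
pos = np.searchsorted(valid_idx, old_idx) - 1

# The corresponding row positions.
prev_valid = valid_idx[pos]
print(pos, prev_valid)
```

For the query 3, both pos and prev_valid are 1. For the extra query 5, pos is 2 while the previous valid row position is 4.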

Answer 5 (score: 0)

Let's try this:

# Step 1: Merge df1 and df2 on df2 (to make the shape the same):
df_merge = df2.join(df1, lsuffix='_x', rsuffix='_y')

# Step 2: Bit of indexing elbow grease:
for col in df2.columns:
    non_nan = df_merge[str(col)+'_x'].notnull()
    df_merge.loc[non_nan,str(col)+'_x'] = df_merge.loc[non_nan,str(col)+'_y']

# Step 3: Drop the columns that came from df1 (suffix '_y'):
df1_cols = [str(col)+'_y' for col in df1.columns]
df_merge.drop(df1_cols, axis=1, inplace=True)
df_merge.columns = df2.columns

Does this cover all of your use cases?
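One way to check is to compare against the expected result from the question. The sketch below does not run the join itself. It assumes the steps above amount to taking the df1 values at df2's index labels wherever df2 is not NaN, and writes that as a single masking expression. The names masked, expected and differs are only for illustration.

```python
import numpy as np
import pandas as pd

nan = np.nan
df1 = pd.DataFrame(
    {0: [2.5, nan, 0.5, 1.0, 1.0, 0.5, 0.5, 2.0, nan],
     1: [2.0, 2.0, nan, nan, 2.0, nan, nan, 1.0, 1.0]},
    index=[0.0, 2.0, 2.5, 3.0, 4.0, 5.0, 5.5, 6.0, 7.0])
df2 = pd.DataFrame(
    {0: ['val', nan, 'val', 'val', 'val', 'val', 'val', nan],
     1: ['val', 'val', nan, nan, nan, nan, 'val', 'val']},
    index=[0.0, 2.0, 2.5, 3.0, 5.0, 5.5, 6.0, 7.0])
expected = pd.DataFrame(
    {0: [2.5, nan, 0.5, 2.0, 0.5, 0.5, 2.0, nan],
     1: [2.0, 4.0, nan, nan, nan, nan, 1.0, 1.0]},
    index=df2.index)

# df1 values at df2's labels, kept only where df2 holds a value.
masked = df1.reindex(df2.index).where(df2.notna())

# Cells that differ from the expected result, treating NaN == NaN as equal.
differs = ~((masked == expected) | (masked.isna() & expected.isna()))
n_mismatch = int(differs.to_numpy().sum())
print(masked[differs.any(axis=1)])
print(n_mismatch)
```

Under that assumption, two cells differ on the sample data: label 3.0 in column 0 and label 2.0 in column 1. The values at label 4.0 are dropped rather than back-added, so an extra step would still be needed for that part of the question.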