Fill NaN values with values contained in another DataFrame

Time: 2019-12-01 20:31:45

Tags: python pandas dataframe

Problem

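I have a DataFrame left with a 2-level MultiIndex. It represents events tpc that occur at a point onset within a time region mc. Every event takes place in a layer defined by (staff, voice):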

            mc  onset  staff  voice  tpc  dynamics  chords
section ix
0       0    0      0      2      1    0       NaN     NaN
        1    0      0      2      1    0       NaN     NaN
        2    0      0      1      1    0       NaN     NaN
        3    0      0      1      1    4       NaN     NaN
        4    0      0      1      1    1       NaN     NaN
        5    0      0      1      1    0       NaN     NaN
        6    0    3/4      2      2    1       NaN     NaN
        7    0    3/4      2      1    1       NaN     NaN

Then there is a second DataFrame right holding other events ('dynamics', 'chords') that need to be filled into left:

   mc onset  staff  voice dynamics chords
0   0     0      1      1        f    NaN
1   0     0      1      1      NaN      I
2   0   1/2      2      1        p    NaN
3   0   3/4      1      1      NaN     I6
4   0   3/4      2      1      NaN    I64

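The rules for filling are as follows:

  1. All events from right need to appear in left.
  2. If they occur simultaneously with left events in the same layer, fill in the corresponding columns of left for these events (i.e., join on ['mc', 'onset', 'staff', 'voice']; e.g. rows 0, 1, 4).
  3. Otherwise, if they occur simultaneously with left events in the same staff, fill in the corresponding columns of left for these events (i.e., join on ['mc', 'onset', 'staff']; e.g. row 4).
  4. Otherwise, if they occur simultaneously with left events in other layers, fill in the corresponding columns of left for these events (i.e., join on ['mc', 'onset']; e.g. row 3).
  5. Otherwise, if they do not occur simultaneously with any left event, emit a warning and keep them for further processing (e.g. row 2).
  6. If two events of the same type in right occur simultaneously, emit a warning and concatenate the values (e.g. rows 3 and 4).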
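
The cascade in rules 2 to 5 can be sketched as a lookup that tries the join keys from most to least specific and reports the first level at which an event coincides with something in left. This is only an illustration on toy data: JOIN_LEVELS, match_level and toy_left are hypothetical names that do not appear in my actual code.

```python
import pandas as pd
from fractions import Fraction

# Join keys ordered from most to least specific (rules 2, 3 and 4).
JOIN_LEVELS = [['mc', 'onset', 'staff', 'voice'],
               ['mc', 'onset', 'staff'],
               ['mc', 'onset']]

def match_level(event, left):
    """Return the first key list under which `event` coincides with a row
    of `left`, or None if nothing is simultaneous (rule 5)."""
    for keys in JOIN_LEVELS:
        mask = pd.Series(True, index=left.index)
        for k in keys:
            mask = mask & (left[k] == event[k])
        if mask.any():
            return keys
    return None

# Toy stand-in for `left` (assumption: only the four key columns matter here).
toy_left = pd.DataFrame({
    'mc':    [0, 0, 0],
    'onset': [Fraction(0), Fraction(0), Fraction(3, 4)],
    'staff': [2, 1, 2],
    'voice': [1, 1, 2],
})

same_layer = match_level({'mc': 0, 'onset': Fraction(0), 'staff': 1, 'voice': 1}, toy_left)
same_staff = match_level({'mc': 0, 'onset': Fraction(3, 4), 'staff': 2, 'voice': 1}, toy_left)
other_layer = match_level({'mc': 0, 'onset': Fraction(3, 4), 'staff': 1, 'voice': 1}, toy_left)
no_match = match_level({'mc': 0, 'onset': Fraction(1, 2), 'staff': 2, 'voice': 1}, toy_left)
```

Events are passed as plain dicts here to keep the sketch short; a row taken from right would work the same way as long as it supports lookup by column name.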

Expected result

     mc onset  staff  voice  tpc dynamics chords
0 0   0     0      2      1    0      NaN    NaN
  1   0     0      2      1    0      NaN    NaN
  2   0     0      1      1    0      f        I
  3   0     0      1      1    4      f        I
  4   0     0      1      1    1      f        I
  5   0     0      1      1    0      f        I
  6   0   3/4      2      2    1      NaN     I6
  7   0   3/4      2      1    1      NaN  I6I64
WARNING: These events could not be attached:
   mc onset  staff  voice dynamics chords
2   0   1/2      2      1        p    NaN
WARNING: These events are simultaneous:
   mc onset  staff  voice dynamics chords
3   0   3/4      1      1      NaN     I6
4   0   3/4      2      1      NaN    I64

Attempt 1

Since I wanted to avoid an approach that iterates over right, I tried the following:

left_features = ['mc', 'onset', 'staff', 'voice']
right_features = ['dynamics', 'chords']
join_on = [['mc', 'onset', 'staff', 'voice'], ['mc', 'onset', 'staff'], ['mc', 'onset']]
for on in join_on:
    match = right[on + right_features].merge(left[left_features], on=on, left_index=True)
    left_ix = match.index
    left.loc[left_ix, match.columns] = match
    # left.loc[left_ix].fillna(match, inplace=True)
    right_ix = right.merge(left[left_features], on=on, right_index=True).index
    right.drop(right_ix, inplace=True)
    if len(right) == 0:
        break
if len(right) > 0:
    print("WARNING: These events could not be attached:")
    print(right)

This approach does not work because, after the first merge, match looks like this:

     mc onset  staff  voice dynamics chords  tpc
0 2   0     0      1      1        f    NaN    0
  3   0     0      1      1        f    NaN    4
  4   0     0      1      1        f    NaN    1
  5   0     0      1      1        f    NaN    0
  2   0     0      1      1      NaN      I    0
  3   0     0      1      1      NaN      I    4
  4   0     0      1      1      NaN      I    1
  5   0     0      1      1      NaN      I    0
  7   0   3/4      2      1      NaN    I64    1

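Because the index of match is not unique, assigning match to left does not fully work (dynamics is missing from the result), and the commented-out approach using fillna silently does nothing. In addition, I have to perform the same merge twice: once with left_index so that the assignment lines up, and once with right_index so that the matched rows can be dropped.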
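
A likely reason for the silent fillna is that left.loc[left_ix] hands back a new object, so an in-place fill changes that temporary object and never reaches left. The sketch below reproduces the effect on a toy frame; frame, rows and selection are hypothetical names, and a scalar fill value stands in for match.

```python
import numpy as np
import pandas as pd

# Toy frame with two missing chord labels (assumption: not the real data).
frame = pd.DataFrame({'tpc': [0, 4, 1], 'chords': [np.nan, np.nan, 'V']})
rows = [0, 1]

# Selecting with a list of labels yields a separate object ...
selection = frame.loc[rows]
# ... so filling it in place changes `selection` only.
selection.fillna({'chords': 'I'}, inplace=True)
still_missing = int(frame['chords'].isna().sum())

# Writing the filled values back through .loc makes the change stick.
frame.loc[rows, 'chords'] = frame.loc[rows, 'chords'].fillna('I')
```

After the first fill, still_missing counts the NaN values that remain in frame; only the assignment through .loc on the last line actually modifies it.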

Attempt 2

Faced with these problems, I preprocessed right before joining in order to combine simultaneous events into a single row:

def unite_vals(df):
    r = pd.Series(index=right_features)
    for col in right_features:
        u = df[col][df[col].notna()].unique()
        if len(u) > 1:
            r[col] = ''.join(str(val) for val in u)
            print(f"WARNING:Two simultaneous events in row {df.iloc[0].name}")
        elif len(u) == 1:
            r[col] = u[0]
    return r

left_features = ['mc', 'onset', 'staff', 'voice']
right_features = ['dynamics', 'chords']
on = ['mc', 'onset']
right = right.groupby(on).apply(unite_vals).reset_index()
match = right.merge(left[left_features], on=on, left_index=True)
left_ix = match.index
left.loc[left_ix, match.columns] = match
# left.loc[left_ix].fillna(match, inplace=True)
right_ix = right.merge(left[left_features], on=on, right_index=True).index
right.drop(right_ix, inplace=True)
if len(right) > 0:
    print("WARNING: These events could not be attached:")
    print(right)

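(For some unknown reason, the commented-out approach using fillna again does nothing, and the problem of performing the same merge twice remains.) The result is one I could live with, but it cannot distinguish between the layers in right, so it looks like this: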

     mc onset  staff  voice  tpc dynamics chords
0 0   0     0      2      1    0        f      I
  1   0     0      2      1    0        f      I
  2   0     0      1      1    0        f      I
  3   0     0      1      1    4        f      I
  4   0     0      1      1    1        f      I
  5   0     0      1      1    0        f      I
  6   0   3/4      2      2    1      NaN  I6I64
  7   0   3/4      2      1    1      NaN  I6I64
WARNING:Two simultaneous events at:
   mc onset
3   0   3/4
WARNING: These events could not be attached:
   mc onset dynamics chords
1   0   1/2        p    NaN
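
If the layers should stay distinguishable, one option might be to keep staff (and possibly voice) in the grouping key and let agg join the non-null values per column. This is a sketch on toy data, not a tested replacement for unite_vals; join_notna, toy_right and united are hypothetical names.

```python
import numpy as np
import pandas as pd
from fractions import Fraction

def join_notna(col):
    """Concatenate the distinct non-null values of a column; NaN if there are none."""
    vals = col.dropna().unique()
    return ''.join(str(v) for v in vals) if len(vals) else np.nan

# Toy stand-in for `right` (assumption: voice is left out for brevity).
toy_right = pd.DataFrame({
    'mc':       [0, 0, 0, 0],
    'onset':    [Fraction(0), Fraction(0), Fraction(3, 4), Fraction(3, 4)],
    'staff':    [1, 1, 1, 2],
    'dynamics': ['f', np.nan, np.nan, np.nan],
    'chords':   [np.nan, 'I', 'I6', 'I64'],
})

# Grouping by staff as well keeps events from different staves apart.
keys = ['mc', 'onset', 'staff']
united = (toy_right.groupby(keys, sort=False)[['dynamics', 'chords']]
          .agg(join_notna)
          .reset_index())
```

With staff in the key, 'I6' and 'I64' stay on separate rows at this stage, so concatenating them (rule 6) would have to happen later, once both end up on the same row of left.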

How is this usually solved?

Here is the source code to reproduce the problem:

import pandas as pd
import numpy as np
from fractions import Fraction
left_dict = {'mc': {(0, 0): 0,
  (0, 1): 0,
  (0, 2): 0,
  (0, 3): 0,
  (0, 4): 0,
  (0, 5): 0,
  (0, 6): 0,
  (0, 7): 0},
 'onset': {(0, 0): Fraction(0, 1),
  (0, 1): Fraction(0, 1),
  (0, 2): Fraction(0, 1),
  (0, 3): Fraction(0, 1),
  (0, 4): Fraction(0, 1),
  (0, 5): Fraction(0, 1),
  (0, 6): Fraction(3, 4),
  (0, 7): Fraction(3, 4)},
 'staff': {(0, 0): 2,
  (0, 1): 2,
  (0, 2): 1,
  (0, 3): 1,
  (0, 4): 1,
  (0, 5): 1,
  (0, 6): 2,
  (0, 7): 2},
 'voice': {(0, 0): 1,
  (0, 1): 1,
  (0, 2): 1,
  (0, 3): 1,
  (0, 4): 1,
  (0, 5): 1,
  (0, 6): 2,
  (0, 7): 1},
 'tpc': {(0, 0): 0,
  (0, 1): 0,
  (0, 2): 0,
  (0, 3): 4,
  (0, 4): 1,
  (0, 5): 0,
  (0, 6): 1,
  (0, 7): 1},
 'dynamics': {(0, 0): np.nan,
  (0, 1): np.nan,
  (0, 2): np.nan,
  (0, 3): np.nan,
  (0, 4): np.nan,
  (0, 5): np.nan,
  (0, 6): np.nan,
  (0, 7): np.nan},
 'chords': {(0, 0): np.nan,
  (0, 1): np.nan,
  (0, 2): np.nan,
  (0, 3): np.nan,
  (0, 4): np.nan,
  (0, 5): np.nan,
  (0, 6): np.nan,
  (0, 7): np.nan}}
left = pd.DataFrame.from_dict(left_dict)

right_dict = {'mc': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0},
 'onset': {0: Fraction(0, 1),
  1: Fraction(0, 1),
  2: Fraction(1, 2),
  3: Fraction(3, 4),
  4: Fraction(3, 4)},
 'staff': {0: 1, 1: 1, 2: 2, 3: 1, 4: 2},
 'voice': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1},
 'dynamics': {0: 'f', 1: np.nan, 2: 'p', 3: np.nan, 4: np.nan},
 'chords': {0: np.nan, 1: 'I', 2: np.nan, 3: 'I6', 4: 'I64'}}
right = pd.DataFrame.from_dict(right_dict)

attempt1 = True
if attempt1:
    left_features = ['mc', 'onset', 'staff', 'voice', 'tpc']
    right_features = ['dynamics', 'chords']
    join_on = [['mc', 'onset', 'staff', 'voice'], ['mc', 'onset', 'staff'], ['mc', 'onset']]
    for on in join_on:
        match = right[on + right_features].merge(left[left_features], on=on, left_index=True)
        left_ix = match.index
        left.loc[left_ix, match.columns] = match
        #left.loc[left_ix].fillna(match, inplace=True)
        right_ix = right.merge(left[left_features], on=on, right_index=True).index
        right.drop(right_ix, inplace=True)
        if len(right) == 0:
            break
    if len(right) > 0:
        print("WARNING: These events could not be attached:")
        print(right)
    print(left)
else:
    def unite_vals(df):
        r = pd.Series(index=right_features)
        for col in right_features:
            u = df[col][df[col].notna()].unique()
            if len(u) > 1:
                r[col] = ''.join(str(val) for val in u)
                print("WARNING:Two simultaneous events at:")
                print(df.iloc[:1][['mc', 'onset']])
            elif len(u) == 1:
                r[col] = u[0]
        return r

    left_features = ['mc', 'onset', 'staff', 'voice']
    right_features = ['dynamics', 'chords']
    on = ['mc', 'onset']
    right = right.groupby(on).apply(unite_vals).reset_index()
    match = right.merge(left[left_features], on=on, left_index=True)
    left_ix = match.index
    left.loc[left_ix, match.columns] = match
    # left.loc[left_ix].fillna(match, inplace=True)
    right_ix = right.merge(left[left_features], on=on, right_index=True).index
    right.drop(right_ix, inplace=True)
    if len(right) > 0:
        print("WARNING: These events could not be attached:")
        print(right)
    print(left)

1 Answer:

Answer 0: (score: 0)

It turned out that the simplest way to solve my problem was to use a loop:

# NaN is the only value that is not equal to itself
isnan = lambda num:  num != num
right_features = ['dynamics', 'chords']
for i, r in right.iterrows():
    # all left events that share mc and onset with the current right event
    same_os = left.loc[(left.mc == r.mc) & (left.onset == r.onset)]
    if len(same_os) > 0:
        same_staff = same_os.loc[same_os.staff == r.staff]
        same_voice = same_staff.loc[same_staff.voice == r.voice]
        # pick the most specific non-empty match: layer, then staff, then onset only
        if len(same_voice) > 0:
            fill = same_voice
        elif len(same_staff) > 0:
            fill = same_staff
        else:
            fill = same_os

        for f in right_features:
            if not isnan(r[f]):
                F = left.loc[fill.index, f]
                notna = F.notna()
                if notna.any():
                    # some target cells already hold a value: append there, set the rest
                    print(f"WARNING:Feature existed and was concatenated: {F[notna]}")
                    left.loc[F[notna].index, f] += r[f]
                    left.loc[F[~notna].index, f] = r[f]
                else:
                    left.loc[fill.index, f] = r[f]
    else:
        print(f"WARNING:Event could not be attached: {r}")
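
To collect every event that cannot be attached in one pass, rather than warning about them one at a time, a left merge with indicator=True might be enough, and it would also avoid running the same merge twice. The following is a sketch on toy data; toy_left, toy_right, left_keys, flagged and unattached are hypothetical names.

```python
import numpy as np
import pandas as pd
from fractions import Fraction

# Toy stand-ins for `left` and `right` (assumption: reduced to a few columns).
toy_left = pd.DataFrame({'mc': [0, 0],
                         'onset': [Fraction(0), Fraction(3, 4)],
                         'tpc': [0, 1]})
toy_right = pd.DataFrame({'mc': [0, 0, 0],
                          'onset': [Fraction(0), Fraction(1, 2), Fraction(3, 4)],
                          'dynamics': ['f', 'p', np.nan]})

keys = ['mc', 'onset']
# Unique key combinations of left, so the merge cannot multiply rows of right.
left_keys = toy_left[keys].drop_duplicates()

# how='left' keeps every row of toy_right in its original order;
# the '_merge' column says whether a partner was found in left_keys.
flagged = toy_right.merge(left_keys, on=keys, how='left', indicator=True)
unattached = toy_right[(flagged['_merge'] == 'left_only').to_numpy()]
```

Because the merged frame has the same length and order as toy_right, the boolean mask can be applied positionally, which is why it is converted with to_numpy() before indexing.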