Collapse a column and insert new rows?

Time: 2016-12-19 14:29:03

Tags: python pandas

My data:

df
Out[79]: 
    INC Theme Theme_Hat TRAIN_TEST
0   123     A       NaN      TRAIN
1   124     A       NaN      TRAIN
2   125     A       NaN      TRAIN
3   126     A       NaN      TRAIN
4   127     A       NaN      TRAIN
5   128     A       NaN      TRAIN
6   129     A       NaN      TRAIN
7   130     A       NaN      TRAIN
8   131     B       NaN      TRAIN
9   132     B         B       TEST
10  133     B         A       TEST
11  134     B         A       TEST
12  135     B         A       TEST

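I'm trying to collapse the Theme_Hat column into the Theme column while retaining the TRAIN_TEST indicator. I've used a for loop below, but my gut tells me there must be a more pandas-esque solution. The attempt below does not produce my desired output, because TEST keeps getting repeated in the df instead of the TRAIN information being retained. Here is my desired output: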

Out[81]: 
    INC Theme TRAIN_TEST
0   123     A      TRAIN
1   124     A      TRAIN
2   125     A      TRAIN
3   126     A      TRAIN
4   127     A      TRAIN
5   128     A      TRAIN
6   129     A      TRAIN
7   130     A      TRAIN
8   131     B      TRAIN
9   132     B      TRAIN
10  132     B      TEST
11  133     B      TRAIN
12  133     A      TEST
13  134     B      TRAIN
14  134     A      TEST
15  135     B      TRAIN
16  135     A      TEST

Here is what I have done so far:

# copy so we can reference the original dataframe as rows are inserted into df
df2 = df.copy(deep = True)
no_nulls = df2[df2['Theme_Hat'].notnull()]

# get rid of the Theme_Hat column for final dataframe (since we're migrating that info into Theme)
df.drop('Theme_Hat', inplace = True, axis = 1)

# I'm sure there's some pandas built-in functionality that 
# can handle this better than a for loop
for idx in no_nulls.index:
    # reference the unchanged df2 for INC, Theme_Hat, and TRAIN_TEST info
    new_row = pd.DataFrame({"INC": df2.loc[idx, 'INC'], 
                            "Theme": df2.loc[idx, 'Theme_Hat'],
                            "TRAIN_TEST": df2.loc[idx, 'TRAIN_TEST']}, index = [idx+1])
    print(new_row, '\n\n')

    # insert the new row right after the row at the current index
    # .ix was removed in pandas 1.0; .loc gives the same label-based, inclusive slices here
    df = pd.concat([df.loc[:idx], new_row, df.loc[idx+1:]]).reset_index(drop = True)
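
For comparison, here is a minimal loop-free sketch that reproduces the desired output shown above. It rests on an assumption inferred from that output and not stated elsewhere in the thread: every original Theme value should be labelled TRAIN, while each non-null Theme_Hat value becomes an extra row that keeps its own TRAIN_TEST label. The names train_rows, hat_rows and collapsed are illustrative only.

```python
import numpy as np
import pandas as pd

# Rebuild the frame from the question so the sketch is self-contained.
df = pd.DataFrame({
    "INC": list(range(123, 136)),
    "Theme": ["A"] * 8 + ["B"] * 5,
    "Theme_Hat": [np.nan] * 9 + ["B", "A", "A", "A"],
    "TRAIN_TEST": ["TRAIN"] * 9 + ["TEST"] * 4,
})

# Assumption: every original Theme value is treated as a TRAIN row.
train_rows = df[["INC", "Theme"]].assign(TRAIN_TEST="TRAIN")

# Each non-null Theme_Hat becomes an extra row that keeps its own label.
hat_rows = (df.loc[df["Theme_Hat"].notna(), ["INC", "Theme_Hat", "TRAIN_TEST"]]
              .rename(columns={"Theme_Hat": "Theme"}))

# A stable sort keeps the Theme row ahead of the Theme_Hat row for each INC.
collapsed = (pd.concat([train_rows, hat_rows])
               .sort_values("INC", kind="stable")
               .reset_index(drop=True))
print(collapsed)
```

If the Theme rows should instead keep whatever TRAIN_TEST value they already have, dropping the assign step and selecting TRAIN_TEST along with INC and Theme would be the natural change.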

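2 answers:

Answer 0 (score: 2)

Use pd.lreshape, which drops NaNs automatically by default. It stacks the two columns in question so that their values end up in a single column. Finally, sort the result by the INC column values.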

pd.lreshape(df, {'Theme': ['Theme','Theme_Hat']}).sort_values('INC').reset_index(drop=True)
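
To make the one-liner easier to try, here is a self-contained sketch that rebuilds the question's frame and applies the same pd.lreshape call. The variable names long_df and result are illustrative, and the stable sort is an added assumption so that each Theme row stays ahead of its Theme_Hat row.

```python
import numpy as np
import pandas as pd

# Rebuild the frame from the question so the example is self-contained.
df = pd.DataFrame({
    "INC": list(range(123, 136)),
    "Theme": ["A"] * 8 + ["B"] * 5,
    "Theme_Hat": [np.nan] * 9 + ["B", "A", "A", "A"],
    "TRAIN_TEST": ["TRAIN"] * 9 + ["TEST"] * 4,
})

# lreshape stacks the listed columns into one 'Theme' column and,
# with the default dropna=True, discards rows whose value is NaN.
long_df = pd.lreshape(df, {"Theme": ["Theme", "Theme_Hat"]})

# A stable sort keeps the Theme row ahead of the Theme_Hat row for each INC.
result = (long_df.sort_values("INC", kind="stable")
                 .reset_index(drop=True))
print(result)
```

Note that TRAIN_TEST is carried over unchanged, so both rows for INC 132 to 135 are labelled TEST.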


Answer 1 (score: 1)

You can use set_index with stack:

print (df.set_index(['INC','TRAIN_TEST'])
         .stack()
         .reset_index(level=2, drop=True)
         .reset_index(name='Theme'))
    INC TRAIN_TEST Theme
0   123      TRAIN     A
1   124      TRAIN     A
2   125      TRAIN     A
3   126      TRAIN     A
4   127      TRAIN     A
5   128      TRAIN     A
6   129      TRAIN     A
7   130      TRAIN     A
8   131      TRAIN     B
9   132       TEST     B
10  132       TEST     B
11  133       TEST     B
12  133       TEST     A
13  134       TEST     B
14  134       TEST     A
15  135       TEST     B
16  135       TEST     A

Another solution uses melt, removing the extra column with drop and the NaN rows with dropna.