I have two DataFrames, TRAIN and TEST. I want to modify TRAIN by adding every column that is in TEST but not in TRAIN (Y2, Y3), filled with zeros.
TRAIN = pd.DataFrame({'X' : [1,1,1,1,1,2,2,2,2,2],
                      'Y1': [1,1,1,1,1,1,0,0,0,0],
                      'Y4': [1,1,0,0,0,0,0,0,0,0]})
TEST = pd.DataFrame({'X' : [1,1,1,1,1,2,2,2,2],
                     'Y1': [1,1,0,1,0,1,0,0,0],
                     'Y2': [1,0,1,0,1,0,1,0,1],
                     'Y3': [1,1,0,1,1,0,0,0,0],
                     'Y4': [1,1,0,1,1,0,0,0,0]})
I want:
TRAIN = pd.DataFrame({'X' : [1,1,1,1,1,2,2,2,2,2],
                      'Y1': [1,1,1,1,1,1,0,0,0,0],
                      'Y4': [1,1,1,1,1,1,0,0,0,0],
                      'Y2': [0,0,0,0,0,0,0,0,0,0],
                      'Y3': [0,0,0,0,0,0,0,0,0,0]})
I tried:
L_TRAIN = list(TRAIN)
L_TEST = list(TEST)
def Diff(li1, li2):
    li_dif = [i for i in li1 + li2 if i not in li1]
    return li_dif
L_DIFF = Diff(L_TRAIN, L_TEST)
TRAIN[L_DIFF] = 0
But I got:
KeyError: "['Y2' 'Y3'] not in index"
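Answer 0 (score: 2)
pandas (at least the version that raises this error) does not support assigning a value to several columns that do not exist yet with TRAIN[L_DIFF] = 0, so you need to add them one at a time: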
import pandas as pd

TRAIN = pd.DataFrame({'X' : [1,1,1,1,1,2,2,2,2,2],
                      'Y1': [1,1,1,1,1,1,0,0,0,0],
                      'Y4': [1,1,0,0,0,0,0,0,0,0]})
TEST = pd.DataFrame({'X' : [1,1,1,1,1,2,2,2,2],
                     'Y1': [1,1,0,1,0,1,0,0,0],
                     'Y2': [1,0,1,0,1,0,1,0,1],
                     'Y3': [1,1,0,1,1,0,0,0,0],
                     'Y4': [1,1,0,1,1,0,0,0,0]})

# Columns that exist in TEST but not in TRAIN
diff_cols = set(TEST.columns) - set(TRAIN.columns)

# Add each missing column to TRAIN, filled with 0.
# sorted() fixes the order of the new columns, since a set has no defined order.
for i in sorted(diff_cols):
    TRAIN[i] = 0

print(TRAIN)
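If you would rather avoid the loop, one possible alternative (a sketch, not part of the original answer) is DataFrame.reindex with fill_value=0, which adds all missing columns in a single step. The name missing_cols is only illustrative. Building the list from TEST.columns instead of a set keeps the new columns in TEST's order, and the resulting frame should be the same as the one the loop produces.

```python
import pandas as pd

TRAIN = pd.DataFrame({'X' : [1,1,1,1,1,2,2,2,2,2],
                      'Y1': [1,1,1,1,1,1,0,0,0,0],
                      'Y4': [1,1,0,0,0,0,0,0,0,0]})
TEST = pd.DataFrame({'X' : [1,1,1,1,1,2,2,2,2],
                     'Y1': [1,1,0,1,0,1,0,0,0],
                     'Y2': [1,0,1,0,1,0,1,0,1],
                     'Y3': [1,1,0,1,1,0,0,0,0],
                     'Y4': [1,1,0,1,1,0,0,0,0]})

# Columns present in TEST but missing from TRAIN, in TEST's column order
# (missing_cols is a hypothetical name used for this sketch)
missing_cols = [c for c in TEST.columns if c not in TRAIN.columns]

# reindex keeps the existing columns unchanged and creates the missing ones,
# filling every cell of the new columns with 0
TRAIN = TRAIN.reindex(columns=list(TRAIN.columns) + missing_cols, fill_value=0)

print(TRAIN)
```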
Output:
   X  Y1  Y4  Y2  Y3
0  1   1   1   0   0
1  1   1   1   0   0
2  1   1   0   0   0
3  1   1   0   0   0
4  1   1   0   0   0
5  2   1   0   0   0
6  2   0   0   0   0