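pandas / Python: add columns from other DataFrames

Time: 2017-05-28 08:12:55

Tags: python pandas dataframe

I want to build a df that has, at its end, some columns holding data taken from previous dfs. I want to check in the other dfs whether the IDs from the first column exist there and, if they do, find which columns they appear in and add that information to the last columns of the main df.

Example

My MAIN DF looks like this: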

names    col1   col2   col3   total
 bbb      V      V      X      2
 ccc      V      X      X      1
 zzz      X      V      V      2
 qqq      X      V      X      1
 rrr      X      X      V      1

For example, I have two more dfs (in general there are more than two, so I want to run over all of them in a loop).

DF1

names    col1   col4   col5   total
 bbb      V      V      X      2
 ccc      V      X      X      1
 yyy      V      V      X      2

DF2

names    col6   col2   col7   total
 bbb      V      V      X      2
 ccc      X      X      V      1
 zzz      X      V      V      2

So I want to update MAIN DF as follows:

names    col1   col2   col3   total   total_col1   total_col2
 bbb      V      V      X      2         DF1           DF1           
                                         DF2           DF2
 ccc      V      X      X      1         DF1                        
 zzz      X      V      V      2                       DF2     
 qqq      X      V      X      1
 rrr      X      X      V      1

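I hope this is possible in pandas and that the example is clear.

Edit, note on the columns: DF1 and DF2 contain other columns that are not in the original main DF, so I only want to add columns that are also in the original main DF.

2 answers:

Answer 0: (score: 1)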

You can use a more general version of another answer.

First create a list of DataFrames, dfs, and process each of them in a list comprehension. Then concat them together and finally use join:

import numpy as np
import pandas as pd

df_names = ['DF1', 'DF2']
cols = ['col1','col2','col3']
dfs = [DF1, DF2]

# replace V by the name of the source DataFrame and X by NaN
dfs = [x.set_index('names')[cols]
        .replace({'V':df_names[i], 'X':np.nan})
        .add_prefix('total_') for i, x in enumerate(dfs)]
DF_ALL = pd.concat(dfs)
print (DF_ALL)
      total_col1 total_col2 total_col3
names
bbb          DF1        DF1        NaN
ccc          DF1        NaN        NaN
yyy          DF1        DF1        NaN
bbb          DF2        DF2        NaN
ccc          NaN        NaN        DF2
zzz          NaN        DF2        DF2

df = df.join(DF_ALL, on='names')
print (df)
  names col1 col2 col3  total total_col1 total_col2 total_col3
0   bbb    V    V    X      2        DF1        DF1        NaN
0   bbb    V    V    X      2        DF2        DF2        NaN
1   ccc    V    X    X      1        DF1        NaN        NaN
1   ccc    V    X    X      1        NaN        NaN        DF2
2   zzz    X    V    V      2        NaN        DF2        DF2
3   qqq    X    V    X      1        NaN        NaN        NaN
4   rrr    X    X    V      1        NaN        NaN        NaN

EDIT:

Solution without a names column (the names are the index):

df_names = ['DF1', 'DF2']
cols = ['col1','col2','col3']
dfs = [DF1, DF2]

dfs = [x[cols].replace({'V':df_names[i], 'X':np.nan})
              .add_prefix('total_') for i, x in enumerate(dfs)]
# one row per name: join the non-missing source names with ', '
DF_ALL = pd.concat(dfs).groupby(level=0).agg(lambda x: ', '.join(x.dropna().tolist()))
print (DF_ALL)
    total_col1 total_col2 total_col3
bbb   DF1, DF2   DF1, DF2
ccc        DF1                   DF2
yyy        DF1        DF1
zzz                   DF2        DF2

df = pd.merge(df, DF_ALL, left_index=True, right_index=True, how='left')
df[DF_ALL.columns] = df[DF_ALL.columns].fillna('')
print (df)
    col1 col2 col3  total total_col1 total_col2 total_col3
bbb    V    V    X      2   DF1, DF2   DF1, DF2
ccc    V    X    X      1        DF1                   DF2
zzz    X    V    V      2                   DF2        DF2
qqq    X    V    X      1
rrr    X    X    V      1

EDIT1:

Solution that excludes columns: use drop with a list of columns and errors='ignore', so that no error is raised if a column is missing:

dfs = [DF1, DF2]
df_names = ['DF1', 'DF2']

exclude_cols = ['total','col_aaa']

# drop the excluded columns; columns that do not exist are skipped silently
dfs = [x.drop(exclude_cols, axis=1, errors='ignore')
        .replace({'V':df_names[i], 'X':np.nan})
        .add_prefix('total_') for i, x in enumerate(dfs)]
DF_ALL = pd.concat(dfs).groupby(level=0).agg(lambda x: ', '.join(x.dropna().tolist()))
print (DF_ALL)
    total_col1 total_col2 total_col3
bbb   DF1, DF2   DF1, DF2
ccc        DF1                   DF2
yyy        DF1        DF1
zzz                   DF2        DF2

df = pd.merge(df, DF_ALL, left_index=True, right_index=True, how='left')
df[DF_ALL.columns] = df[DF_ALL.columns].fillna('')
print (df)
    col1 col2 col3  total total_col1 total_col2 total_col3
bbb    V    V    X      2   DF1, DF2   DF1, DF2
ccc    V    X    X      1        DF1                   DF2
zzz    X    V    V      2                   DF2        DF2
qqq    X    V    X      1
rrr    X    X    V      1

EDIT2: Added filtering of the columns by intersection.
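EDIT2 mentions filtering the columns by intersection. A minimal sketch of how that might look is given here, assuming the names are the index and using the sample frames from the question (where DF1 and DF2 carry extra columns such as col4 or col6). The variable names keep, parts and result are illustrative and not taken from the answer, and Series.map is swapped in for replace so that anything other than V becomes NaN:

```python
import pandas as pd

# Sample frames from the question, with `names` as the index (an assumption).
df = pd.DataFrame(
    {"col1": ["V", "V", "X", "X", "X"],
     "col2": ["V", "X", "V", "V", "X"],
     "col3": ["X", "X", "V", "X", "V"],
     "total": [2, 1, 2, 1, 1]},
    index=["bbb", "ccc", "zzz", "qqq", "rrr"],
)
DF1 = pd.DataFrame(
    {"col1": ["V", "V", "V"], "col4": ["V", "X", "V"],
     "col5": ["X", "X", "X"], "total": [2, 1, 2]},
    index=["bbb", "ccc", "yyy"],
)
DF2 = pd.DataFrame(
    {"col6": ["V", "X", "X"], "col2": ["V", "X", "V"],
     "col7": ["X", "V", "V"], "total": [2, 1, 2]},
    index=["bbb", "ccc", "zzz"],
)

dfs = [DF1, DF2]
df_names = ["DF1", "DF2"]
exclude_cols = ["total"]

# Columns of the main frame that may be looked up in the other frames.
keep = df.columns.difference(exclude_cols)

parts = []
for name, x in zip(df_names, dfs):
    # Keep only the columns that also exist in the main frame.
    common = x.columns.intersection(keep)
    # Map V to the frame's name; every other value becomes NaN.
    part = x[common].apply(lambda s: s.map({"V": name}))
    parts.append(part.add_prefix("total_"))

# One row per name, with the source names joined by ', '.
DF_ALL = pd.concat(parts).groupby(level=0).agg(lambda s: ", ".join(s.dropna()))

result = df.join(DF_ALL, how="left")
result[DF_ALL.columns] = result[DF_ALL.columns].fillna("")
print(result)
```

With these frames only col1 (from DF1) and col2 (from DF2) survive the intersection, so the result gains total_col1 and total_col2 and no columns for col4 to col7.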

Answer 1: (score: 0)

This may work for you:

import pandas as pd
import numpy as np

MAIN_DF = [["bbb","V","V","X",2],
           ["ccc","V","X","X",1],
           ["zzz","X","V","V",2],
           ["qqq","X","V","X",1],
           ["rrr","X","X","V",1]]
MAIN_DF = pd.DataFrame(MAIN_DF, columns=["names", "col1","col2","col3","total"])

DF1 = [["bbb","V","V","X"],
       ["ccc","V","X","X"],
       ["yyy","V","V","X"]]
DF1 = pd.DataFrame(DF1, columns=["names", "col1","col2","col3"])
DF2 = [["bbb","V","V","X"],
       ["ccc","X","X","V"],
       ["zzz","X","V","V"]]
DF2 = pd.DataFrame(DF2, columns=["names", "col1","col2","col3"])


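# Result frame shaped like MAIN_DF without "total"; 0 means "not found".
# dtype object lets a cell hold either 0 or a string label.
total_col = pd.DataFrame(data=np.zeros((MAIN_DF.shape[0], MAIN_DF.shape[1]-1), dtype=int),
                         columns=["names", "col1","col2","col3"]).astype(object)
total_col["names"] = MAIN_DF["names"]

# Mark every cell where DF1 has a "V" for the same name.
# .loc is used instead of chained indexing so the assignment hits total_col itself.
for i in range(total_col.shape[0]):
    name = total_col.loc[i, "names"]
    for j in range(DF1.shape[0]):
        if DF1.loc[j, "names"] == name:
            for col in DF1.columns[1:]:
                if DF1.loc[j, col] == "V":
                    total_col.loc[i, col] = "DF1"

# Same for DF2, appending to an existing "DF1" label.
for i in range(total_col.shape[0]):
    name = total_col.loc[i, "names"]
    for j in range(DF2.shape[0]):
        if DF2.loc[j, "names"] == name:
            for col in DF2.columns[1:]:
                if DF2.loc[j, col] == "V":
                    if total_col.loc[i, col] == "DF1":
                        total_col.loc[i, col] = "DF1 DF2"
                    else:
                        total_col.loc[i, col] = "DF2"

print(total_col)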

  names     col1     col2 col3
0   bbb  DF1 DF2  DF1 DF2    0
1   ccc      DF1        0  DF2
2   zzz        0      DF2  DF2
3   qqq        0        0    0
4   rrr        0        0    0
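Since the question asks for a loop over more than two dfs, the two hard-coded passes above could be folded into one function. This is only a sketch: tag_sources and its parameters are hypothetical names, empty cells are left as empty strings instead of 0, and each frame is assumed to have unique values in names.

```python
import pandas as pd

def tag_sources(main, others, cols, key="names"):
    """Return a frame whose cells list the frames that have "V" for that name and column."""
    out = pd.DataFrame("", index=main.index, columns=cols)
    out.insert(0, key, main[key])
    for label, other in others.items():
        lookup = other.set_index(key)  # assumes unique names in each frame
        shared = [c for c in cols if c in lookup.columns]
        for i, name in main[key].items():
            if name not in lookup.index:
                continue
            for col in shared:
                if lookup.at[name, col] == "V":
                    # Append the label, separated by a space.
                    out.at[i, col] = (out.at[i, col] + " " + label).strip()
    return out

MAIN_DF = pd.DataFrame(
    [["bbb", "V", "V", "X", 2], ["ccc", "V", "X", "X", 1], ["zzz", "X", "V", "V", 2],
     ["qqq", "X", "V", "X", 1], ["rrr", "X", "X", "V", 1]],
    columns=["names", "col1", "col2", "col3", "total"],
)
DF1 = pd.DataFrame(
    [["bbb", "V", "V", "X"], ["ccc", "V", "X", "X"], ["yyy", "V", "V", "X"]],
    columns=["names", "col1", "col2", "col3"],
)
DF2 = pd.DataFrame(
    [["bbb", "V", "V", "X"], ["ccc", "X", "X", "V"], ["zzz", "X", "V", "V"]],
    columns=["names", "col1", "col2", "col3"],
)

result = tag_sources(MAIN_DF, {"DF1": DF1, "DF2": DF2}, cols=["col1", "col2", "col3"])
print(result)
```

Passing the frames as a dict keeps each label next to its data, so adding a third frame only means adding one more entry.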