Working with datasets and filtering data using Pandas

Time: 2018-04-03 09:00:22

Tags: python pandas dataframe

I am trying to filter a dataset using values collected from another dataset.

For example:

dataset1

Name          Group    Colour    Age   Title   JobRole
John Smith    1        NaN       NaN   NaN     NaN
John Smith    2        NaN       NaN   NaN     NaN
John Smith    3        NaN       NaN   NaN     NaN
James Man     1        NaN       NaN   NaN     NaN
.....

dataset2

Name          Colour    Age   Title   JobRole
John Smith    Red       35    Mr      SuperMan
James Man     Orange    21    Mr      SuperMan
.....

I want to take each name in dataset1 and use it to filter dataset2. The end goal is to fill the NaN values in dataset1 with the matching data from dataset2.

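I'm having trouble with the filtering: I've tried a few approaches, but every one of them produces an empty DataFrame.

What I've tried so far: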

import pandas as pd
import numpy as np
groupDf = pd.read_excel("dataset1.xlsx")
newDf = pd.read_excel("dataset2.xlsx")


for name in newDf['Name']:
    filtered_data = groupDf[groupDf.Name == name]
    print(filtered_data)

Output:

Empty DataFrame
Columns: [Name, Group]
Index: []
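
The spreadsheets themselves are not shown, so the cause of the empty result cannot be determined from the post. One common possibility, offered here only as an assumption, is that the Name values differ by invisible characters such as trailing spaces. The sketch below uses made-up frames (group_df and new_df are hypothetical stand-ins for the two Excel files) to show how such a mismatch could be spotted and normalised before filtering:

```python
import pandas as pd

# Hypothetical data: the names in one frame carry a trailing space,
# as can happen with cells read from a spreadsheet.
group_df = pd.DataFrame({
    "Name": ["John Smith ", "John Smith ", "James Man "],
    "Group": [1, 2, 1],
})
new_df = pd.DataFrame({"Name": ["John Smith", "James Man"]})

# Exact comparison finds nothing because "John Smith " != "John Smith".
raw_matches = group_df[group_df["Name"].isin(new_df["Name"])]

# repr() makes hidden whitespace in the keys visible.
key_reprs = [repr(name) for name in group_df["Name"].unique()]

# Strip surrounding whitespace from both key columns, then filter again.
group_df["Name"] = group_df["Name"].str.strip()
new_df["Name"] = new_df["Name"].str.strip()
clean_matches = group_df[group_df["Name"].isin(new_df["Name"])]

print(len(raw_matches), len(clean_matches))
```

If the keys already match exactly, stripping changes nothing and the cause lies elsewhere, for example in a different column name or a header row that was read as data.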

1 answer:

Answer 0 (score: 0)

I think you need merge with a left join, followed by combine_first to replace the NaNs:

# left join df2 onto df1; columns that exist in both get '_' appended for the df2 copy
df = df1.merge(df2, on='Name', how='left', suffixes=('','_'))

# select the names of the appended columns (those ending with '_')
new_cols = df.columns[df.columns.str.endswith('_')]

# remove the last character to recover the original column names
orig_cols = new_cols.str[:-1]
# dictionary for renaming the appended columns back to the original names
d = dict(zip(new_cols, orig_cols))

# fill NaNs in the original columns with values from the appended columns
df[orig_cols] = df[orig_cols].combine_first(df[new_cols].rename(columns=d))
# remove the appended columns
df = df.drop(new_cols, axis=1)
# print(df)
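
To try the approach without the Excel files, here is a self-contained sketch of the same merge and combine_first steps. The two frames are built by hand from the rows shown in the question, which is an assumption about what dataset1.xlsx and dataset2.xlsx contain:

```python
import numpy as np
import pandas as pd

# Sample frames modelled on the tables in the question (assumed contents).
df1 = pd.DataFrame({
    "Name": ["John Smith", "John Smith", "John Smith", "James Man"],
    "Group": [1, 2, 3, 1],
    "Colour": np.nan,
    "Age": np.nan,
    "Title": np.nan,
    "JobRole": np.nan,
})
df2 = pd.DataFrame({
    "Name": ["John Smith", "James Man"],
    "Colour": ["Red", "Orange"],
    "Age": [35, 21],
    "Title": ["Mr", "Mr"],
    "JobRole": ["SuperMan", "SuperMan"],
})

# Left join keeps every row of df1; overlapping df2 columns get a '_' suffix.
df = df1.merge(df2, on="Name", how="left", suffixes=("", "_"))

# Pair each suffixed column with the original column it should fill.
new_cols = df.columns[df.columns.str.endswith("_")]
orig_cols = new_cols.str[:-1]
rename_map = dict(zip(new_cols, orig_cols))

# Fill NaNs in the original columns from the suffixed ones, then drop them.
df[orig_cols] = df[orig_cols].combine_first(df[new_cols].rename(columns=rename_map))
df = df.drop(new_cols, axis=1)

print(df)
```

Each of the three John Smith rows should pick up the same values from df2, because the left join repeats the matching df2 row for every df1 row with that name. Note that any column in df1 whose own name already ends in '_' would also be picked up by the endswith filter, so a different suffix may be safer for such data.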