Putting a WHERE clause on a Pandas merge

Time: 2017-06-15 04:55:52

Tags: python pandas merge where

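I have two pandas DataFrames that I'm trying to merge on three different keys... sort of. Each DataFrame has a gender column and a country_destination column, and I want to do an outer join on those. One DataFrame has an age_bucket column, a string representing an age range (e.g. 45-49, 50-54, 55-59), which I have turned into a list in another column using the pandas apply method. My question: when you join two DataFrames on multiple keys, can you also apply a WHERE clause somewhere, so that you can join on columns that don't share exactly the same data type? For example, can I say "join these tables on the gender and country_destination columns, where the user's age is in the list stored in age_gender's age_list column"?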

import pandas as pd

age_gender = pd.read_csv('data/age_gender_bkts.csv')
users = pd.read_csv('data/train_users_2.csv')

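def getAgeList(row):
    # Drop the '+' from open-ended buckets such as '100+'
    clean_age = row['age_bucket'].replace('+', '')
    min_max = clean_age.split('-')

    if len(min_max) > 1:
        # '45-49' -> [45, 46, 47, 48, 49]
        return list(range(int(min_max[0]), int(min_max[1]) + 1))
    # Single value: return it as an int so every list holds the same type
    return [int(min_max[0])]

age_gender['age_list'] = age_gender.apply(getAgeList, axis=1)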

combined_df = pd.merge(users, age_gender, on=['country_destination', 'gender'])

users.columns

Index(['id', 'date_account_created', 'timestamp_first_active',
       'date_first_booking', 'gender', 'age', 'signup_method', 'signup_flow',
       'language', 'affiliate_channel', 'affiliate_provider',
       'first_affiliate_tracked', 'signup_app', 'first_device_type',
       'first_browser', 'country_destination', 'lat_destination',
       'lng_destination', 'distance_km', 'destination_km2',
       'destination_language ', 'language_levenshtein_distance'],
      dtype='object')

age_gender.columns

Index(['age_bucket', 'country_destination', 'gender',
       'population_in_thousands', 'year', 'age_list'],
      dtype='object')

DataFrame examples: (two screenshots, not reproduced here)

1 answer:

Answer 0 (score: 4)

I think you need to expand the rows by the values in the age_list column, so that each age gets its own row, and then merge:

import numpy as np
import pandas as pd

# length of each list in age_list
l = age_gender['age_list'].str.len()
# all columns except age_list
cols = age_gender.columns.difference(['age_list'])
# repeat each row's values once per list element, in a new DataFrame
df = pd.DataFrame({col: np.repeat(age_gender[col].values, l) for col in cols})
# flatten the lists; converting to int is necessary, otherwise the merge keys do not match
df['age'] = np.concatenate(age_gender['age_list'].values).astype(int)

# inner merge is the default, so how='inner' is omitted
df1 = pd.merge(df, users, on=['age', 'country_destination'])
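
For readers on pandas 0.25 or later, the same expand-then-merge idea can probably be written more compactly with DataFrame.explode. The following is only a minimal sketch on made-up toy data, not the asker's real CSV files: the rows, the helper name bucket_to_ages, and the choice to also merge on gender are assumptions for illustration, and it presumes the gender values are encoded identically in both frames.

```python
import pandas as pd

# Hypothetical toy data standing in for the real CSV files.
age_gender = pd.DataFrame({
    "age_bucket": ["45-49", "50-54"],
    "country_destination": ["US", "US"],
    "gender": ["male", "male"],
    "population_in_thousands": [100, 200],
})
users = pd.DataFrame({
    "id": ["a", "b", "c"],
    "gender": ["male", "male", "male"],
    "age": [47, 52, 30],
    "country_destination": ["US", "US", "US"],
})


def bucket_to_ages(bucket):
    """Turn '45-49' into [45, ..., 49]; an open bucket like '100+' becomes [100]."""
    parts = bucket.replace("+", "").split("-")
    low, high = int(parts[0]), int(parts[-1])
    return list(range(low, high + 1))


# Store the list of ages directly in a column named like the users' key.
age_gender["age"] = age_gender["age_bucket"].map(bucket_to_ages)

# explode creates one row per list element; the result is object dtype,
# so cast the exploded column to int before merging.
expanded = (
    age_gender.explode("age")
    .astype({"age": "int64"})
    .reset_index(drop=True)
)

# Inner merge on all three keys (assumed here; the answer above uses two).
merged = users.merge(expanded, on=["gender", "country_destination", "age"])
print(merged[["id", "age", "age_bucket"]])
```

In this toy data the user aged 30 falls in no bucket, so the inner merge drops that row; passing how='left' would keep it with missing bucket columns.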