Pandas: fill rows if two column strings are the same

Time: 2019-08-23 17:44:23

Tags: python-3.x pandas

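I have a dataset with a large number of columns, and I only want to backfill rows that are missing a value which another row already has. The logic I am trying to apply: if 'school' and 'country' hold the same strings in two rows, fill the empty 'state' cell with the 'state' value from the matching row.

Here is an example. The problem is that my attempt combines the other rows, whereas I want to keep the rows as they are. Is there a way to do this? Thanks!

Sample data: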

import pandas as pd
school = ['Univ of CT','Univ of CT','Oxford','Oxford','ABC Univ']
name = ['John','Matt','John','Ashley','John']
country = ['US','US','UK','UK','']
state = ['CT','','','ENG','']
df = pd.DataFrame({'school':school,'country':country,'state':state,'name':name})
df['school'] = df['school'].str.upper()

The data above gives the following preview:

school      country state   name
UNIV OF CT  US      CT     John
UNIV OF CT  US             Matt
OXFORD      UK             John
OXFORD      UK      ENG    Ashley
ABC UNIV                   John

I am looking for output like this:

school      country state   name
UNIV OF CT  US      CT     John
UNIV OF CT  US      CT     Matt
OXFORD      UK      ENG    John
OXFORD      UK      ENG    Ashley
ABC UNIV                   John

The code I tried:

df = df.fillna('')
df = df.reset_index().groupby(['school','country']).agg(';'.join) 
df = pd.DataFrame(df).reset_index()
len(df)

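2 answers:

Answer 0: (score: 1)

You can write a small function that, whenever the state is blank, looks up the state from the rows with the same school and country.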

def find_state(school, country, state):
    if len(state) > 0:
        return state
    found_state = df['state'][(df['school'] == school) & (df['country'] == country)]
    return max(found_state)

So the complete example looks like this:

import pandas as pd
school = ['Univ of CT','Univ of CT','Oxford','Oxford','ABC Univ']
name = ['John','Matt','John','Ashley','John']
country = ['US','US','UK','UK','']
state = ['CT','','','ENG','']
df = pd.DataFrame({'school':school,'country':country,'state':state,'name':name})
df['school'] = df['school'].str.upper()

def find_state(school, country, state):
    if len(state) > 0:
        return state
    found_state = df['state'][(df['school'] == school) & (df['country'] == country)]
    return max(found_state)

df['state_new'] = [find_state(school, country, state) for school, country, state in 
                   df[['school','country','state']].values]
print(df)

    school       country  state  name     state_new
0   UNIV OF CT    US       CT    John     CT
1   UNIV OF CT    US             Matt     CT
2   OXFORD        UK             John     ENG
3   OXFORD        UK       ENG   Ashley   ENG
4   ABC UNIV                     John   
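
The list comprehension above filters the whole DataFrame once for every blank row. As a minimal sketch of the same idea, the lookup can be built once and reused. This assumes each (school, country) pair has at most one distinct non-empty state; the names `known` and `lookup` are illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'school': ['UNIV OF CT', 'UNIV OF CT', 'OXFORD', 'OXFORD', 'ABC UNIV'],
    'country': ['US', 'US', 'UK', 'UK', ''],
    'state': ['CT', '', '', 'ENG', ''],
    'name': ['John', 'Matt', 'John', 'Ashley', 'John'],
})

# Build a (school, country) -> state lookup once, using only rows that have a state.
known = df[df['state'] != '']
lookup = known.groupby(['school', 'country'])['state'].first().to_dict()

# Keep an existing state; otherwise look it up, falling back to an empty string.
df['state_new'] = [
    state if state else lookup.get((school, country), '')
    for school, country, state in zip(df['school'], df['country'], df['state'])
]
print(df)
```

Rows whose (school, country) pair has no known state, such as ABC UNIV, stay blank, and the number of rows is unchanged.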

Answer 1: (score: 0)

Try first converting the empty strings to NaN, then simply use ffill() and bfill():

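import numpy as np

# school, country, state and name are the lists defined in the question
df = pd.DataFrame({'school':school,'country':country,'state':state,'name':name})
df['school'] = df['school'].str.upper()

# Empty strings become NaN so that ffill/bfill treat them as missing
df['state'] = df['state'].astype(str).replace('',np.nan)
# Fill forward, then backward, inside each (school, country) group
df['state'] = df.groupby(['school', 'country'])['state'].transform(lambda x: x.ffill().bfill())
print(df)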

    school country state    name
UNIV OF CT      US    CT    John
UNIV OF CT      US    CT    Matt
    OXFORD      UK   ENG    John
    OXFORD      UK   ENG  Ashley
  ABC UNIV           NaN    John
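
The output above leaves NaN for ABC UNIV, while the question asks for a blank. A minimal sketch of one way to get the blank back, assuming it is acceptable to take the first non-missing state of each group: use `transform('first')` and then restore the empty strings. The intermediate names `state` and `group_state` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'school': ['UNIV OF CT', 'UNIV OF CT', 'OXFORD', 'OXFORD', 'ABC UNIV'],
    'country': ['US', 'US', 'UK', 'UK', ''],
    'state': ['CT', '', '', 'ENG', ''],
    'name': ['John', 'Matt', 'John', 'Ashley', 'John'],
})

# Treat empty strings as missing so the groupby can skip them.
state = df['state'].replace('', np.nan)

# 'first' picks the first non-missing value per group; transform broadcasts
# it back to every row, so the result has the same length as df.
group_state = state.groupby([df['school'], df['country']]).transform('first')

# Keep existing values, fill gaps from the group, and restore blanks where
# the whole group has no state.
df['state'] = state.fillna(group_state).fillna('')
print(df)
```

Unlike `agg`, `transform` returns one value per original row, which is why no rows are merged.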