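I have a dataset with a large number of columns, and I only want to backfill rows that are missing a value, using values from existing rows. The logic I am trying to apply is: if 'school' and 'country' are the same strings, fill the empty 'state' with the 'state' value from a matching row.
Here is an example. The problem is that my attempt combines the other rows, and I am trying to avoid having to split the rows. Is there a way to do this? Thanks!
Sample data: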
import pandas as pd
school = ['Univ of CT','Univ of CT','Oxford','Oxford','ABC Univ']
name = ['John','Matt','John','Ashley','John']
country = ['US','US','UK','UK','']
state = ['CT','','','ENG','']
df = pd.DataFrame({'school':school,'country':country,'state':state,'name':name})
df['school'] = df['school'].str.upper()
The data above gives the following preview:
school      country  state  name
UNIV OF CT  US       CT     John
UNIV OF CT  US              Matt
OXFORD      UK              John
OXFORD      UK       ENG    Ashley
ABC UNIV                    John
I am looking for output like this:
school      country  state  name
UNIV OF CT  US       CT     John
UNIV OF CT  US       CT     Matt
OXFORD      UK       ENG    John
OXFORD      UK       ENG    Ashley
ABC UNIV                    John
The code I tried:
df = df.fillna('')
df = df.reset_index().groupby(['school','country']).agg(';'.join)
df = pd.DataFrame(df).reset_index()
len(df)
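Answer 0 (score: 1)
You can write a small function that looks up the state from the rows with the same school and country whenever a row's state is blank.
def find_state(school, country, state):
    # Keep a state that is already filled in
    if len(state) > 0:
        return state
    # Otherwise collect the states of every row with the same school and country
    found_state = df['state'][(df['school'] == school) & (df['country'] == country)]
    # max() returns a non-empty state if there is one, since '' sorts before any other string
    return max(found_state)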
So the complete example looks like this:
import pandas as pd
school = ['Univ of CT','Univ of CT','Oxford','Oxford','ABC Univ']
name = ['John','Matt','John','Ashley','John']
country = ['US','US','UK','UK','']
state = ['CT','','','ENG','']
df = pd.DataFrame({'school':school,'country':country,'state':state,'name':name})
df['school'] = df['school'].str.upper()
def find_state(school, country, state):
    if len(state) > 0:
        return state
    found_state = df['state'][(df['school'] == school) & (df['country'] == country)]
    return max(found_state)
df['state_new'] = [find_state(school, country, state) for school, country, state in
                   df[['school','country','state']].values]
print(df)
       school country state    name state_new
0  UNIV OF CT      US    CT    John        CT
1  UNIV OF CT      US          Matt        CT
2      OXFORD      UK          John       ENG
3      OXFORD      UK   ENG  Ashley       ENG
4    ABC UNIV                  John
Answer 1 (score: 0)
Try this: first convert the blank strings to NaN, then simply use ffill() and bfill() within each group.
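import numpy as np
df = pd.DataFrame({'school':school,'country':country,'state':state,'name':name})
df['school'] = df['school'].str.upper()
# Turn blank strings into NaN so ffill/bfill treat them as missing
df['state'] = df['state'].astype(str).replace('',np.nan)
# Fill forward, then backward, inside each (school, country) group so values never cross groups
df['state'] = df.groupby(['school', 'country'])['state'].transform(lambda x: x.ffill().bfill())
print(df)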
school      country  state  name
UNIV OF CT  US       CT     John
UNIV OF CT  US       CT     Matt
OXFORD      UK       ENG    John
OXFORD      UK       ENG    Ashley
ABC UNIV             NaN    John
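The output above leaves NaN for ABC UNIV, while the question asks for a blank there. The following is a minimal self-contained sketch of the same grouped ffill/bfill idea, with a final fillna('') step added as an assumption about how to get back to blanks. The variable names `state` and `filled` are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'school': ['Univ of CT', 'Univ of CT', 'Oxford', 'Oxford', 'ABC Univ'],
    'country': ['US', 'US', 'UK', 'UK', ''],
    'state': ['CT', '', '', 'ENG', ''],
    'name': ['John', 'Matt', 'John', 'Ashley', 'John'],
})
df['school'] = df['school'].str.upper()

# Treat blank states as missing so the fill methods can see them
state = df['state'].replace('', np.nan)

# Fill forward, then backward, inside each (school, country) group only
filled = state.groupby([df['school'], df['country']]).transform(
    lambda s: s.ffill().bfill()
)

# Groups with no known state at all (ABC UNIV here) go back to a blank string
df['state'] = filled.fillna('')
print(df)
```

Doing both fills inside one lambda keeps them within the group. If a group holds more than one distinct state, each blank takes the nearest earlier value, or the nearest later one when nothing precedes it.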