The scenario is a bit complicated, so let me explain.
I have the following DataFrame:
import pandas as pd
data = [['CAROLINA GEORGE SCHOOL',['carolina','george'], ['school']],
['CAROLINA KINDER SCHOOL',['carolina','kinder'],['school']],
['GEORGE KINDER SCHOOL',['george','kinder'],['school']],
['CAROLINA SCHOOL',['carolina'], ['school']],
['GEORGE SCHOOL',['george'],['school']],
['GEORGE EDUCATION',['george'],['education']]
]
df = pd.DataFrame(data,columns=['name','first','second'])
df['len'] = df['first'].str.len()
df.sort_values(by='len', inplace=True)
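The first column holds the full name. That name is split into two parts, which are stored as lists in the next two columns. The idea is to find the root, or most basic form, of those lists and get rid of it.
For example, given the lists ['george','kinder'] and ['carolina','george'], their parent is ['george'], because george appears in both lists. Likewise, given ['carolina','george'] and ['carolina','kinder'], their parent is ['carolina'].
This is the simplest case; a parent may consist of several elements rather than just one.
The idea is to find the parent and get rid of it. I am not sure whether a DataFrame is the best way to approach this problem.
The base DataFrame looks like this: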
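One way to read the "parent" idea above is as the set of elements shared by every list. The following is only a minimal sketch under that assumption; the helper name common_parent is hypothetical and does not come from the question.

```python
from functools import reduce


def common_parent(lists):
    """Return the sorted elements shared by every list in `lists`."""
    if not lists:
        return []
    # Intersect the lists pairwise, treating each one as a set
    shared = reduce(set.intersection, (set(items) for items in lists))
    return sorted(shared)


print(common_parent([['george', 'kinder'], ['carolina', 'george']]))    # ['george']
print(common_parent([['carolina', 'george'], ['carolina', 'kinder']]))  # ['carolina']
```

Because the result is a set intersection, it also covers the case where the parent has more than one element.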
name first second len
3 CAROLINA SCHOOL [carolina] [school] 1
4 GEORGE SCHOOL [george] [school] 1
5 GEORGE EDUCATION [george] [education] 1
0 CAROLINA GEORGE SCHOOL [carolina, george] [school] 2
1 CAROLINA KINDER SCHOOL [carolina, kinder] [school] 2
2 GEORGE KINDER SCHOOL [george, kinder] [school] 2
The expected result is:
name first second len
5 GEORGE EDUCATION [george] [education] 1
0 CAROLINA GEORGE SCHOOL [carolina, george] [school] 2
1 CAROLINA KINDER SCHOOL [carolina, kinder] [school] 2
2 GEORGE KINDER SCHOOL [george, kinder] [school] 2
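Note that the GEORGE EDUCATION row is kept, because its second column holds a list whose value is education rather than school. In other words, a parent should only be removed when its second column has the same value.
Thanks.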
Answer 0 (score: 0)
You need to split the first column into c1 and c2, then merge the potential parents with the children on those names.
Here is working code:
import pandas as pd
data = [['CAROLINA GEORGE SCHOOL',['carolina','george'], ['school']],
['CAROLINA KINDER SCHOOL',['carolina','kinder'],['school']],
['GEORGE KINDER SCHOOL',['george','kinder'],['school']],
['CAROLINA SCHOOL',['carolina'], ['school']],
['GEORGE SCHOOL',['george'],['school']],
['GEORGE EDUCATION',['george'],['education']]
]
df = pd.DataFrame(data,columns=['name','first','second'])
df['len'] = df['first'].str.len()
df.sort_values(by='len', inplace=True)
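# First element of every `first` list
df['c1'] = df['first'].transform(lambda x: x[0])
# Potential parents (one element) and children (two elements)
pdf = df[df['len'] == 1]
cdf = df[df['len'] == 2].copy()  # copy, so that adding c2 does not write to a slice of df
cdf['c2'] = cdf['first'].transform(lambda x: x[1])
# Parents whose name matches either element of some child
rdf = pd.concat([pdf.merge(cdf[['c1']], on='c1', how='inner'),
                 pdf.merge(cdf[['c2']], left_on='c1', right_on='c2', how='inner')], axis=0)
# Parenthesize the comparison: & binds more tightly than ==
print(df[~(df['c1'].isin(rdf['c1']) & (df['len'] == 1))])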
Output:
name first second len c1
0 CAROLINA GEORGE SCHOOL [carolina, george] [school] 2 carolina
1 CAROLINA KINDER SCHOOL [carolina, kinder] [school] 2 carolina
2 GEORGE KINDER SCHOOL [george, kinder] [school] 2 george
PS: Feel free to drop or ignore the extra columns in the final output.
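The output above does not contain the GEORGE EDUCATION row that the question expects to keep, because the merge only compares names and never looks at the second column. The following is a rough sketch of one possible alternative, not the answer's method. It assumes that a row counts as a parent when its first list is a proper subset of another row's first list and both rows have the same second list. The function name drop_parents is hypothetical.

```python
import pandas as pd


def drop_parents(df):
    """Drop rows whose `first` is a proper subset of another row's `first`
    while both rows share the same `second` (assumed reading of the rule)."""
    firsts = [frozenset(items) for items in df['first']]
    seconds = [tuple(items) for items in df['second']]
    keep = []
    for i in range(len(df)):
        is_parent = any(
            firsts[i] < firsts[j] and seconds[i] == seconds[j]  # `<` is proper subset
            for j in range(len(df)) if j != i
        )
        keep.append(not is_parent)
    return df.loc[keep]


data = [['CAROLINA GEORGE SCHOOL', ['carolina', 'george'], ['school']],
        ['CAROLINA KINDER SCHOOL', ['carolina', 'kinder'], ['school']],
        ['GEORGE KINDER SCHOOL', ['george', 'kinder'], ['school']],
        ['CAROLINA SCHOOL', ['carolina'], ['school']],
        ['GEORGE SCHOOL', ['george'], ['school']],
        ['GEORGE EDUCATION', ['george'], ['education']]]
df = pd.DataFrame(data, columns=['name', 'first', 'second'])
df['len'] = df['first'].str.len()
df = df.sort_values(by='len', kind='stable')  # stable sort keeps the original order within ties

result = drop_parents(df)
print(result)
```

The subset test also handles parents with more than one element. The pairwise comparison is quadratic in the number of rows, which is fine for small frames but may need grouping by second for larger ones.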