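Remove rows from a dataframe based on list values

Time: 2020-11-12 18:22:49

Tags: python pandas dataframe tree

The scenario is a bit complex, so let me explain.

I have a dataframe like the following: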

import pandas as pd

data = [['CAROLINA GEORGE SCHOOL',['carolina','george'], ['school']],
        ['CAROLINA KINDER SCHOOL',['carolina','kinder'],['school']],
        ['GEORGE KINDER SCHOOL',['george','kinder'],['school']],
        ['CAROLINA SCHOOL',['carolina'], ['school']],
        ['GEORGE SCHOOL',['george'],['school']],
        ['GEORGE EDUCATION',['george'],['education']]
       ]
df = pd.DataFrame(data,columns=['name','first','second'])

df['len'] = df['first'].str.len()
df.sort_values(by='len', inplace=True)

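The first column holds the full name. The name is split into two parts, which are stored as lists in the next two columns. The idea is to find the root, or most basic form, of the lists and get rid of it.

For example, given the lists ['george','kinder'] and ['carolina','george'], their parent is ['george'], because george appears in both lists. Likewise, given the lists ['carolina','george'] and ['carolina','kinder'], their parent is ['carolina'].

This is the most basic case. A parent may also consist of several elements rather than just one.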
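To make the notion of a parent concrete, here is a minimal sketch that treats the parent as the set of tokens shared by every list. The helper name common_parent and the three-token lists in the last call are made up for illustration and are not part of the data above.

```python
def common_parent(*lists):
    """Return the tokens shared by every list, in the order of the first list."""
    if not lists:
        return []
    # Intersect the first list with all the remaining ones.
    shared = set(lists[0]).intersection(*lists[1:])
    return [token for token in lists[0] if token in shared]


print(common_parent(['george', 'kinder'], ['carolina', 'george']))    # ['george']
print(common_parent(['carolina', 'george'], ['carolina', 'kinder']))  # ['carolina']

# Hypothetical lists showing a parent made of more than one element.
print(common_parent(['carolina', 'george', 'kinder'],
                    ['carolina', 'george', 'high']))                  # ['carolina', 'george']
```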

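The goal is to find that parent and remove it. I am not sure whether a dataframe is the best approach for this problem.

The base dataframe looks like this: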

                     name               first       second  len
3         CAROLINA SCHOOL          [carolina]     [school]    1
4           GEORGE SCHOOL            [george]     [school]    1
5        GEORGE EDUCATION            [george]  [education]    1
0  CAROLINA GEORGE SCHOOL  [carolina, george]     [school]    2
1  CAROLINA KINDER SCHOOL  [carolina, kinder]     [school]    2
2    GEORGE KINDER SCHOOL    [george, kinder]     [school]    2

The expected result is:

                     name               first       second  len
5        GEORGE EDUCATION            [george]  [education]    1
0  CAROLINA GEORGE SCHOOL  [carolina, george]     [school]    2
1  CAROLINA KINDER SCHOOL  [carolina, kinder]     [school]    2
2    GEORGE KINDER SCHOOL    [george, kinder]     [school]    2

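Note that the GEORGE EDUCATION row is still there, because its second column holds a list whose value is education rather than school. So a parent should only be removed when it has the same value in the second column.

Thanks

1 Answer:

Answer 0: (score: 0)

You need to break the first column down into c1 and c2, then merge the potential parents with the children's names.

Here is the working code: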

import pandas as pd

data = [['CAROLINA GEORGE SCHOOL',['carolina','george'], ['school']],
        ['CAROLINA KINDER SCHOOL',['carolina','kinder'],['school']],
        ['GEORGE KINDER SCHOOL',['george','kinder'],['school']],
        ['CAROLINA SCHOOL',['carolina'], ['school']],
        ['GEORGE SCHOOL',['george'],['school']],
        ['GEORGE EDUCATION',['george'],['education']]
       ]
df = pd.DataFrame(data,columns=['name','first','second'])

df['len'] = df['first'].str.len()
df.sort_values(by='len', inplace=True)
df['c1'] = df['first'].transform(lambda x: x[0])

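pdf = df[df.len == 1]          # potential parents (one token)
cdf = df[df.len == 2].copy()   # children (two tokens); copy so adding c2 does not write into a slice of df

cdf['c2'] = cdf['first'].transform(lambda x: x[1])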

rdf = pd.concat([pdf.merge(cdf[['c1']], left_on='c1', right_on='c1', how='inner'),
                 pdf.merge(cdf[['c2']], left_on='c1', right_on='c2',how='inner')], axis=0)

# Parentheses around the comparison are required: & binds tighter than ==
print(df[~(df.c1.isin(rdf.c1) & (df.len == 1))])

Output:

                     name               first    second  len        c1
0  CAROLINA GEORGE SCHOOL  [carolina, george]  [school]    2  carolina
1  CAROLINA KINDER SCHOOL  [carolina, kinder]  [school]    2  carolina
2    GEORGE KINDER SCHOOL    [george, kinder]  [school]    2    george

PS: Feel free to drop or ignore the extra columns in the final output.
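Note that the output above no longer contains the GEORGE EDUCATION row, whereas the expected result in the question keeps it, because the merge never looks at the second column. A possible variant that takes second into account is sketched below. It rests on two assumptions that the question does not confirm: a row counts as a parent when its first list is a strict subset of another row's first list, and two rows only interact when their second lists are identical, order included. The names rows, parent_idx and result are illustrative.

```python
import pandas as pd

data = [['CAROLINA GEORGE SCHOOL', ['carolina', 'george'], ['school']],
        ['CAROLINA KINDER SCHOOL', ['carolina', 'kinder'], ['school']],
        ['GEORGE KINDER SCHOOL', ['george', 'kinder'], ['school']],
        ['CAROLINA SCHOOL', ['carolina'], ['school']],
        ['GEORGE SCHOOL', ['george'], ['school']],
        ['GEORGE EDUCATION', ['george'], ['education']]]
df = pd.DataFrame(data, columns=['name', 'first', 'second'])

# Lists are not hashable or set-comparable, so build comparable views:
# `first` as a frozenset (order ignored), `second` as a tuple (order kept).
rows = list(zip(df.index, df['first'].map(frozenset), df['second'].map(tuple)))

# Assumption: a row is a parent if some row with the same `second`
# has a `first` that strictly contains this row's `first`.
parent_idx = [
    i for i, first_i, second_i in rows
    if any(second_i == second_j and first_i < first_j
           for _, first_j, second_j in rows)
]

result = df.drop(index=parent_idx)
print(result)
```

With the sample data this drops CAROLINA SCHOOL and GEORGE SCHOOL and keeps GEORGE EDUCATION. It compares every pair of rows, so it is quadratic in the number of rows. That is fine for small frames, but a larger dataset would probably call for grouping by second first.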