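Drop duplicates in rows based on column values

Time: 2020-04-09 12:46:25

Tags: python pandas

Hello, I couldn't find anything on this, so apologies if it is a duplicate.

How can I drop columns that hold the same information as another column, meaning identical values in every row (with some exceptions)?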

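I want to drop the columns that have different names but contain the same information, except for columns that are bound to contain obvious duplicates (binary columns, for example).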

Example data:

      Name     Age     Job    How_Old    Occupation   Happy   Married?
 0    John     35      Dev    35         Dev          True    True
 1    Sally    42      CA     42         CA           False   False

Thanks. Please also note that I need to do this on a massive flattened and normalized JSON file, so loops would be very time-consuming.

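2 Answers:

Answer 0: (score: 3)

First exclude the boolean columns with DataFrame.select_dtypes, transpose with DataFrame.T, and flag the repeated rows (the original columns) with DataFrame.duplicated. Then invert the mask with ~, add the excluded boolean columns back as True with Series.reindex, and finally filter with DataFrame.loc, where the first : selects all rows and the mask selects the columns: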

m = (~df.select_dtypes(exclude=bool).T.duplicated()).reindex(df.columns, fill_value=True)

Another idea is to convert each column's values to a tuple and call Series.duplicated:

m = ((~df.select_dtypes(exclude=bool).apply(tuple).duplicated())
         .reindex(df.columns, fill_value=True))

df = df.loc[:, m]
print (df)
    Name  Age  Job  Happy  Married?
0   John   35  Dev   True      True
1  Sally   42   CA  False     False
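For anyone who wants to run the first variant end to end, here is a minimal sketch. The DataFrame is rebuilt by hand from the table in the question, and the names `non_bool`, `mask` and `result` are only illustrative.

```python
import pandas as pd

# Sample data rebuilt from the table in the question.
df = pd.DataFrame({
    "Name": ["John", "Sally"],
    "Age": [35, 42],
    "Job": ["Dev", "CA"],
    "How_Old": [35, 42],
    "Occupation": ["Dev", "CA"],
    "Happy": [True, False],
    "Married?": [True, False],
})

# Leave the boolean columns out of the comparison.
non_bool = df.select_dtypes(exclude=bool)

# Transpose so that each original column becomes a row, flag rows that
# repeat an earlier row, invert the flags, then add the boolean columns
# back with True so they are always kept.
mask = (~non_bool.T.duplicated()).reindex(df.columns, fill_value=True)

# Keep all rows and only the columns where the mask is True.
result = df.loc[:, mask]
print(result)
```

The transposed frame has mixed types, so the comparison runs on generic Python objects; how well that scales to a very wide frame is something to measure on the real data.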

Details:

# exclude boolean columns
print (df.select_dtypes(exclude=bool))
    Name  Age  Job  How_Old Occupation
0   John   35  Dev       35        Dev
1  Sally   42   CA       42         CA

#transpose
print (df.select_dtypes(exclude=bool).T)
               0      1
Name        John  Sally
Age           35     42
Job          Dev     CA
How_Old       35     42
Occupation   Dev     CA

# check for duplicated rows of the transposed frame (i.e. duplicated columns)
print (df.select_dtypes(exclude=bool).T.duplicated())
Name          False
Age           False
Job           False
How_Old        True
Occupation     True

# invert the mask: True -> False, False -> True
print ((~df.select_dtypes(exclude=bool).T.duplicated()))
Name           True
Age            True
Job            True
How_Old       False
Occupation    False
dtype: bool

# add the excluded boolean columns back, filled with True
print ((~df.select_dtypes(exclude=bool).T.duplicated())
           .reindex(df.columns, fill_value=True))
Name           True
Age            True
Job            True
How_Old       False
Occupation    False
Happy          True
Married?       True
dtype: bool

Answer 1: (score: 0)

Define the following function, which returns a list of the column names to drop:

def chkColToDel(df):
    # Column names excluding bool columns
    cols = df.select_dtypes(exclude=bool).columns.tolist()
    colsToDel = []
    while len(cols) > 1:
        cn1 = cols.pop(0)        # Column name, left side
        if cn1 not in colsToDel: # Not marked for deletion earlier
            c1 = df[cn1]         # The column itself
            t1 = c1.dtype.name   # Type name
            for cn2 in cols:     # Check remaining columns
                c2 = df[cn2]     # The column itself, right side
                if t1 == c2.dtype.name and c1.equals(c2):
                    # Same types and equal values
                    colsToDel.append(cn2) # Mark for deletion
    return colsToDel

Then call it:

colsToDel = chkColToDel(df)

The only thing left to do is drop the returned columns (if any):

if len(colsToDel) > 0:
    df.drop(columns=colsToDel, inplace=True)

I assume the exceptions mentioned in your post are actually the bool columns. If the list of exceptions is broader, change my code accordingly.
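As a rough sketch of what such a change might look like: the variant below takes an extra `protected` argument listing columns that must never be dropped. Both the function name `duplicate_columns` and the `protected` parameter are assumptions made for illustration; they are not part of the code above.

```python
import pandas as pd


def duplicate_columns(df, protected=()):
    """Return the names of columns that repeat an earlier column.

    `protected` is a hypothetical parameter (not in the answer's code):
    columns named there are never compared or dropped, just like bool columns.
    """
    skip = set(protected)
    candidates = [c for c in df.select_dtypes(exclude=bool).columns
                  if c not in skip]
    to_drop = []
    for i, left in enumerate(candidates):
        if left in to_drop:
            continue  # already marked as a duplicate of an earlier column
        for right in candidates[i + 1:]:
            same_type = df[left].dtype.name == df[right].dtype.name
            if right not in to_drop and same_type and df[left].equals(df[right]):
                to_drop.append(right)  # mark for deletion
    return to_drop


# Sample data rebuilt from the table in the question.
df = pd.DataFrame({
    "Name": ["John", "Sally"],
    "Age": [35, 42],
    "Job": ["Dev", "CA"],
    "How_Old": [35, 42],
    "Occupation": ["Dev", "CA"],
    "Happy": [True, False],
    "Married?": [True, False],
})

# Default behaviour: only bool columns are exempt.
print(duplicate_columns(df))

# Treat "Occupation" as an additional exception.
cleaned = df.drop(columns=duplicate_columns(df, protected=["Occupation"]))
print(cleaned.columns.tolist())
```

Passing the exceptions as names keeps the comparison logic untouched; the cost is still a pairwise loop over columns, so it may be slow on a very wide frame.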