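Hello, I couldn't find anything about this. Sorry if it's a duplicate.
How can I drop columns that contain the same information as another column (with a few exceptions)?
Example: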
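I want to drop columns that have different names but contain the same information, except for columns that will obviously contain some duplicates (for example, binary columns).
Output: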
Name Age Job How_Old Occupation Happy Married?
0 John 35 Dev 35 Dev True True
1 Sally 42 CA 42 CA False False
Thanks. Also note that I need to do this on a massive flattened and normalized JSON file, so looping would be very time-consuming.
Answer 0 (score: 3)
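First exclude the boolean columns with DataFrame.select_dtypes, transpose, and flag the duplicated rows of the transposed frame with DataFrame.duplicated. Then invert the mask with ~ and add the removed boolean columns back with Series.reindex. Finally, filter with DataFrame.loc: the first : selects all rows, and the mask selects the columns by name.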
m = (~df.select_dtypes(exclude=bool).T.duplicated()).reindex(df.columns, fill_value=True)
Another idea is to convert each column's values to a tuple and call Series.duplicated:
m = ((~df.select_dtypes(exclude=bool).apply(tuple).duplicated())
       .reindex(df.columns, fill_value=True))
df = df.loc[:, m]
print (df)
Name Age Job Happy Married?
0 John 35 Dev True True
1 Sally 42 CA False False
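For anyone who wants to run the transpose-based mask end to end, here is a minimal sketch. The DataFrame is rebuilt by hand from the sample table in the question, and the names non_bool and result are only illustrative.

```python
import pandas as pd

# Sample data rebuilt by hand from the table in the question (an assumption).
df = pd.DataFrame({
    "Name": ["John", "Sally"],
    "Age": [35, 42],
    "Job": ["Dev", "CA"],
    "How_Old": [35, 42],
    "Occupation": ["Dev", "CA"],
    "Happy": [True, False],
    "Married?": [True, False],
})

# Compare only the non-boolean columns.
non_bool = df.select_dtypes(exclude=bool)

# After transposing, each original column is a row, so duplicated()
# flags every column whose values repeat an earlier column.
m = (~non_bool.T.duplicated()).reindex(df.columns, fill_value=True)

# Keep all rows and only the columns where the mask is True.
result = df.loc[:, m]
print(result)
```

The boolean columns never enter the comparison, so Happy and Married? are both kept even though their values are identical.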
Details:
# exclude boolean columns
print (df.select_dtypes(exclude=bool))
Name Age Job How_Old Occupation
0 John 35 Dev 35 Dev
1 Sally 42 CA 42 CA
#transpose
print (df.select_dtypes(exclude=bool).T)
0 1
Name John Sally
Age 35 42
Job Dev CA
How_Old 35 42
Occupation Dev CA
# check for duplicated rows of the transposed frame (i.e. duplicated columns)
print (df.select_dtypes(exclude=bool).T.duplicated())
Name False
Age False
Job False
How_Old True
Occupation True
# invert mask: True->False, False->True
print ((~df.select_dtypes(exclude=bool).T.duplicated()))
Name True
Age True
Job True
How_Old False
Occupation False
dtype: bool
# add the removed boolean columns back as True
print ((~df.select_dtypes(exclude=bool).T.duplicated())
         .reindex(df.columns, fill_value=True))
Name True
Age True
Job True
How_Old False
Occupation False
Happy True
Married? True
dtype: bool
Answer 1 (score: 0)
Define the following function, which returns a list of column names to delete:
def chkColToDel(df):
    # Column names excluding bool columns
    cols = df.select_dtypes(exclude=bool).columns.tolist()
    colsToDel = []
    while len(cols) > 1:
        cn1 = cols.pop(0)   # Column name, left side
        if cn1 not in colsToDel:   # Not marked for deletion earlier
            c1 = df[cn1]   # The column itself
            t1 = c1.dtype.name   # Type name
            for cn2 in cols:   # Check remaining columns
                c2 = df[cn2]   # The column itself, right side
                if t1 == c2.dtype.name and c1.equals(c2):
                    # Same types and equal values
                    colsToDel.append(cn2)   # Mark for deletion
    return colsToDel
Then call it:
colsToDel = chkColToDel(df)
The only thing left is to drop the returned columns (if any):
if len(colsToDel) > 0:
    df.drop(columns=colsToDel, inplace=True)
I assume that the exceptions mentioned in your post actually refer to bool columns. If the list of exceptions is broader, change my code accordingly.
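If the exceptions do turn out to be broader than bool columns, one possible adjustment is sketched below. The function name find_duplicate_columns and its exempt parameter are hypothetical, introduced only for illustration. The sample frame is rebuilt by hand from the question, and unique column names are assumed.

```python
import pandas as pd

def find_duplicate_columns(df, exempt=()):
    """Return names of columns that repeat an earlier column.

    `exempt` (hypothetical) lists columns that are never compared
    or dropped, in addition to the bool columns.
    """
    skip = set(exempt)
    candidates = [c for c in df.select_dtypes(exclude=bool).columns
                  if c not in skip]
    to_drop = []
    for i, left in enumerate(candidates):
        if left in to_drop:
            continue  # Already marked as a duplicate of an earlier column
        for right in candidates[i + 1:]:
            if right in to_drop:
                continue
            # Same dtype and equal values
            if df[left].dtype == df[right].dtype and df[left].equals(df[right]):
                to_drop.append(right)
    return to_drop

# Sample data rebuilt by hand from the question (an assumption).
df = pd.DataFrame({
    "Name": ["John", "Sally"],
    "Age": [35, 42],
    "Job": ["Dev", "CA"],
    "How_Old": [35, 42],
    "Occupation": ["Dev", "CA"],
    "Happy": [True, False],
    "Married?": [True, False],
})

drop_all = find_duplicate_columns(df)
drop_some = find_duplicate_columns(df, exempt=["How_Old"])
cleaned = df.drop(columns=drop_some)
```

Here How_Old survives because it is listed in exempt, while Occupation is still dropped as a copy of Job.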