Question

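I have a DataFrame with 200 columns. Most of them are mostly NaN. I want to select all columns that have no NaN values, or at least the columns with the fewest NaNs. I tried dropping with a threshold and filtering with notnull(), but without success. Any ideas?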

df.dropna(thresh=2, inplace=True)
df_notnull = df[df.notnull()]

For example,

df:

col1  col2 col3
23     45  NaN
54     39  NaN
NaN    45  76
87     32  NaN

The output should look like this:

 df.dropna(axis=1, thresh=2)

    col1  col2
    23     45  
    54     39  
    NaN    45  
    87     32

Answer 1

You can keep only the columns that are not entirely NaN with:

df = df[df.columns[~df.isnull().all()]]

Or

null_cols = df.columns[df.isnull().all()]
df.drop(null_cols, axis = 1, inplace = True)

If you want to drop columns based on a certain percentage of NaNs, say columns where more than 90% of the data is null:

cols_to_delete = df.columns[df.isnull().sum()/len(df) > .90]
df.drop(cols_to_delete, axis = 1, inplace = True)
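As a minimal sketch of the percentage-based approach above: `isnull().mean()` gives the fraction of missing values per column directly, which is equivalent to `isnull().sum()/len(df)`. The sample frame is taken from the question, and the 0.5 cutoff is an arbitrary value assumed for illustration.

```python
import numpy as np
import pandas as pd

# Sample data from the question (assumed here for illustration)
df = pd.DataFrame({
    "col1": [23, 54, np.nan, 87],
    "col2": [45, 39, 45, 32],
    "col3": [np.nan, np.nan, 76, np.nan],
})

# Fraction of missing values in each column
nan_fraction = df.isnull().mean()

# Keep columns whose NaN fraction is at most the cutoff (0.5 is an arbitrary choice)
max_nan_fraction = 0.5
kept = df.loc[:, nan_fraction <= max_nan_fraction]
print(list(kept.columns))
```

Selecting with a boolean mask returns a new DataFrame and leaves `df` unchanged, which avoids the side effects of `inplace=True`.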

Answer 2

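I assume you want to get all the columns without any NaN. If that is the case, you can first get the names of the columns without any NaN using ~col.isnull().any(), then use those to select your columns.

I can think of the following code: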

import numpy as np
import pandas as pd

# pd.np was removed in pandas 2.0, so NumPy is imported directly
df = pd.DataFrame({
    'col1': [23, 54, np.nan, 87],
    'col2': [45, 39, 45, 32],
    'col3': [np.nan, np.nan, 76, np.nan]
})

# This function checks whether the column has more than `threshold` null values
def has_nan(col, threshold=0):
    return col.isnull().sum() > threshold

# Then you apply the "complement" of function to get the column with
# no NaN.

df.loc[:, ~df.apply(has_nan)]

# ... or pass the threshold as parameter, if needed
df.loc[:, ~df.apply(has_nan, args=(2,))]
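The same selection can probably be written without `apply`, since `isnull().sum()` already returns the per-column NaN count as a Series. This is a sketch of that alternative, not the answer author's code; the sample frame matches the question and the variable names are assumptions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col1": [23, 54, np.nan, 87],
    "col2": [45, 39, 45, 32],
    "col3": [np.nan, np.nan, 76, np.nan],
})

# Number of NaN values in each column
nan_counts = df.isnull().sum()

# Columns with no NaN at all
no_nan = df.loc[:, nan_counts == 0]

# Columns with at most 2 NaN values (same result as a threshold of 2 above)
at_most_two = df.loc[:, nan_counts <= 2]

print(list(no_nan.columns))
print(list(at_most_two.columns))
```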

Answer 3

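You should try df_notnull = df.dropna(how='all'). This keeps only the rows that are not entirely null. To apply the same filter to columns instead of rows, pass axis=1.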

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html
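A short sketch of how `dropna` behaves when pointed at columns rather than rows, using the question's sample data. Here `axis=1` applies it to columns, `how` chooses between dropping on any NaN or only on all NaN, and `thresh` is the minimum number of non-NaN values a column needs in order to be kept. The variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col1": [23, 54, np.nan, 87],
    "col2": [45, 39, 45, 32],
    "col3": [np.nan, np.nan, 76, np.nan],
})

# Drop every column that contains at least one NaN
strict = df.dropna(axis=1, how="any")

# Drop only columns that are entirely NaN (none are, in this sample)
lenient = df.dropna(axis=1, how="all")

# Keep columns with at least 2 non-NaN values
by_thresh = df.dropna(axis=1, thresh=2)

print(list(strict.columns))
print(list(lenient.columns))
print(list(by_thresh.columns))
```

Note that `thresh` counts non-NaN values, not NaNs, so a higher `thresh` is stricter.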

Answer 4

null_series = df.isnull().sum() # The number of missing values from each column in your dataframe
full_col_series = null_series[null_series == 0] # Will keep only the columns with no missing values

df = df[full_col_series.index]

Answer 5

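df[df.columns[~df.isnull().any()]] will give you a DataFrame containing only the columns that have no null values, which should be the solution.

df[df.columns[~df.isnull().all()]] only removes the columns that contain nothing but null values, and keeps any column with even one non-null value.

df.isnull() returns a boolean DataFrame with the same shape as df. Each boolean is True if that particular value is null, and False otherwise.

df.isnull().any() returns True for every column that has even one null. This is where I diverge from the accepted answer, because df.isnull().all() will not flag a column that has even one non-null value!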
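To make the difference between `any()` and `all()` concrete, here is a small sketch. It uses the question's data plus a hypothetical `col4` that is entirely NaN, added only so that `all()` has something to flag.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col1": [23, 54, np.nan, 87],
    "col2": [45, 39, 45, 32],
    "col3": [np.nan, np.nan, 76, np.nan],
    "col4": [np.nan, np.nan, np.nan, np.nan],  # hypothetical all-NaN column
})

has_any_null = df.isnull().any()   # True if the column has at least one NaN
is_all_null = df.isnull().all()    # True only if every value in the column is NaN

# Columns with no NaN at all
fully_populated = df[df.columns[~has_any_null]]

# Columns with at least one non-NaN value
not_empty = df[df.columns[~is_all_null]]

print(list(fully_populated.columns))
print(list(not_empty.columns))
```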

Answer 6

Here is a simple function that you can use directly by passing a DataFrame and a threshold

df
'''
     pets   location     owner     id
0     cat  San_Diego     Champ  123.0
1     dog        NaN       Ron    NaN
2     cat        NaN     Brick    NaN
3  monkey        NaN     Champ    NaN
4  monkey        NaN  Veronica    NaN
5     dog        NaN      John    NaN
'''

def rmissingvaluecol(dff, threshold):
    # Percentage of missing values in each column
    pct_missing = 100 * dff.isnull().sum() / len(dff.index)
    # Names of the columns to keep (missing percentage below the threshold)
    l = list(dff.columns[pct_missing < threshold])
    print("# Columns having more than %s percent missing values:" % threshold, (dff.shape[1] - len(l)))
    print("Columns:\n", list(set(dff.columns) - set(l)))
    return l


rmissingvaluecol(df,1) #Here threshold is 1% which means we are going to drop columns having more than 1% of missing values

#output
'''
# Columns having more than 1 percent missing values: 2
Columns:
 ['id', 'location']
'''

Now create a new DataFrame that excludes these columns

l = rmissingvaluecol(df,1)
df1 = df[l]

PS: You can change the threshold as needed

Bonus step

You can find the percentage of missing values for each column (optional)

def missing(dff):
    print (round((dff.isnull().sum() * 100/ len(dff)),2).sort_values(ascending=False))

missing(df)

#output
'''
id          83.33
location    83.33
owner        0.00
pets         0.00
dtype: float64
'''

Answer 7

This worked quite well for me, and it can probably be tailored to your needs too!

def nan_weed(df, thresh):
    # Collect the columns whose NaN count is at most thresh
    ind = []
    for col in df.columns:  # iterate over labels so every column is checked
        if df[col].isnull().sum() <= thresh:
            ind.append(col)
    return df[ind]
