Question

I have a pandas DataFrame named df:

df = {
     'a' : [1, NaN, 2, NaN],
     ...
     'b' : [1, 5, 6, 6]
}

I want a list of tuples, each containing: (column_name, #_non_null_values_for_that_column)

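With df.info() I can see the number of non-null values. I would like to iterate programmatically over the column names and non-null counts, much as I would iterate over a df or a dict: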

for column_name, non_null_count in ?:
    ...

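How can I get this information from the df.info() call? Note: I know how to get this information from the DataFrame itself; I am specifically curious about what df.info() returns.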

Answer 1

Combine pd.DataFrame.isnull with pd.Series.items to get the null count per column:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, np.nan, 2, np.nan],
                   'b': [1, 5, 6, 6]})

res = list(df.isnull().sum().items())
# [('a', 2), ('b', 0)]
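
Note that isnull().sum() counts the null entries, whereas the question asks for non-null counts. A minimal sketch of the non-null variant, assuming the same example DataFrame and using DataFrame.count(); the name non_null is only illustrative, and the cast to int is there just to keep the tuples as plain Python integers:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, np.nan, 2, np.nan],
                   'b': [1, 5, 6, 6]})

# DataFrame.count() returns the number of non-null cells in each column
non_null = [(name, int(n)) for name, n in df.count().items()]
print(non_null)  # [('a', 2), ('b', 4)]
```

The same list can then be consumed with `for column_name, non_null_count in non_null:`, which is the loop shape the question asks for.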

Answer 2

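Answer: Use a string buffer (from the io package) to capture the text that .info() writes; the call itself returns None. Once the text is captured, basic Python string operations are enough to get what you need.

Code: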

# Buffer functionality
import io
# Regular expression functionality
import re

buffer = io.StringIO()
df.info(buf=buffer)

# In the older info() layout assumed here (pandas < 1.0), the first 3 lines and the last 2 lines
# describe the frame, and splitting on '\n' leaves one trailing '' (hence -3).
# Newer pandas versions print a header and an index column, so the slice and split indices would need adjusting.
# Collapse runs of spaces into one so that, after splitting, split_arr[0] == column_name and split_arr[1] == non_null_count.
tuple_array = [
    (re.sub(' +', ' ', val).split(' ')[0], re.sub(' +', ' ', val).split(' ')[1]) 
    for val in buffer.getvalue().split('\n')[3:-3]
]
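
The slice-based parsing above depends on the exact line layout that info() prints, and that layout differs between pandas versions. A hedged sketch that instead matches each column row with a regular expression; the pattern and the names row_pattern and counts are illustrative assumptions, not part of the pandas API, and unusual column names (for example names that begin with digits followed by spaces) may be misparsed:

```python
import io
import re

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, np.nan, 2, np.nan],
                   'b': [1, 5, 6, 6]})

buffer = io.StringIO()
df.info(buf=buffer)  # info() returns None; the report is written to the buffer

# Optional leading row index, then the column name, then "<count> non-null"
row_pattern = re.compile(r'^\s*(?:\d+\s+)?(\S.*?)\s+(\d+) non-null')

counts = []
for line in buffer.getvalue().splitlines():
    match = row_pattern.match(line)
    if match:  # header, separator, and summary lines do not match
        counts.append((match.group(1), int(match.group(2))))

print(counts)  # [('a', 2), ('b', 4)]
```

Matching on the "non-null" marker avoids hard-coded slice offsets, and converting the count with int() yields numbers rather than strings. It still scrapes text meant for humans, so it should be treated as fragile.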

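Output: For the example DataFrame above, the result looks like the following. Note that the counts come back as strings, and that this code can be applied to any df.info() output that uses the same layout.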

tuple_array = [
     ('a', '2'),
     ...
     ('b', '4')
]

Getting column names and non-null counts from df.info() in pandas
