Get the rows with null values in a specific column and drop the other columns that contain nulls

Time: 2019-03-28 21:10:22

Tags: python pandas dataframe filter null

I have a data frame like this:

rawdata = {'col1': [3, nan, 4, 7, nan, 5],
           'col2': [10, 20, 10, 30, 10, 40],
           'col3': [23, 34, 45, 56, 34, 23],
           'col4': [5, 4, nan, 5, 1, nan],
           'col5': [28, 33, 33, 4, nan, 44]}

What I want to do is:

  1. Drop all columns that contain nan, except col4
  2. Get the rows where col4 is nan

In the end, I need the following:

target = {'col2': [10, 40],
          'col3': [45, 23],
          'col4': [nan, nan]}

Here is my code:

rawdata.drop(["col1", "col5"], axis=1, inplace=True)
rawdata = rawdata[rawdata.isnull().any(axis=1)][rawdata.columns[rawdata.isnull().any()]]

However, this returns only col4 itself. I also need col2 and col3.

2 answers:

Answer 0: (score: 2)

Assuming you can simply hard-code the columns that contain nan (as in your own example), this boils down to:

df.drop(['col1', 'col5'], axis=1)[df.col4.isna()]

If you do not want to hard-code these columns, you can take a different approach. With your test data, the hard-coded version gives:

In [13]: df
Out[13]:
   col1  col2  col3  col4  col5
0   3.0    10    23   5.0  28.0
1   NaN    20    34   4.0  33.0
2   4.0    10    45   NaN  33.0
3   7.0    30    56   5.0   4.0
4   NaN    10    34   1.0   NaN
5   5.0    40    23   NaN  44.0

In [14]: df.drop(['col1', 'col5'], axis=1)[df.col4.isna()]
Out[14]:
   col2  col3  col4
2    10    45   NaN
5    40    23   NaN
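
For the case where the columns are not hard-coded, one possible sketch follows. It is an assumption about what such an approach could look like, not necessarily what the answer's author had in mind. The names `target_col`, `cols` and `result` are made up for illustration, and `np.nan` stands in for the bare `nan` used in the question.

```python
import numpy as np
import pandas as pd

# Rebuild the question's test data; np.nan replaces the bare `nan`.
df = pd.DataFrame({
    'col1': [3, np.nan, 4, 7, np.nan, 5],
    'col2': [10, 20, 10, 30, 10, 40],
    'col3': [23, 34, 45, 56, 34, 23],
    'col4': [5, 4, np.nan, 5, 1, np.nan],
    'col5': [28, 33, 33, 4, np.nan, 44],
})

target_col = 'col4'  # hypothetical name for the column whose NaN rows we want

# Keep the target column plus every column that has no missing values,
# preserving the original column order.
cols = [c for c in df.columns if c == target_col or df[c].notna().all()]

# Select only the rows where the target column is NaN.
result = df.loc[df[target_col].isna(), cols]
print(result)
```

Here the columns to drop are derived from the data, so the same lines should keep working if other columns gain or lose missing values.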

Answer 1: (score: 1)

I assume here that you have built a data frame with df = pd.DataFrame(rawdata).

I would first build a Series that marks the columns to keep:

keep = df.count() == len(df)
keep['col4'] = True
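
To see what this mask contains, here is a small self-contained sketch. It assumes the question's data with `np.nan` in place of the bare `nan`. `df.count()` returns the number of non-missing values per column, so comparing it with `len(df)` is True only for columns without any NaN.

```python
import numpy as np
import pandas as pd

# Rebuild the question's test data; np.nan replaces the bare `nan`.
df = pd.DataFrame({
    'col1': [3, np.nan, 4, 7, np.nan, 5],
    'col2': [10, 20, 10, 30, 10, 40],
    'col3': [23, 34, 45, 56, 34, 23],
    'col4': [5, 4, np.nan, 5, 1, np.nan],
    'col5': [28, 33, 33, 4, np.nan, 44],
})

# Non-missing count per column equals the row count only for complete columns.
keep = df.count() == len(df)

# Force col4 to be kept even though it contains NaN.
keep['col4'] = True

print(keep)
```

With this data only col2, col3 and col4 end up marked True.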

Then what you want is simply:

df.loc[df.col4.isna(), keep]

which gives, as expected:

   col2  col3  col4
2    10    45   NaN
5    40    23   NaN

If you want a dict, that is just df.loc[df.col4.isna(), keep].to_dict()
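
Note that a plain `to_dict()` returns a nested dict keyed by row index. If the list-valued layout of `target` in the question is what is needed, passing `orient='list'` is one option. The sketch below assumes the question's data with `np.nan` in place of the bare `nan`; the names `result` and `as_lists` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Rebuild the question's test data; np.nan replaces the bare `nan`.
df = pd.DataFrame({
    'col1': [3, np.nan, 4, 7, np.nan, 5],
    'col2': [10, 20, 10, 30, 10, 40],
    'col3': [23, 34, 45, 56, 34, 23],
    'col4': [5, 4, np.nan, 5, 1, np.nan],
    'col5': [28, 33, 33, 4, np.nan, 44],
})

# Boolean mask over columns: complete columns, plus col4.
keep = df.count() == len(df)
keep['col4'] = True

# Rows where col4 is NaN, restricted to the kept columns.
result = df.loc[df['col4'].isna(), keep]

# orient='list' maps each column name to a plain list of its values.
as_lists = result.to_dict(orient='list')
print(as_lists)
```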