I have a data frame like this:
rawdata = {'col1': [3, nan, 4, 7, nan, 5],
           'col2': [10, 20, 10, 30, 10, 40],
           'col3': [23, 34, 45, 56, 34, 23],
           'col4': [5, 4, nan, 5, 1, nan],
           'col5': [28, 33, 33, 4, nan, 44]}
What I want to do is drop every column that contains nan except col4, and keep only the rows where col4 is nan. In the end I need the following:
target = {'col2': [10, 40],
          'col3': [45, 23],
          'col4': [nan, nan]}
My code is as follows:
rawdata.drop(["col1", "col5"], axis=1, inplace=True)
rawdata = rawdata[rawdata.isnull().any(axis=1)][rawdata.columns[rawdata.isnull().any()]]
However, this returns only col4 itself. I need col2 and col3 as well.
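For reference, here is a minimal self-contained reproduction of the problem. It assumes that rawdata is turned into a DataFrame with pd.DataFrame and that nan means numpy.nan; neither is stated in the question. The column selector keeps only the columns that contain nan, which is presumably why col2 and col3 disappear.

```python
import numpy as np
import pandas as pd

# Assumption: `nan` in the question is numpy.nan and rawdata becomes a DataFrame.
rawdata = {'col1': [3, np.nan, 4, 7, np.nan, 5],
           'col2': [10, 20, 10, 30, 10, 40],
           'col3': [23, 34, 45, 56, 34, 23],
           'col4': [5, 4, np.nan, 5, 1, np.nan],
           'col5': [28, 33, 33, 4, np.nan, 44]}
df = pd.DataFrame(rawdata).drop(["col1", "col5"], axis=1)

# Rows that contain at least one NaN (after the drop, only col4 can have one).
rows_with_nan = df.isnull().any(axis=1)
# Columns that contain at least one NaN: this is only col4.
cols_with_nan = df.columns[df.isnull().any()]

result = df[rows_with_nan][cols_with_nan]
print(list(result.columns))
```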
Answer 0 (score: 2)
Assuming you can simply hard-code the columns that contain nan (as in your own example), this boils down to
df.drop(['col1', 'col5'], axis=1)[df.col4.isna()]
With your test data:
In [13]: df
Out[13]:
   col1  col2  col3  col4  col5
0   3.0    10    23   5.0  28.0
1   NaN    20    34   4.0  33.0
2   4.0    10    45   NaN  33.0
3   7.0    30    56   5.0   4.0
4   NaN    10    34   1.0   NaN
5   5.0    40    23   NaN  44.0
In [14]: df.drop(['col1', 'col5'], axis=1)[df.col4.isna()]
Out[14]:
   col2  col3  col4
2    10    45   NaN
5    40    23   NaN
If you do not want to hard-code these columns, a different approach is possible.
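One possible way to avoid hard-coding the column names is sketched below. This is not taken from the answer itself: it assumes the rule "drop every column that contains NaN, except col4" and derives the list of columns from the data. The name to_drop is only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [3, np.nan, 4, 7, np.nan, 5],
                   'col2': [10, 20, 10, 30, 10, 40],
                   'col3': [23, 34, 45, 56, 34, 23],
                   'col4': [5, 4, np.nan, 5, 1, np.nan],
                   'col5': [28, 33, 33, 4, np.nan, 44]})

# Columns holding at least one NaN, minus the column we want to keep.
nan_cols = df.columns[df.isna().any()]
to_drop = nan_cols.difference(['col4'])

# Drop those columns, then keep only the rows where col4 is NaN.
result = df.drop(columns=to_drop)[df['col4'].isna()]
print(result)
```

The boolean mask is computed on the original df, but it lines up with the reduced frame because both share the same row index.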
Answer 1 (score: 1)
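I am assuming here that you have used df = pd.DataFrame(rawdata).
I would first build a boolean Series marking the columns to keep, i.e. the columns without any nan: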
keep = df.count() == len(df)
keep['col4'] = True
Then what you want is simply:
df.loc[df.col4.isna(), keep]
which gives, as expected:
   col2  col3  col4
2    10    45   NaN
5    40    23   NaN
If you want a dict, that is just df.loc[df.col4.isna(), keep].to_dict()
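Note that to_dict() with its default orientation returns a nested {column: {index: value}} mapping. If the list-per-column shape of target in the question is what you are after, passing orient='list' should get closer to it. A short sketch, again assuming nan is numpy.nan:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [3, np.nan, 4, 7, np.nan, 5],
                   'col2': [10, 20, 10, 30, 10, 40],
                   'col3': [23, 34, 45, 56, 34, 23],
                   'col4': [5, 4, np.nan, 5, 1, np.nan],
                   'col5': [28, 33, 33, 4, np.nan, 44]})

# True for columns without any NaN, then force col4 to be kept as well.
keep = df.count() == len(df)
keep['col4'] = True

# Rows where col4 is NaN, restricted to the kept columns.
result = df.loc[df.col4.isna(), keep]

# One list of values per column instead of the default nested dict.
target = result.to_dict(orient='list')
print(target)
```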