I have a dataframe called df whose columns can be grouped logically. I therefore grouped the column names into the lists A, B, C, where:
A = [column_1, column_2, column_3]
B = [column_4, column_5, column_6]
C = [column_7, column_8, column_9]
Besides columns column_1 through column_9, df also has a column called "filename_ID", which serves as an index and is therefore not grouped. Columns column_1 through column_9 contain only 0 and 1 values.
Now I want to filter the dataframe so that it only contains rows where each group (A, B, C) has at least one non-zero value. In other words, I only want to keep the rows whose filename_ID satisfies this condition.
I managed to create a separate dataframe for each group:
df_A = df.loc[(df[A]!=0).any(axis=1)]
df_B = df.loc[(df[B]!=0).any(axis=1)]
df_C = df.loc[(df[C]!=0).any(axis=1)]
However, I don't know how to apply all the conditions at once, i.e. how to create a single new dataframe in which every row satisfies the condition of having at least one non-zero value in each logical column group.
Answer 0 (score: 3)
import numpy as np
import pandas as pd

np.random.seed([3, 1415])
df = pd.DataFrame(
np.random.randint(2, size=(10, 9)),
columns=[f"col{i + 1}" for i in range(9)]
)
df
col1 col2 col3 col4 col5 col6 col7 col8 col9
0 0 1 0 1 0 0 1 0 1
1 1 1 1 0 1 1 0 1 0
2 0 0 0 0 0 0 0 0 0
3 1 0 1 1 1 1 0 0 0
4 0 0 1 1 1 1 1 0 1
5 1 1 0 1 1 1 1 1 1
6 1 0 1 0 0 0 1 1 0
7 0 0 0 0 0 1 0 1 0
8 1 0 1 0 1 0 0 1 1
9 1 0 1 0 0 1 0 1 0
Create a mapping dictionary from column name to group:
m = {
**dict.fromkeys(['col1', 'col2', 'col3'], 'A'),
**dict.fromkeys(['col4', 'col5', 'col6'], 'B'),
**dict.fromkeys(['col7', 'col8', 'col9'], 'C'),
}
Then groupby with axis=1: any() checks each group for a non-zero value per row, and all(1) keeps only the rows where every group passes.
df[df.groupby(m, axis=1).any().all(1)]
col1 col2 col3 col4 col5 col6 col7 col8 col9
0 0 1 0 1 0 0 1 0 1
1 1 1 1 0 1 1 0 1 0
4 0 0 1 1 1 1 1 0 1
5 1 1 0 1 1 1 1 1 1
8 1 0 1 0 1 0 0 1 1
9 1 0 1 0 0 1 0 1 0
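Note that groupby(..., axis=1) has been deprecated in recent pandas releases. A sketch of the same idea that avoids it (assuming a reasonably current pandas) transposes the frame so the column names become the index and groups on that instead:

```python
import numpy as np
import pandas as pd

np.random.seed([3, 1415])
df = pd.DataFrame(
    np.random.randint(2, size=(10, 9)),
    columns=[f"col{i + 1}" for i in range(9)]
)

m = {
    **dict.fromkeys(['col1', 'col2', 'col3'], 'A'),
    **dict.fromkeys(['col4', 'col5', 'col6'], 'B'),
    **dict.fromkeys(['col7', 'col8', 'col9'], 'C'),
}

# Transpose so column names become the index, group them by m,
# check each group for any non-zero value per original row, then
# require all three groups to pass.
mask = df.T.groupby(m).any().all()
result = df[mask]
```

The resulting mask is a boolean Series aligned with df's row index, so it can be used directly for row selection.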
Note the rows that did not make it:
col1 col2 col3 col4 col5 col6 col7 col8 col9
2 0 0 0 0 0 0 0 0 0
3 1 0 1 1 1 1 0 0 0
6 1 0 1 0 0 0 1 1 0
7 0 0 0 0 0 1 0 1 0
You can also build the same mapping dictionary from the group lists and get the same result:
cols = [['col1', 'col2', 'col3'], ['col4', 'col5', 'col6'], ['col7', 'col8', 'col9']]
m = {k: v for v, c in enumerate(cols) for k in c}
Answer 1 (score: 1)
Try the following:
import numpy as np

# A, B, C are the column-name lists defined in the question
column_groups = [A, B, C]
masks = [(df[cols] != 0).any(axis=1) for cols in column_groups]
full_mask = np.logical_and.reduce(masks)
full_df = df[full_mask]
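A self-contained version of this approach, using hypothetical sample data in place of the question's df (the 0/1 values here are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for the question's df
df = pd.DataFrame({
    'filename_ID': [1, 2, 3, 4],
    'column_1': [0, 0, 1, 0], 'column_2': [1, 0, 0, 0], 'column_3': [0, 0, 0, 1],
    'column_4': [1, 0, 1, 0], 'column_5': [0, 0, 0, 0], 'column_6': [0, 1, 0, 0],
    'column_7': [0, 0, 1, 1], 'column_8': [1, 0, 0, 0], 'column_9': [0, 1, 0, 0],
})

A = ['column_1', 'column_2', 'column_3']
B = ['column_4', 'column_5', 'column_6']
C = ['column_7', 'column_8', 'column_9']

# One boolean mask per group: True where the row has any non-zero value
masks = [(df[cols] != 0).any(axis=1) for cols in [A, B, C]]
# AND the masks together so every group must pass
full_mask = np.logical_and.reduce(masks)
full_df = df[full_mask]
```

Here row 2 fails group A (all zeros in column_1..column_3) and row 4 fails group B, so only filename_ID 1 and 3 survive.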
Answer 2 (score: 1)
I created a csv file with the sample data below.
Sample input:
ID a1 a2 a3 a4 a5 a6 a7 a8 a9
1 1 1 1 1 1 1 1 1 1
2 0 0 0 1 0 0 0 1 0
3 0 1 0 0 0 0 1 0 0
4 0 0 0 0 1 0 1 0 1
5 1 1 0 1 1 1 1 0 1
6 0 0 0 0 1 0 0 1 0
7 1 0 1 1 1 0 1 1 1
8 1 1 1 0 1 1 1 0 1
9 0 0 0 1 0 1 0 0 0
10 0 0 1 0 0 0 0 0 0
11 1 0 1 0 1 1 0 1 1
12 1 1 0 1 0 1 1 0 1
import pandas as pd

df = pd.read_csv('check.csv')
# Sum each logical group of columns
df['sumA'] = df.a1 + df.a2 + df.a3
df['sumB'] = df.a4 + df.a5 + df.a6
df['sumC'] = df.a7 + df.a8 + df.a9
# Keep rows with at least one non-zero value in every group
new_df = df[(df.sumA > 0) & (df.sumB > 0) & (df.sumC > 0)]
new_df = new_df.drop(['sumA', 'sumB', 'sumC'], axis=1)
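The same idea can be written without mutating the frame with helper columns, summing each group directly. A minimal sketch, using a small inline frame (values copied from the first three rows of the sample input) in place of check.csv:

```python
import pandas as pd

# Inline stand-in for check.csv (first three rows of the sample input)
df = pd.DataFrame({
    'ID': [1, 2, 3],
    'a1': [1, 0, 0], 'a2': [1, 0, 1], 'a3': [1, 0, 0],
    'a4': [1, 1, 0], 'a5': [1, 0, 0], 'a6': [1, 0, 0],
    'a7': [1, 0, 1], 'a8': [1, 1, 0], 'a9': [1, 0, 0],
})

groups = [['a1', 'a2', 'a3'], ['a4', 'a5', 'a6'], ['a7', 'a8', 'a9']]
# A row survives if every group's sum is positive,
# i.e. the group has at least one non-zero value
keep = pd.concat(
    [df[g].sum(axis=1) > 0 for g in groups], axis=1
).all(axis=1)
new_df = df[keep]
```

No columns are added or dropped, so the original frame is left untouched.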