I have a sample pandas.DataFrame with 20K+ rows, in the following format:
import pandas as pd
import numpy as np
data = {"first_column": ["A", "B", "B", "B", "C", "A", "A", "A", "D", "B", "A", "A"],
"second_column": [0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0]}
df = pd.DataFrame(data)
>>> df
   first_column  second_column
0             A              0
1             B              1
2             B              1
3             B              1
4             C              0
5             A              0
6             A              0
7             A              1
8             D              1
9             B              1
10            A              1
11            A              0
....
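The column first_column contains A, B, C, or D for each row. The second column holds a binary label that marks a group of values. Every run of consecutive 1s is a unique "group"; for example, rows 1-3 are one group and rows 7-10 are another.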
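As an illustration of what "consecutive run" means here, the following minimal sketch numbers the runs by comparing each label with the previous row. The names `run_id` and `runs` are hypothetical and only serve this example.

```python
import pandas as pd

df = pd.DataFrame({
    "first_column": ["A", "B", "B", "B", "C", "A", "A", "A", "D", "B", "A", "A"],
    "second_column": [0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0],
})

# A new run starts wherever the label differs from the row before it;
# the cumulative sum of those change points gives every run its own number.
run_id = (df["second_column"] != df["second_column"].shift()).cumsum()

# Collect the row indices of each run of 1s (runs of 0s are skipped).
runs = {}
for idx, rid in run_id[df["second_column"] == 1].items():
    runs.setdefault(rid, []).append(idx)

print(list(runs.values()))  # [[1, 2, 3], [7, 8, 9, 10]]
```

The same `run_id` series can be passed directly to `df.groupby(...)` to process each run as a unit.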
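I want to "label" each of these groups as "AB" (the group consists only of A or B), "CD" (the group consists only of C or D), or "mixed" (if there is a mix, e.g. all B's and one C). It would also be useful to know "how" mixed some of these groups are as a percentage, i.e. the percentage of AB rows out of the total rows in the group. So if a group contains only A or B, the identity should be AB. If it contains only C or D, the identity should be CD. If it is a mixture of A, B, C and/or D, then it is mixed. The percentage is (# of AB rows) / (# of total rows).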
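The rule can be written down as a small pure-Python function, independent of pandas. This is only a sketch of the rule as stated; the function name `classify` is hypothetical, and rejecting an empty group is an assumption made for this example.

```python
def classify(letters):
    """Return (identity, percent) for the letters of one group."""
    if not letters:
        # Assumption: an empty group has no meaningful identity.
        raise ValueError("a group must contain at least one row")
    present = set(letters)
    if present <= {"A", "B"}:
        identity = "AB"
    elif present <= {"C", "D"}:
        identity = "CD"
    else:
        identity = "mixed"
    # percent = (# of AB rows) / (# of total rows)
    ab_rows = sum(1 for letter in letters if letter in ("A", "B"))
    return identity, ab_rows / len(letters)


print(classify(["B", "B", "B"]))       # ('AB', 1.0)
print(classify(["A", "D", "B", "A"]))  # ('mixed', 0.75)
```

Using a subset test means a group containing both A and B still counts as AB, which matches the wording "only A or B".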
Here is what the resulting DataFrame should look like:
>>> df
   first_column  second_column identity  percent
0             A              0        0        0
1             B              1       AB      1.0
2             B              1       AB      1.0
3             B              1       AB      1.0
4             C              0        0        0
5             A              0        0        0
6             A              0        0        0
7             A              1    mixed     0.75  # 3/4, 3-AB, 4-total
8             D              1    mixed     0.75
9             B              1    mixed     0.75
10            A              1    mixed     0.75
11            A              0        0        0
....
My initial idea was to start with df.loc, along these lines:
if (df.first_column == "A" | df.first_column == "B"):
    df.loc[df.second_column == 1, "identity"] = "AB"
if (df.first_column == "C" | df.first_column == "D"):
    df.loc[df.second_column == 1, "identity"] = "CD"
But this does not account for mixed groups, and it does not work for isolated groups.
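Two further details are worth noting about that attempt: a bare `if` on a Series raises `ValueError` because its truth value is ambiguous, and `|` binds more tightly than `==`, so each comparison would need its own parentheses. The sketch below shows the error and then a row-wise version using boolean masks, which runs but still judges each row by its own letter rather than by its group. The placeholder `"0"` for rows outside any group is an assumption for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "first_column": ["A", "B", "B", "B", "C", "A", "A", "A", "D", "B", "A", "A"],
    "second_column": [0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0],
})

# `if` needs a single True/False, but the comparison yields one value per row.
try:
    if df["first_column"] == "A":
        pass
    raised = False
except ValueError:
    raised = True
print(raised)  # True

# Row-wise masks run without error, but they label rows, not groups.
in_group = df["second_column"] == 1
is_ab = df["first_column"].isin(["A", "B"])
df["identity"] = "0"  # assumed placeholder for rows outside any group
df.loc[in_group & is_ab, "identity"] = "AB"
df.loc[in_group & ~is_ab, "identity"] = "CD"

# Rows 7-10 form one group, yet they receive different labels.
print(df.loc[7:10, "identity"].tolist())  # ['AB', 'CD', 'AB', 'AB']
```

This is why the rows need to be grouped by run first, so that the label is decided once per group.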
Answer 0 (score: 4)
Here is one approach.
Code:
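import pandas as pd
from collections import Counter

a_b = set('AB')
c_d = set('CD')

def get_id_percent(group):
    # Count how often each letter occurs in this run of rows.
    present = Counter(group['first_column'])
    present_set = set(present.keys())
    if group['second_column'].iloc[0] == 0:
        # Runs of 0s are not groups: no identity and no percent.
        ret_val = 0, 0
    elif present_set.issubset(a_b) and len(present_set) == 1:
        # A run made of a single letter, A or B (percent is left at 0).
        ret_val = 'AB', 0
    elif present_set.issubset(c_d) and len(present_set) == 1:
        # A run made of a single letter, C or D.
        ret_val = 'CD', 0
    else:
        # Everything else is reported as mixed, with the share of A/B rows.
        ret_val = 'mixed', \
            float(present['A'] + present['B']) / len(group)
    # One (identity, percent) row per input row, so it lines up with df.
    return pd.DataFrame(
        [ret_val] * len(group), columns=['identity', 'percent'])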
Test code:
data = {"first_column": ["A", "B", "B", "B", "C", "A", "A",
"A", "D", "B", "A", "A"],
"second_column": [0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0]}
df = pd.DataFrame(data)
groupby = df.groupby((df.second_column != df.second_column.shift()).cumsum())
results = groupby.apply(get_id_percent).reset_index()
results = results.drop(['second_column', 'level_1'], axis=1)
df = pd.concat([df, results], axis=1)
print(df)
Result:
   first_column  second_column identity  percent
0             A              0        0     0.00
1             B              1       AB     0.00
2             B              1       AB     0.00
3             B              1       AB     0.00
4             C              0        0     0.00
5             A              0        0     0.00
6             A              0        0     0.00
7             A              1    mixed     0.75
8             D              1    mixed     0.75
9             B              1    mixed     0.75
10            A              1    mixed     0.75
11            A              0        0     0.00
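Note that this result reports 0.00 for the pure AB group, whereas the question asks for 1.0, and the `len(present_set) == 1` check would send a group containing both A and B to mixed. The following is a sketch of one possible variant, not the answerer's method, that uses `groupby(...).transform` on the same run numbering and follows the question's stated rule. The names `run_id`, `ab_share` and `cd_share` are hypothetical, and the string "0" for rows outside any group is an assumption.

```python
import pandas as pd

df = pd.DataFrame({
    "first_column": ["A", "B", "B", "B", "C", "A", "A", "A", "D", "B", "A", "A"],
    "second_column": [0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0],
})

# Number the consecutive runs of equal labels.
run_id = (df["second_column"] != df["second_column"].shift()).cumsum()
in_group = df["second_column"] == 1

# Per-run share of A/B rows and of C/D rows, broadcast back to every row.
is_ab = df["first_column"].isin(["A", "B"]).astype(float)
is_cd = df["first_column"].isin(["C", "D"]).astype(float)
ab_share = is_ab.groupby(run_id).transform("mean")
cd_share = is_cd.groupby(run_id).transform("mean")

# Default to mixed, then overwrite the runs that are purely AB or purely CD.
identity = pd.Series("mixed", index=df.index)
identity[ab_share == 1.0] = "AB"
identity[cd_share == 1.0] = "CD"

# Rows whose label is 0 belong to no group (assumed placeholder values).
df["identity"] = identity.where(in_group, "0")
df["percent"] = ab_share.where(in_group, 0.0)
print(df)
```

Because `transform` returns a result aligned with the original index, no `reset_index` or `concat` step is needed, and the work stays vectorized, which may matter for a frame with 20K+ rows.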
Answer 1 (score: 1)
Here is one approach:
import pandas as pd

# generate example data
data = {"first_column": ["A", "B", "B", "B", "C", "A", "A", "A", "D", "B", "A", "A"],
        "second_column": [0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0]}
df = pd.DataFrame(data)

# these are intermediary columns for computation
df['group_type'] = None
df['ct'] = 0

def find_border(x):
    ''' finds and labels lettered groups '''
    ix = x.name
    # does second_column == 1?
    if x.second_column:
        # if it's the start of a group...
        if ix == 0 or not df.group_type[ix-1]:
            df.loc[ix, 'group_type'] = x.first_column
            df.loc[ix, 'ct'] += 1
            return
        # if it's the end of a group (next row is 0, or this is the last row)
        elif ix == len(df) - 1 or not df.second_column[ix+1]:
            df.loc[ix, 'group_type'] = df.group_type[ix-1] + x.first_column
            df.loc[ix, 'ct'] = df.ct[ix-1] + 1
            # copy the full letter string back to every row of the group
            for i in range(df.ct[ix-1]+1):
                df.loc[ix-i, 'group_type'] = df.loc[ix, 'group_type']
            df.loc[ix, 'ct'] = 0
            return
        # if it's the middle of a group
        else:
            df.loc[ix, 'ct'] = df.ct[ix-1] + 1
            df.loc[ix, 'group_type'] = df.group_type[ix-1] + x.first_column
            return
    return

# compute group membership (.loc replaces .ix, which was removed in pandas 1.0)
_ = df.apply(find_border, axis='columns')

def determine_id(x):
    if not x:
        return '0'
    # compare as sets, since the order of list(set(x)) is not guaranteed
    letters = set(x)
    if letters <= {'A', 'B'}:
        return 'AB'
    elif letters <= {'C', 'D'}:
        return 'CD'
    else:
        return 'mixed'

def determine_pct(x):
    if not x:
        return 0
    return sum([1 for letter in x if letter in ['A', 'B']]) / float(len(x))

# determine row identity
df['identity'] = df.group_type.apply(determine_id)
# determine % of A or B in group
df['percent'] = df.group_type.apply(determine_pct)
Output:
   first_column  second_column identity  percent
0             A              0        0     0.00
1             B              1       AB     1.00
2             B              1       AB     1.00
3             B              1       AB     1.00
4             C              0        0     0.00
5             A              0        0     0.00
6             A              0        0     0.00
7             A              1    mixed     0.75
8             D              1    mixed     0.75
9             B              1    mixed     0.75
10            A              1    mixed     0.75
11            A              0        0     0.00