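I have a set of data with an irregular number of columns, and I need to use pandas to find the most common value, as a share of the total, across several columns. As an example of what I mean, suppose I record which cheeses my coworkers eat at lunch each day: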
Idx Name Cheese1 Cheese2 Cheese3
0 Evan Gouda NaN NaN
1 John Cheddar Havarti Blue
2 Evan Cheddar Gouda NaN
3 John Havarti Swiss NaN
I'm looking for some function that would give me a final pivot table like this:
Name Cheese Pct
Evan Gouda .66
John Havarti .4
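I also don't know in advance how many columns will be present each time the script runs, only that they are all named "Cheese" + index. If John shows up with four cheeses the next day, I'll have to add a fourth column, and the analysis script needs to handle that.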
Answer 0 (score: 4)
import io
import pandas as pd

data = io.StringIO("""\
Idx Name Cheese1 Cheese2 Cheese3
0 Evan Gouda NaN NaN
1 John Cheddar Havarti Blue
2 Evan Cheddar Gouda NaN
3 John Havarti Swiss NaN
4 Rick NaN NaN NaN
""")
# sep=r'\s+' splits on whitespace (delim_whitespace is deprecated in recent pandas)
df = pd.read_csv(data, sep=r'\s+')

def top_cheese(g):
    # Pick up however many "Cheese<N>" columns are present
    cheese_cols = [col for col in g.columns if col.startswith('Cheese')]
    try:
        # stack() flattens the cheese columns into one Series;
        # value_counts(normalize=True) gives each cheese's share, most common first
        out = (g[cheese_cols].stack().value_counts(normalize=True)
               .reset_index().iloc[0])
        out.index = ['Cheese', 'Pct']
        return out
    except IndexError:
        # Nothing recorded for this person
        return pd.Series({'Cheese': 'None', 'Pct': 0})

output = df.groupby('Name').apply(top_cheese)
print(output)
Output:
Cheese Pct
Name
Evan Gouda 0.666667
John Havarti 0.400000
Rick None 0.000000
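To illustrate the core step inside top_cheese, here is a minimal sketch that applies it to a single group. The `evan` frame below is built by hand from Evan's two rows in the example data, and the explicit `dropna()` is an addition here so that the result does not depend on how a given pandas version treats missing cells in `stack()`.

```python
import pandas as pd

# Evan's two rows from the example data, cheese columns only.
evan = pd.DataFrame({
    'Cheese1': ['Gouda', 'Cheddar'],
    'Cheese2': [None, 'Gouda'],
    'Cheese3': [None, None],
})

# Flatten all cheese columns into a single Series and discard empty cells.
flat = evan.stack().dropna()

# Relative frequency of each cheese, most common first.
shares = flat.value_counts(normalize=True)
print(shares)
```

Evan has three recorded cheeses, two of which are Gouda, so Gouda's share is 2/3, which matches the 0.666667 in the output above.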
Answer 1 (score: 0)
I've been using R lately, and this is how I would solve it there:
library(data.table)
library(dplyr)
library(tidyr)
x <- fread('
Idx Name Cheese1 Cheese2 Cheese3
0 Evan Gouda NaN NaN
1 John Cheddar Havarti Blue
2 Evan Cheddar Gouda NaN
3 John Havarti Swiss NaN', na = 'NaN')
gather(x, , Cheese, matches('Cheese'), na.rm = T) %>%
group_by(Name, Cheese) %>%
summarise(n = n()) %>%
group_by(Name) %>%
mutate(p = n/sum(n)) %>%
filter(p == max(p)) %>%
select(-n)
Which outputs:
Name Cheese p
(chr) (chr) (dbl)
1 Evan Gouda 0.6666667
2 John Havarti 0.4000000
I was curious what the pandas equivalent would look like. This is what I came up with:
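import io
import pandas as pd

x = pd.read_csv(io.StringIO('''
Idx Name Cheese1 Cheese2 Cheese3
0 Evan Gouda NaN NaN
1 John Cheddar Havarti Blue
2 Evan Cheddar Gouda NaN
3 John Havarti Swiss NaN'''), sep=r'\s+')

# Reshape to long format: one row per (Idx, Name, cheese), dropping empty cells
tidy = pd.melt(x, ['Idx', 'Name'], value_name='Cheese').dropna()
# Count how often each person ate each cheese
tidy = tidy.groupby(['Name', 'Cheese']).size().reset_index(name='n')
# Convert counts to each cheese's share of the person's total
tidy['p'] = tidy.groupby('Name')['n'].transform(lambda n: n / n.sum())
# Keep the cheese(s) with the highest share for each person
print(tidy[tidy['p'] == tidy.groupby('Name')['p'].transform('max')].drop(columns='n'))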
Which outputs:
Name Cheese p
1 Evan Gouda 0.666667
4 John Havarti 0.400000
It's definitely not as clean as the R version, but perhaps someone more familiar with pandas can weigh in on how to improve it.
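One possible tightening of the pandas version, offered as a sketch rather than a definitive answer: melt only the cheese columns, then let a grouped `value_counts(normalize=True)` compute the shares directly. The frame is rebuilt by hand from the question's data, and the all-empty `Cheese4` column is a hypothetical addition to show that the number of columns is not hard-coded.

```python
import pandas as pd

# Question's data plus a hypothetical, empty Cheese4 column (an assumption
# added for illustration; it is not in the original post).
df = pd.DataFrame({
    'Idx': [0, 1, 2, 3],
    'Name': ['Evan', 'John', 'Evan', 'John'],
    'Cheese1': ['Gouda', 'Cheddar', 'Cheddar', 'Havarti'],
    'Cheese2': [None, 'Havarti', 'Gouda', 'Swiss'],
    'Cheese3': [None, 'Blue', None, None],
    'Cheese4': [None, None, None, None],
})

# Detect the cheese columns by name instead of listing them.
cheese_cols = [c for c in df.columns if c.startswith('Cheese')]

top = (
    df.melt(id_vars='Name', value_vars=cheese_cols, value_name='Cheese')
      .dropna(subset=['Cheese'])            # drop empty cells
      .groupby('Name')['Cheese']
      .value_counts(normalize=True)         # share of each cheese per person
      .rename('Pct')
      .reset_index()
      .sort_values(['Name', 'Pct'], ascending=[True, False])
      .drop_duplicates('Name')              # keep the top row for each person
      .reset_index(drop=True)
)
print(top)
```

Unlike the `p == max(p)` filter above, `drop_duplicates` keeps only one row per person when two cheeses tie for first place, and people with no cheeses at all are simply absent from the result.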