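I have a set of values that are essentially answers to survey questions. When I tally the answers, I want to pretend I had a better-distributed set of respondents by giving each answer a weight. Here is some code showing a simple example: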
from pprint import pprint

# Answer options; each respondent stores an index into these lists.
q1 = [
    'blue',
    'orange',
    'red',
]
q2 = [
    'male',
    'female',
]
q3 = [
    '18-25',
    '26-30',
    '31-40',
    '41+'
]
data = [
    {'q1': 1, 'q2': 1, 'q3': 0},  # orange, female, 18-25
    {'q1': 0, 'q2': 1, 'q3': 0},  # blue, female, 18-25
    {'q1': 1, 'q2': 0, 'q3': 0},  # orange, male, 18-25
    {'q1': 2, 'q2': 1, 'q3': 1},  # red, female, 26-30
    {'q1': 2, 'q2': 1, 'q3': 1},  # red, female, 26-30
    {'q1': 1, 'q2': 0, 'q3': 1},  # orange, male, 26-30
]
counts = {
    'q1': {},
    'q2': {},
    'q3': {}
}
respondent_value = 1  # every respondent counts equally for now
for respondent in data:
    q1_val = q1[respondent['q1']]
    q2_val = q2[respondent['q2']]
    q3_val = q3[respondent['q3']]
    if q1_val not in counts['q1']:
        counts['q1'][q1_val] = 0
    counts['q1'][q1_val] += respondent_value
    if q2_val not in counts['q2']:
        counts['q2'][q2_val] = 0
    counts['q2'][q2_val] += respondent_value
    if q3_val not in counts['q3']:
        counts['q3'][q3_val] = 0
    counts['q3'][q3_val] += respondent_value
pprint(counts)
This currently prints the following values:
{'q1': {'blue': 1, 'orange': 3, 'red': 2},
 'q2': {'female': 4, 'male': 2},
 'q3': {'18-25': 3, '26-30': 3}}
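I want to pretend I have the following demographics:
Given what I want to represent, how can I automatically generate weights for this data? For any value that does not match one of the demographics, I would simply assume a weight of 1.
I am interested in using pandas / numpy if they are useful, but I will use whatever tool is best.
For weighting on a single variable, I would probably do something like this (but I need it to work for multiple variables):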
from pprint import pprint

# Answer options; each respondent stores an index into these lists.
q1 = [
    'blue',
    'orange',
    'red',
]
q2 = [
    'male',
    'female',
]
q3 = [
    '18-25',
    '26-30',
    '31-40',
    '41+'
]
data = [
    {'q1': 1, 'q2': 1, 'q3': 0},  # orange, female, 18-25
    {'q1': 0, 'q2': 1, 'q3': 0},  # blue, female, 18-25
    {'q1': 1, 'q2': 0, 'q3': 0},  # orange, male, 18-25
    {'q1': 2, 'q2': 1, 'q3': 1},  # red, female, 26-30
    {'q1': 2, 'q2': 1, 'q3': 1},  # red, female, 26-30
    {'q1': 1, 'q2': 0, 'q3': 1},  # orange, male, 26-30
]


def get_counts(male_weight, female_weight):
    """Tally every question, counting each respondent by their gender weight."""
    counts = {
        'q1': {},
        'q2': {},
        'q3': {}
    }
    for respondent in data:
        q1_val = q1[respondent['q1']]
        q2_val = q2[respondent['q2']]
        q3_val = q3[respondent['q3']]
        if q2_val == 'female':
            respondent_value = female_weight
        else:
            respondent_value = male_weight
        if q1_val not in counts['q1']:
            counts['q1'][q1_val] = 0
        counts['q1'][q1_val] += respondent_value
        if q2_val not in counts['q2']:
            counts['q2'][q2_val] = 0
        counts['q2'][q2_val] += respondent_value
        if q3_val not in counts['q3']:
            counts['q3'][q3_val] = 0
        counts['q3'][q3_val] += respondent_value
    return counts


total_respondents = len(data) * 1.0
counts = get_counts(1, 1)  # unweighted baseline
print("Starting counts")
print("=================")
pprint(counts)
print("\n")
# Target population shares, in percent.
female_pop = 50
male_pop = 50
# Observed sample shares, in percent.
sample_females = (counts['q2']['female'] / total_respondents) * 100
sample_males = (counts['q2']['male'] / total_respondents) * 100
# Weight = target share / sample share.
female_weight = female_pop / sample_females
male_weight = male_pop / sample_males
weighted_counts = get_counts(male_weight, female_weight)
print("Weighted Counts")
print("===============")
pprint(weighted_counts)
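Answer 0 (score: 0)
If I understand correctly, you want the distribution of color responses, but you want to give more weight to the age and gender groups that are under-represented in your sample. For example, if women responded at twice the rate of men, you would want each man's answer to count twice as much as each woman's. If that is right, here is an approach using pandas: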
In [69]: import pandas as pd
In [70]: df = pd.DataFrame(dict(color=["orange","blue","orange","red","red","orange"],gender=["female","female","male","female","female","male"], age=["18-25", "18-25", "18-25", "26-30", "26-30", "18-25"]))
In [71]: gender_dist = pd.Series([.5,.5], index=["female","male"])
In [72]: age_dist = pd.Series([.4,.6], index=["18-25","26-30"])
Compute the weights you need to apply to reach the target age/gender distribution:
In [73]: gender_weights = gender_dist / df.gender.groupby(df.gender).count()
In [74]: age_weights = age_dist / df.age.groupby(df.age).count()
In [75]: age_weights
Out[75]:
18-25    0.1
26-30    0.3
dtype: float64
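As a sanity check on the weights above, here is a minimal standalone sketch using the same age column and target shares as the answer. The variable names `ages`, `counts`, `weighted_share` and `per_respondent` are introduced only for this illustration. It shows that dividing the target share by the group count, then summing the weight over the respondents in each group, gives back the target distribution.

```python
import pandas as pd

# Age answers copied from the answer's sample DataFrame.
ages = pd.Series(["18-25", "18-25", "18-25", "26-30", "26-30", "18-25"])

# Target population shares for each age group (from the answer).
age_dist = pd.Series([0.4, 0.6], index=["18-25", "26-30"])

# Number of respondents per age group in the sample.
counts = ages.value_counts()

# Weight per respondent: target share divided by group size.
# The division aligns the two Series on their index labels.
age_weights = age_dist / counts

# Multiplying back by the group size recovers the target share.
weighted_share = age_weights * counts

# Same check done respondent by respondent: look up each person's weight,
# then total the weights within each age group.
per_respondent = ages.map(age_weights)
print(per_respondent.groupby(ages).sum())
```

The weights here are based on raw counts rather than sample proportions, so they differ from "target share / sample share" by a constant factor. That factor cancels once the final distribution is normalized.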
Pivot the sample data to get the count of each color by age and gender:
In [76]: df["value"] = 1
In [77]: pivoted = pd.pivot_table(df, values="value", columns="color", index=["gender","age"], aggfunc="count", fill_value=0)
In [78]: pivoted
Out[78]:
color         blue  orange  red
gender age
female 18-25     1       1    0
       26-30     0       0    2
male   18-25     0       2    0
Reindex the weights so that they align with the pivot table's index:
In [79]: index=pd.MultiIndex.from_product([gender_weights.index, age_weights.index], names=["gender","age"])
In [80]: gender_weights = gender_weights.reindex(index, level=0)
In [81]: age_weights = age_weights.reindex(index, level=1)
In [82]: age_weights
Out[82]:
gender  age
female  18-25    0.1
        26-30    0.3
male    18-25    0.1
        26-30    0.3
dtype: float64
Multiply by the weights:
In [83]: weighted_counts = pivoted.mul(age_weights, axis=0).mul(gender_weights, axis=0)
In [84]: weighted_counts
Out[84]:
color           blue  orange    red
gender age
female 18-25  0.0125  0.0125  0.000
       26-30  0.0000  0.0000  0.075
male   18-25  0.0000  0.0500  0.000
       26-30     NaN     NaN    NaN
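The NaN row appears because the sample has no respondents in the (male, 26-30) cell, so the pivot table has no row to line up with the weights. `DataFrame.sum()` skips NaN by default, so the totals are unaffected. If you would rather see an explicit row of zeros, one possible sketch is to reindex the counts onto every gender/age combination first. Here `pd.crosstab` stands in for the pivot table, and `full_index` and `pivoted_full` are names made up for this example.

```python
import pandas as pd

# Sample data copied from the answer.
df = pd.DataFrame({
    "color": ["orange", "blue", "orange", "red", "red", "orange"],
    "gender": ["female", "female", "male", "female", "female", "male"],
    "age": ["18-25", "18-25", "18-25", "26-30", "26-30", "18-25"],
})

# Count colors per (gender, age) cell; only cells that occur get a row.
pivoted = pd.crosstab([df["gender"], df["age"]], df["color"])

# Every possible (gender, age) combination, including the unobserved one.
full_index = pd.MultiIndex.from_product(
    [["female", "male"], ["18-25", "26-30"]], names=["gender", "age"]
)

# Reindexing with fill_value=0 turns the missing cell into a row of zeros.
pivoted_full = pivoted.reindex(full_index, fill_value=0)
print(pivoted_full)
```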
Get the weighted distribution, then normalize it:
In [85]: dist = weighted_counts.sum()
In [86]: dist / dist.sum()
Out[86]:
color
blue 0.083333
orange 0.416667
red 0.500000
dtype: float64
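The steps above can also be written as one weight per respondent, which makes it easy to add more variables and to fall back to a weight of 1 for any value that has no target share, as the question asks. This is only a sketch under assumptions: `respondent_weights` and the `targets` dictionary are hypothetical names, and the weights use "target share / sample share" instead of the count-based weights above (the two differ by a constant that cancels on normalization).

```python
import pandas as pd


def respondent_weights(df, targets):
    """Return one weight per row: the product of the per-column weights.

    targets maps a column name to a Series of target shares indexed by
    category label. Categories without a target share get weight 1.
    """
    weights = pd.Series(1.0, index=df.index)
    for column, target in targets.items():
        # Share of each category actually observed in the sample.
        sample_share = df[column].value_counts(normalize=True)
        # Aligned division; labels missing from either side become NaN.
        ratio = target / sample_share
        # Look up each row's ratio and default unmatched rows to 1.
        weights *= df[column].map(ratio).fillna(1.0)
    return weights


# Sample data and target shares copied from the answer.
df = pd.DataFrame({
    "color": ["orange", "blue", "orange", "red", "red", "orange"],
    "gender": ["female", "female", "male", "female", "female", "male"],
    "age": ["18-25", "18-25", "18-25", "26-30", "26-30", "18-25"],
})
targets = {
    "gender": pd.Series({"female": 0.5, "male": 0.5}),
    "age": pd.Series({"18-25": 0.4, "26-30": 0.6}),
}

df["weight"] = respondent_weights(df, targets)

# Weighted color distribution, normalized to sum to 1.
color_dist = df.groupby("color")["weight"].sum()
color_dist = color_dist / color_dist.sum()
print(color_dist)
```

On this sample the normalized result agrees with the distribution shown above. Note that multiplying the per-variable weights reproduces each target margin only approximately when the variables are correlated in the sample.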