Weighting values based on variables

Asked: 2015-02-23 16:56:12

Tags: python numpy pandas statistics scipy

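I have a set of values that are basically answers to survey questions. When I tally the answers, I want to simulate a more representative distribution of respondents by giving each answer a weight. Here is code showing a simple example: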

from pprint import pprint

q1 = [
    'blue',
    'orange',
    'red',
]

q2 = [
    'male',
    'female',
]

q3 = [
    '18-25',
    '26-30',
    '31-40',
    '41+'
]

data = [
    {'q1': 1, 'q2': 1, 'q3': 0},  # orange, female, 18-25
    {'q1': 0, 'q2': 1, 'q3': 0},  # blue, female, 18-25
    {'q1': 1, 'q2': 0, 'q3': 0},  # orange, male, 18-25
    {'q1': 2, 'q2': 1, 'q3': 1},  # red, female, 26-30
    {'q1': 2, 'q2': 1, 'q3': 1},  # red, female, 26-30
    {'q1': 1, 'q2': 0, 'q3': 1},  # orange, male, 26-30
]

counts = {
    'q1': {},
    'q2': {},
    'q3': {}
}

respondent_value = 1

for respondent in data:
    q1_val = q1[respondent['q1']]
    q2_val = q2[respondent['q2']]
    q3_val = q3[respondent['q3']]

    if q1_val not in counts['q1']:
        counts['q1'][q1_val] = 0

    counts['q1'][q1_val] += respondent_value

    if q2_val not in counts['q2']:
        counts['q2'][q2_val] = 0

    counts['q2'][q2_val] += respondent_value

    if q3_val not in counts['q3']:
        counts['q3'][q3_val] = 0

    counts['q3'][q3_val] += respondent_value

pprint(counts)

This currently prints the following values:

{'q1': {'blue': 1, 'orange': 3, 'red': 2},
 'q2': {'female': 4, 'male': 2},
 'q3': {'18-25': 3, '26-30': 3}}

I want to pretend that I have the following demographics:

  • 50% male
  • 50% female
  • 40% 18-25
  • 60% 26-30

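How can I automatically generate weights for this data based on the population I want to represent? For any value that does not match one of the listed demographics, I would simply assume a weight of 1.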

I am interested in using pandas / numpy if they are useful, but I will use whatever tool is best.

For weighting on a single variable I would probably do something like this (but I need to weight on multiple variables):

from pprint import pprint

q1 = [
    'blue',
    'orange',
    'red',
]

q2 = [
    'male',
    'female',
]

q3 = [
    '18-25',
    '26-30',
    '31-40',
    '41+'
]

data = [
    {'q1': 1, 'q2': 1, 'q3': 0},  # orange, female, 18-25
    {'q1': 0, 'q2': 1, 'q3': 0},  # blue, female, 18-25
    {'q1': 1, 'q2': 0, 'q3': 0},  # orange, male, 18-25
    {'q1': 2, 'q2': 1, 'q3': 1},  # red, female, 26-30
    {'q1': 2, 'q2': 1, 'q3': 1},  # red, female, 26-30
    {'q1': 1, 'q2': 0, 'q3': 1},  # orange, male, 26-30
]


def get_counts(male_weight, female_weight):
    counts = {
        'q1': {},
        'q2': {},
        'q3': {}
    }

    for respondent in data:
        q1_val = q1[respondent['q1']]
        q2_val = q2[respondent['q2']]
        q3_val = q3[respondent['q3']]

        if q2_val == 'female':
            respondent_value = female_weight
        else:
            respondent_value = male_weight

        if q1_val not in counts['q1']:
            counts['q1'][q1_val] = 0

        counts['q1'][q1_val] += respondent_value

        if q2_val not in counts['q2']:
            counts['q2'][q2_val] = 0

        counts['q2'][q2_val] += respondent_value

        if q3_val not in counts['q3']:
            counts['q3'][q3_val] = 0

        counts['q3'][q3_val] += respondent_value

    return counts

total_respondents = len(data) * 1.0
counts = get_counts(1, 1)
print("Starting counts")
print("=================")
pprint(counts)
print("\n")

female_pop = 50
male_pop = 50

sample_females = (counts['q2']['female'] / total_respondents) * 100
sample_males = (counts['q2']['male'] / total_respondents) * 100

female_weight = female_pop / sample_females
male_weight = male_pop / sample_males

weighted_counts = get_counts(male_weight, female_weight)
print("Weighted Counts")
print("===============")
pprint(weighted_counts)
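
As a compact check of the arithmetic in the code above, here is a minimal sketch of the same single-variable calculation. The names `genders`, `target`, `weights`, and `weighted_totals` are introduced only for this illustration, and the gender list is copied from the sample data.

```python
from collections import Counter

# Gender of each respondent, copied from the sample data (4 female, 2 male).
genders = ['female', 'female', 'male', 'female', 'female', 'male']

# Desired population shares (illustrative names, not from the original code).
target = {'female': 0.5, 'male': 0.5}

sample_counts = Counter(genders)
n = len(genders)

# weight = target share / observed sample share
weights = {g: target[g] / (sample_counts[g] / n) for g in target}

# After weighting, each gender contributes the same total (about 3 each).
weighted_totals = {g: sample_counts[g] * weights[g] for g in target}

print(weights)
print(weighted_totals)
```

With four women and two men in the sample, the female weight comes out at roughly 0.75 and the male weight at roughly 1.5, so both groups end up with the same weighted total.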

1 answer:

Answer 0 (score: 0)

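If I understand correctly, you want the distribution of responses by color, but you want to give more weight to age and gender groups that are underrepresented in your sample. For example, if women responded at twice the rate of men, you want each man's answer to count twice as much as each woman's. If that is correct, here is an approach using pandas: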

In [69]: import pandas as pd

In [70]: df = pd.DataFrame(dict(color=["orange","blue","orange","red","red","orange"],gender=["female","female","male","female","female","male"], age=["18-25", "18-25", "18-25", "26-30", "26-30", "18-25"]))

In [71]: gender_dist = pd.Series([.5,.5], index=["female","male"])

In [72]: age_dist = pd.Series([.4,.6], index=["18-25","26-30"])

Compute the weights you need to apply to reach the target age/gender distribution:

In [73]: gender_weights = gender_dist / df.gender.groupby(df.gender).count()

In [74]: age_weights = age_dist / df.age.groupby(df.age).count()

In [75]: age_weights
Out[75]: 
18-25    0.1
26-30    0.3
dtype: float64
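
To see where 0.1 and 0.3 come from, the following minimal sketch repeats the age calculation on its own. It assumes `value_counts()` as a stand-in for the `groupby(...).count()` call in the session (both produce per-group counts), and the variable `recovered` is introduced only for this illustration.

```python
import pandas as pd

# Age group of each respondent, as in the DataFrame built in the session.
age = pd.Series(["18-25", "18-25", "18-25", "26-30", "26-30", "18-25"])
age_dist = pd.Series([0.4, 0.6], index=["18-25", "26-30"])

# Per-group sample counts: four respondents aged 18-25, two aged 26-30.
age_counts = age.value_counts()

# Division aligns on the index labels: 0.4 / 4 and 0.6 / 2.
age_weights = age_dist / age_counts

# Sanity check: count * weight gives back the target share for each group.
recovered = age_counts * age_weights

print(age_weights)
print(recovered)
```

Each weight is the target share divided by the number of respondents in that group, so multiplying a group's count by its weight returns the target share.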

Pivot the sample data to get the count of each color by age and gender:

In [76]: df["value"] = 1

In [77]: pivoted = pd.pivot_table(df, values="value", columns="color", index=["gender","age"], aggfunc="count", fill_value=0)

In [78]: pivoted
Out[78]: 
color         blue  orange  red
gender age                     
female 18-25     1       1    0
       26-30     0       0    2
male   18-25     0       2    0

Reindex the weights so they align with the pivot table's index:

In [79]: index=pd.MultiIndex.from_product([gender_weights.index, age_weights.index], names=["gender","age"])

In [80]: gender_weights = gender_weights.reindex(index, level=0)

In [81]: age_weights = age_weights.reindex(index, level=1)

In [82]: age_weights
Out[82]: 
gender  age  
female  18-25    0.1
        26-30    0.3
male    18-25    0.1
        26-30    0.3
dtype: float64

Multiply by the weights:

In [83]: weighted_counts = pivoted.mul(age_weights, axis=0).mul(gender_weights, axis=0)

In [84]: weighted_counts
Out[84]: 
color           blue  orange    red
gender age                         
female 18-25  0.0125  0.0125  0.000
       26-30  0.0000  0.0000  0.075
male   18-25  0.0000  0.0500  0.000
       26-30     NaN     NaN    NaN

Sum the weighted counts per color to get the weighted distribution, then normalize it:

In [85]: dist = weighted_counts.sum()

In [86]: dist / dist.sum()
Out[86]: 
color
blue      0.083333
orange    0.416667
red       0.500000
dtype: float64
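
The same result can be reached without the pivot table by attaching a weight to each respondent. The sketch below is one possible way to do that, not the answer's own code: the `respondent_weights` helper and the `targets` dictionary are hypothetical names, weights are expressed as target share divided by sample share (so a weight of 1 means "leave unchanged"), and any category without a target falls back to 1, as the question asks.

```python
import pandas as pd

df = pd.DataFrame({
    "color": ["orange", "blue", "orange", "red", "red", "orange"],
    "gender": ["female", "female", "male", "female", "female", "male"],
    "age": ["18-25", "18-25", "18-25", "26-30", "26-30", "18-25"],
})

# Target population shares per variable (hypothetical structure).
targets = {
    "gender": pd.Series({"female": 0.5, "male": 0.5}),
    "age": pd.Series({"18-25": 0.4, "26-30": 0.6}),
}


def respondent_weights(frame, targets):
    """Multiply the per-variable ratios (target share / sample share)."""
    weights = pd.Series(1.0, index=frame.index)
    for column, target in targets.items():
        sample_share = frame[column].value_counts(normalize=True)
        ratio = target / sample_share
        # Categories present in the sample but without a target get weight 1.
        ratio = ratio.reindex(sample_share.index).fillna(1.0)
        weights = weights * frame[column].map(ratio)
    return weights


df["weight"] = respondent_weights(df, targets)

# Weighted color distribution, normalized to sum to 1.
weighted = df.groupby("color")["weight"].sum()
dist = weighted / weighted.sum()
print(dist)
```

On this sample the normalized distribution agrees with the output above, because the two sets of weights differ only by a constant factor. Note that multiplying per-variable ratios in this way does not by itself guarantee that the weighted sample matches both the gender and the age targets at the same time.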