Trying to create a grouping variable in Python

Time: 2015-12-08 20:57:46

Tags: python ipython ipython-notebook

I have a column of age values that I need to convert into the age ranges 18-29, 30-39, 40-49, 50-59, 60-69, and 70+:

As an example, for some of the data in the DataFrame 'file', I have:

[image not available]

and I would like:

[image not available]

I tried the following:

file['agerange'] = file[['age']].apply(lambda x: "18-29" if (x[0] > 16
                                       or x[0] < 30) else "other")

I would rather not just do a groupby, since the bucket sizes aren't uniform, but I'd be open to that as a solution if it works.

Thanks in advance!

4 Answers:

Answer 0: (score: 1)

Wouldn't a nested loop be the simplest solution?

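import random
ages = [random.randint(18, 100) for _ in range(100)]
# (low, high) bounds are inclusive; the last range has no upper bound
age_ranges = [(18, 29), (30, 39), (40, 49), (50, 59), (60, 69), (70,)]

for a in ages:
    for r in age_ranges:
        # <= so that the upper bound itself (e.g. 29) stays in its range
        if a >= r[0] and (len(r) == 1 or a <= r[1]):
            print(a, r)
            break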
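
The same lookup can also be done without the inner loop by using the standard-library bisect module. This is only a minimal sketch and not part of the original answer: the names LOWER_BOUNDS, LABELS and age_range are made up for illustration, and it assumes that ages below 18 should go into an 'other' bucket.

```python
from bisect import bisect_right

# Lower edge of each bucket, in ascending order (illustrative names).
LOWER_BOUNDS = [18, 30, 40, 50, 60, 70]
LABELS = ['18-29', '30-39', '40-49', '50-59', '60-69', '70+']

def age_range(age):
    """Return the label of the bucket containing age, or 'other' if below 18."""
    # bisect_right counts how many lower bounds are <= age.
    i = bisect_right(LOWER_BOUNDS, age)
    return LABELS[i - 1] if i > 0 else 'other'

for a in [17, 18, 29, 30, 69, 70, 95]:
    print(a, age_range(a))
```

Because only the lower edges are stored, the boundary ages (29, 30, 69, 70) cannot fall between two ranges.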

Answer 1: (score: 1)

It looks like you are using the Pandas library. It includes a function for doing this: http://pandas.pydata.org/pandas-docs/version/0.16.0/generated/pandas.cut.html

Here is my attempt:

import pandas as pd

ages = pd.DataFrame([81, 42, 18, 55, 23, 35], columns=['age'])

bins = [18, 30, 40, 50, 60, 70, 120]
labels = ['18-29', '30-39', '40-49', '50-59', '60-69', '70+']
# right=False gives left-closed bins [18, 30), [30, 40), ... so that 30 maps to '30-39'
ages['agerange'] = pd.cut(ages.age, bins, labels=labels, right=False)

print(ages)

   age agerange
0   81      70+
1   42    40-49
2   18    18-29
3   55    50-59
4   23    18-29
5   35    30-39
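
Since the bin edges coincide with the first age of each range, the boundary ages are worth checking: with pd.cut's default right=True, each bin includes its right edge, so an age of exactly 30 would fall into the bin ending at 30. The following is a small sketch, not from the original answer, that bins only boundary ages using right=False (left-closed intervals). Using float('inf') as the top edge instead of 120 is an assumption made here so that no upper cap is needed.

```python
import pandas as pd

# Boundary ages only, to make the edge behaviour visible.
ages = pd.DataFrame({'age': [18, 29, 30, 69, 70, 105]})

# Assumption: float('inf') as the last edge leaves the top bucket open-ended.
bins = [18, 30, 40, 50, 60, 70, float('inf')]
labels = ['18-29', '30-39', '40-49', '50-59', '60-69', '70+']

# right=False makes each bin closed on the left and open on the right:
# [18, 30), [30, 40), ..., [70, inf)
ages['agerange'] = pd.cut(ages['age'], bins=bins, labels=labels, right=False)

print(ages)
```

Ages below the first edge (under 18 here) are left as NaN by pd.cut rather than raising an error.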

Answer 2: (score: 0)

You can use itertools.groupby with // 10 as the key function.

In [10]: ages = [random.randint(18, 99) for _ in range(100)]

In [11]: [(key, list(group)) for (key, group) in itertools.groupby(sorted(ages), key=lambda x: x // 10)]
Out[11]: 
[(1, [18]),
 (2, [20, 21, 21, 22, 23, 24, 25, 26, 26, 26, 27, 27, 28]),
 (3, [30, 30, 32, 32, 34, 35, 36, 37, 37]),
 (4, [41, 42, 42, 43, 43, 44, 45, 47, 48]),
 (5, [50, 51, 52, 53, 54, 55, 56, 56, 56, 56, 57, 58, 58, 58, 58]),
 (6, [60, 61, 62, 62, 62, 65, 65, 66, 66, 66, 66, 67, 69, 69, 69]),
 (7, [71, 71, 72, 72, 73, 75, 75, 77, 77, 78]),
 (8, [83, 83, 83, 83, 84, 84, 85, 86, 86, 87, 87, 88, 89, 89, 89]),
 (9, [91, 91, 92, 92, 93, 94, 97, 97, 98, 98, 99, 99, 99])]

Keep in mind that groupby needs sorted data, so sort first. Alternatively, do it manually with a dictionary and a loop.

In [14]: groups = collections.defaultdict(list)

In [15]: for x in ages:
   ....:     groups[x//10].append(x)

In [16]: groups
Out[16]: defaultdict(<type 'list'>, {1: [18], 
             2: [26, 28, 21, 20, 26, 24, 21, 27, 25, 23, 27, 26, 22], 
             3: [37, 30, 32, 32, 35, 30, 36, 37, 34], 
             4: [45, 42, 43, 41, 47, 43, 48, 44, 42], 
             5: [52, 56, 58, 55, 58, 51, 58, 58, 57, 56, 53, 56, 50, 54, 56], 
             6: [69, 65, 62, 61, 65, 66, 66, 62, 69, 66, 67, 66, 60, 62, 69], 
             7: [71, 77, 71, 72, 77, 73, 78, 72, 75, 75], 
             8: [87, 83, 84, 86, 86, 83, 83, 87, 85, 83, 89, 88, 84, 89, 89], 
             9: [99, 92, 99, 98, 91, 94, 97, 92, 98, 97, 91, 93, 99]})
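
To see why the sorting matters for itertools.groupby (and why the dictionary approach does not need it), here is a short illustration with made-up sample ages. The helper name decade is hypothetical.

```python
import itertools

ages = [25, 31, 27, 38, 22]  # deliberately unsorted sample data

def decade(x):
    return x // 10

# groupby only merges *consecutive* items with equal keys,
# so unsorted input yields the same key more than once.
unsorted_groups = [(k, list(g)) for k, g in itertools.groupby(ages, key=decade)]
print(unsorted_groups)

# Sorting first puts equal keys next to each other.
sorted_groups = [(k, list(g)) for k, g in itertools.groupby(sorted(ages), key=decade)]
print(sorted_groups)
```

On the unsorted list the keys 2 and 3 alternate, so every age ends up in its own group. After sorting there is one group per decade.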

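For more complex groupings, you can make the key function arbitrarily complex. For example, to put everyone aged 70 and over into a single group, use lambda x: min(x // 10, 7). This works with both approaches. If you like, you can even turn the keys into strings: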

In [23]: keyfunc = lambda x: "{0}0-{0}9".format(x//10) if x < 70 else "70+"
In [24]: [(key, list(group)) for (key, group) in itertools.groupby(sorted(ages), key=keyfunc)]
Out[24]: 
[('10-19', [18]),
 ('20-29', [20, 21, 21, 22, 23, 24, 25, 26, 26, 26, 27, 27, 28]),
 ('30-39', [30, 30, 32, 32, 34, 35, 36, 37, 37]),
 ('40-49', [41, 42, 42, 43, 43, 44, 45, 47, 48]),
 ('50-59', [50, 51, 52, 53, 54, 55, 56, 56, 56, 56, 57, 58, 58, 58, 58]),
 ('60-69', [60, 61, 62, 62, 62, 65, 65, 66, 66, 66, 66, 67, 69, 69, 69]),
 ('70+',   [all the rest]]
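
The string keys above still split the youngest people into '10-19' and '20-29', whereas the question asks for a single 18-29 range. A rough sketch of one way to adapt the key function, shown here with the dictionary approach: the helper name age_label is hypothetical, and treating every age below 30 as '18-29' is an assumption based on the data starting at 18.

```python
import collections

def age_label(x):
    """Label an age with the question's buckets (hypothetical helper)."""
    if x >= 70:
        return '70+'
    if x < 30:
        # Assumption: the data starts at 18, so everything below 30 is '18-29'.
        return '18-29'
    return '{0}0-{0}9'.format(x // 10)

ages = [18, 23, 29, 30, 45, 69, 70, 88]
groups = collections.defaultdict(list)
for x in ages:
    groups[age_label(x)].append(x)

print(dict(groups))
```

The same function could be passed as key= to itertools.groupby on the sorted ages, because the labels change in the same order as the ages do.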

Answer 3: (score: 0)

A friend came up with this answer offline:

def age_buckets(x):
    if x < 30:
        return '18-29'
    elif x < 40:
        return '30-39'
    elif x < 50:
        return '40-49'
    elif x < 60:
        return '50-59'
    elif x < 70:
        return '60-69'
    elif x >= 70:
        return '70+'
    else:
        return 'other'

file['agerange'] = file.age.apply(age_buckets)

Thanks to everyone who took a look at this!