这是我的一些数据的示例:
from pandas import *
df = DataFrame({"Experience":['8 to 9 years', '12 to 13 years', '13 to 14 years', '17 to 18 years',
'5 to 6 years', '19 to 20 years', '20 or more years', '14 to 15 years', '3 to 4 years',
'10 to 11 years', 'Less than a year', '4 to 5 years', '6 to 7 years',
'2 to 3 years', '15 to 16 years', '11 to 12 years', '16 to 17 years', '18 to 19 years',
'1 to 2 years', '9 to 10 years', '7 to 8 years', '8 to 9 years',
'12 to 13 years', '13 to 14 years', '14 to 15 years', '3 to 4 years',
'17 to 18 years', '5 to 6 years', '19 to 20 years', '20 or more years',
'10 to 11 years', 'Less than a year', '4 to 5 years', '6 to 7 years',
'2 to 3 years', '15 to 16 years', '11 to 12 years', '16 to 17 years',
'18 to 19 years', '1 to 2 years', '9 to 10 years', '7 to 8 years'],
"Salary":[50000, 20000, 80000, 60000, 70000, 50000, 45000, 47000, 36000, 74000, 50000, 20000, 80000,
60000, 70000, 50000, 45000, 47000, 36000, 74000, 90000, 50000, 20000, 80000, 60000, 70000,
50000, 45000, 47000, 36000, 74000, 50000, 20000, 80000, 60000, 70000, 50000, 45000, 60000,
70000, 50000, 45000]})
df
df['Salary'] = df['Salary'].astype('int64')
这是我用来比较每个经验水平的工资中位数的条形图:
from numpy import median
%matplotlib inline
group = df.groupby('Experience')
group.aggregate(median).plot(kind='barh')
这给了我这张图:
我希望条形图的列名有序(例如“不到一年”,“1到2年”等),但我很挣扎。大熊猫初学者最干净的方法是什么?
答案 0 :(得分:1)
两种方法,第一种是简单和自动,第二种是按一个系统排序,另一种是用另一种系统标记。
最初的问题是,在“体验”中给出文本字符串的条形图按字母顺序排序。我们想要一个数字顺序。一种快速的方法是从字符串中提取数字(使用函数to_min_number
)和组/绘图,然后编辑轴标签,使图形仍然不言自明。
from pandas import *
from matplotlib.pyplot import show
df = DataFrame({"Experience":['8 to 9 years', '12 to 13 years', '13 to 14 years', '17 to 18 years',
'5 to 6 years', '19 to 20 years', '20 or more years', '14 to 15 years', '3 to 4 years',
'10 to 11 years', 'Less than a year', '4 to 5 years', '6 to 7 years',
'2 to 3 years', '15 to 16 years', '11 to 12 years', '16 to 17 years', '18 to 19 years',
'1 to 2 years', '9 to 10 years', '7 to 8 years', '8 to 9 years',
'12 to 13 years', '13 to 14 years', '14 to 15 years', '3 to 4 years',
'17 to 18 years', '5 to 6 years', '19 to 20 years', '20 or more years',
'10 to 11 years', 'Less than a year', '4 to 5 years', '6 to 7 years',
'2 to 3 years', '15 to 16 years', '11 to 12 years', '16 to 17 years',
'18 to 19 years', '1 to 2 years', '9 to 10 years', '7 to 8 years'],
"Salary":[50000, 20000, 80000, 60000, 70000, 50000, 45000, 47000, 36000, 74000, 50000, 20000, 80000,
60000, 70000, 50000, 45000, 47000, 36000, 74000, 90000, 50000, 20000, 80000, 60000, 70000,
50000, 45000, 47000, 36000, 74000, 50000, 20000, 80000, 60000, 70000, 50000, 45000, 60000,
70000, 50000, 45000]})
df
df['Salary'] = df['Salary'].astype('int64')
# Making a new column of Experience values that will plot gracefully
def to_min_number(experience):
t = experience.split(' ')[0]
if t == 'Less': return 0
return int(t)
df['Minimum experience'] = map(to_min_number, df['Experience'])
from numpy import median
group = df.groupby('Minimum experience')
barplot = group.aggregate(median).plot(kind='barh', legend=None)
barplot.set_ylabel('Minimum years experience, non-overlapping')
barplot.set_xlabel('Salary, USD')
show()
如果您必须拥有原始文本字符串,则可以根据Minimum experience
列中的值更改y-tick-labels的文本。自动pandas
绘图只为数字标签腾出空间,因此我们在我们绘制的轴的左边缘强制增加空间:
# We are overriding the barplot defaults, so enforcing a new axis layout
fig, ax = subplots()
subplots_adjust(left=0.3) # Argument is proportion of figure width; found by trial-and-error
barplot = group.aggregate(median).plot(ax=ax, kind='barh', legend=None) # pass it the ax
barplot.set_ylabel('Experience')
barplot.set_xlabel('Salary, USD')
# Need a list of new tick labels in lower-to-upper order. Use the group object, since we have it:
labellist = []
for i, v in group:
labellist.append({'I':int(i), 'T':v.Experience.values[0]})
labeldf = DataFrame(labellist)
barplot.set_yticklabels(labeldf.sort_values(by='I')['T'])
show()
请注意,如果原始文本字符串不是由选择有限的程序生成的,那么您应该对变体进行更多检查:如果有人写了“最多1年”会怎样? “超过20年”?