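I have a Pandas dataframe containing generating-unit capacity (MW) data, categorized by fuel type. I want to show the estimated distribution of plant capacity in two different ways: by plant (easy) and by MW (harder). Here is an example: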
# import libraries
import pandas as pd
import numpy as np
import seaborn as sns
# create and seed a RandomState object (to make the numbers below repeatable)
rnd = np.random.RandomState(7)
# generate fake data for each fuel type, collecting one frame per fuel
# (DataFrame.append was removed in pandas 2.0, so build a list and concatenate)
frames = []
for myfuel in ['Biomass','Coal','Hydro','Natural Gas','Oil','Solar','Wind','Other']:
    mymean = rnd.uniform(low=2.8,high=3.2)
    mysigma = rnd.uniform(low=0.6,high=1.0)
    frames.append(
        pd.DataFrame({'Fuel': myfuel,
                      'MW': rnd.lognormal(mean=mymean,sigma=mysigma,size=1000)
                      })
    )
df = pd.concat(frames, ignore_index=True)
# make violinplot
sns.violinplot(x = 'Fuel',
               y = 'MW',
               data=df,
               inner=None,
               scale='area',
               cut=0,
               linewidth=0.5
               )
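Here is the plot of estimated plant-size distributions (in MW) that this code produces:
Without more context, this violin plot is quite deceptive. Because it is unweighted, the thin tail at the top of each category hides the fact that the relatively few plants in the tail hold a lot of (perhaps even most of) the MW of capacity. So I want a second plot showing the distribution by MW, which is basically a weighted version of this first violin plot.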
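To make the point about the tail concrete, the share of total capacity held by the largest plants can be computed directly. This is only a sketch: the single-fuel dataframe `plants` and its lognormal parameters are stand-ins chosen for illustration, not real plant data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for one fuel type: 1000 lognormal plant sizes in MW
rnd = np.random.RandomState(7)
plants = pd.DataFrame({'Fuel': 'Coal',
                       'MW': rnd.lognormal(mean=3.0, sigma=0.8, size=1000)})

# Sort plants from largest to smallest and accumulate their share of total MW
mw_sorted = plants['MW'].sort_values(ascending=False).to_numpy()
cum_share = np.cumsum(mw_sorted) / mw_sorted.sum()

# Share of total capacity held by the largest 10% of plants
n_top = len(mw_sorted) // 10
top_decile_share = cum_share[n_top - 1]
print(f"Largest 10% of plants hold {top_decile_share:.0%} of the MW")
```

Whatever the exact figure, the largest tenth of the plants holds more than a tenth of the capacity, which is the information the unweighted violin plot hides.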
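I'm wondering whether anyone has found an elegant way to make such a "weighted" violin plot, or has an idea of the most elegant way to go about it.
I figure I could loop through each row of the plant-level dataframe and decompose the plant data into MW-level data in a new dataframe. For example, for a row in the plant-level dataframe showing a 350 MW plant, I could decompose it into 3500 new rows in the new dataframe, each representing 100 kW of capacity. (I think I would have to go down to at least 100 kW resolution, because some of these plants are quite small, in the 100 kW range.) The new dataframe would be enormous, but I could then make a violin plot of the decomposed data. That seems rather brute force. Is there a better way?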
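The row-by-row decomposition described above could probably be done without an explicit loop by repeating index labels. The following is a minimal sketch under assumptions: the three-row `df_plants` frame is made up for illustration, and `norm` is the assumed 100 kW resolution.

```python
import numpy as np
import pandas as pd

# Hypothetical plant-level data (values chosen only for illustration)
df_plants = pd.DataFrame({'Fuel': ['Coal', 'Coal', 'Solar'],
                          'MW': [350.0, 0.3, 1.2]})

norm = 0.1  # assumed resolution in MW (0.1 MW = 100 kW)

# Number of norm-sized blocks in each plant, rounded to the nearest integer
counts = np.rint(df_plants['MW'].to_numpy() / norm).astype(int)

# Repeat each row label `counts` times, then pull all of those rows in one step
dfw = df_plants.loc[df_plants.index.repeat(counts)].reset_index(drop=True)

print(len(dfw))  # 3500 + 3 + 12 = 3515 rows
```

This is still the brute-force idea (the result is just as large), but it avoids building one small dataframe per plant.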
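Update:
I implemented the brute-force approach described above. Here is what it looks like, in case anyone is interested. This is not the "answer" to this question, because I would still be interested if someone knows a more elegant, simpler, or more efficient way to do this. So please chime in if you know of one. Otherwise, I hope this brute-force approach might be helpful to someone in the future.
To make it easy to see that the weighted violin plot makes sense, I replaced the random data with a simple uniform sequence of numbers from 0 to 10. With this change, the violin plot of df should look fairly uniform, while the violin plot of the weighted data (dfw) should get steadily wider toward the top of the violins. That is exactly what happens (see the violin plots below).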
# import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# generate fake data for each fuel type, collecting one frame per fuel
# (DataFrame.append was removed in pandas 2.0, so build a list and concatenate)
frames = []
for myfuel in ['Biomass','Coal','Hydro','Natural Gas','Oil','Solar','Wind','Other']:
    frames.append(
        pd.DataFrame({'Fuel': myfuel,
                      # To make it easy to see that the violinplot of dfw (below)
                      # makes sense, here we'll just use a simple range
                      # from 0 to 10
                      'MW': np.arange(11)
                      })
    )
df = pd.concat(frames, ignore_index=True)
# I have to recast the data type here to avoid an error when using violinplot below
df.MW = df.MW.astype(float)
# Define the MW size by which to normalize all of the units
# Careful: too big -> loss of fidelity in data for small plants
#          too small -> dfw will need to store an enormous amount of data
norm = 0.1 # this is in MW, so 0.1 MW = 100 kW
# collect the weighted rows here, one small frame per plant
wframes = []
# loop through rows of df
for index, row in df.iterrows():
    # calculate the number of norm-sized units within the MW of each plant
    mynum = int(round(row['MW']/norm))
    # a plant that rounds to zero units contributes no rows
    if mynum == 0:
        continue
    # add mynum rows, each with Fuel = row['Fuel'] and MW = row['MW']
    wframes.append(
        pd.DataFrame({'Fuel': row['Fuel'],
                      'MW': np.full(mynum, row['MW'], dtype='float')
                      })
    )
# build dfw in one step (DataFrame.append was removed in pandas 2.0)
dfw = pd.concat(wframes, ignore_index=True)
# since dfw will be huge, specify data types (in particular, use "category" for Fuel to limit dfw size)
dfw = dfw.astype(dtype={'Fuel':'category', 'MW':'float'})
# Set up figure and axes
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, sharey='row')
# make violinplot of the plant-level data (unweighted)
sns.violinplot(x = 'Fuel',
               y = 'MW',
               data=df,
               inner=None,
               scale='area',
               cut=0,
               linewidth=0.5,
               ax = ax1
               )
# make violinplot of the MW-level data (weighted)
sns.violinplot(x = 'Fuel',
               y = 'MW',
               data=dfw,
               inner=None,
               scale='area',
               cut=0,
               linewidth=0.5,
               ax = ax2
               )
# loop through the set of tick labels for both axes
# set tick label size and rotation
for item in (ax1.get_xticklabels() + ax2.get_xticklabels()):
    item.set_fontsize(8)
    item.set_rotation(30)
    item.set_horizontalalignment('right')
plt.show()
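One direction that might avoid the exploded dataframe entirely is to weight the kernel density estimate itself. The sketch below assumes SciPy 1.2 or later, where `scipy.stats.gaussian_kde` accepts a `weights` argument. It only computes the two density curves for a single fuel on the same 0 to 10 toy data, and it is not the seaborn implementation, so the bandwidth and scaling may differ from what `sns.violinplot` draws.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Same toy data as above: plant sizes 0..10 MW for a single fuel
mw = np.arange(11, dtype=float)

# Evaluate only within the data range (comparable to cut=0)
grid = np.linspace(mw.min(), mw.max(), 200)

# Unweighted density: every plant counts once
dens_plants = gaussian_kde(mw)(grid)

# Weighted density: every plant counts in proportion to its MW
dens_mw = gaussian_kde(mw, weights=mw)(grid)
```

Each curve could then be mirrored around a category position with matplotlib's `Axes.fill_betweenx` to draw one violin per fuel. The weighted curve should put more of its mass near the top of the range, as the brute-force dfw plot does.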