Question

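I have a Pandas dataframe containing generator plant capacity (MW) by fuel type. I want to show the estimated distribution of plant capacity in two different ways: by plant (easy) and by MW (harder). Here is an example: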

# import libraries
import pandas as pd
import numpy as np
import seaborn as sns

# create and seed a RandomState object (to make the numbers below repeatable)
rnd = np.random.RandomState(7)

# generate fake data for each fuel type and collect the pieces in a list
# (DataFrame.append was removed in pandas 2.0, so the frame is built with pd.concat)
frames = []
for myfuel in ['Biomass','Coal','Hydro','Natural Gas','Oil','Solar','Wind','Other']:
    mymean = rnd.uniform(low=2.8,high=3.2)
    mysigma = rnd.uniform(low=0.6,high=1.0)
    frames.append(
                   pd.DataFrame({'Fuel': myfuel,
                        'MW': rnd.lognormal(mean=mymean,sigma=mysigma,size=1000)
                       })
                   )

# combine the per-fuel frames into one dataframe
df = pd.concat(frames, ignore_index=True)

# make violinplot
sns.violinplot(x = 'Fuel',
               y = 'MW',
               data=df,
               inner=None,
               scale='area',
               cut=0,
               linewidth=0.5
              )

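Here is the violin plot of the estimated plant size distribution (in MW) that this code produces:

Without more context, this violin plot is very deceptive. Because it is unweighted, the thin tail at the top of each category hides the fact that the relatively few plants in the tail hold a lot (perhaps even most) of the MW capacity. So I want a second plot showing the distribution by MW, which is basically a weighted version of the first violin plot.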
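To put a number on how much capacity sits in the thin tail, one could compute the share of total MW held by the largest plants. This is only a sketch: the single lognormal sample stands in for one fuel type, and the parameters (mean 3.0, sigma 0.8) and the 10% cutoff are assumptions chosen for illustration, not values from the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for one fuel type's plant sizes (MW)
rnd = np.random.RandomState(7)
mw = pd.Series(rnd.lognormal(mean=3.0, sigma=0.8, size=1000), name="MW")

# Sort plants from largest to smallest
sorted_mw = mw.sort_values(ascending=False)

# Share of total capacity held by the largest 10% of plants (assumed cutoff)
top_n = len(sorted_mw) // 10
top_share = sorted_mw.iloc[:top_n].sum() / sorted_mw.sum()

print(f"Largest 10% of plants hold {top_share:.0%} of total MW")
```

If that share is well above 10%, the unweighted violin understates how much capacity lives in the tail.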

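I am wondering whether anyone has found an elegant way to make such a "weighted" violin plot, or has ideas about the most elegant way to do it.

I figure I could loop through each row of the plant-level dataframe and decompose the plant data into MW-level data in a new dataframe. For example, a row in the plant-level dataframe showing a 350 MW plant could be decomposed into 3,500 rows in the new dataframe, each representing 100 kW of capacity. (I think I would have to go down to at least 100 kW resolution, because some of these plants are quite small, in the 100 kW range.) The new dataframe would be enormous, but I could then make a violin plot of the decomposed data. That seems kind of brute force. Is there a better way?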
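The decomposition described above can also be sketched without an explicit Python loop by repeating each row according to its capacity. The three-row frame, the name `norm`, and the 0.1 MW row size are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical plant-level data: one row per plant
df = pd.DataFrame({"Fuel": ["Coal", "Solar", "Wind"],
                   "MW": [350.0, 0.3, 2.0]})

norm = 0.1  # MW represented by each expanded row (0.1 MW = 100 kW)

# Number of norm-sized units in each plant, rounded to the nearest integer
counts = np.rint(df["MW"] / norm).astype(int).to_numpy()

# Repeat each row counts[i] times; every copy keeps the plant's full MW value
dfw = df.loc[df.index.repeat(counts)].reset_index(drop=True)

print(dfw["Fuel"].value_counts())
```

This still builds the large expanded frame, so it is the same brute-force idea, just expressed as a single repeat operation instead of row-by-row appends.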

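Update:

I implemented the brute-force approach described above. Here is what it looks like, in case anyone is interested. This is not an "answer" to the question, because I would still like to hear from anyone who knows a more elegant/simpler/more efficient way to do this. So if you know of one, please chime in. Otherwise, I hope this brute-force approach helps someone in the future.

To make it easy to see that the weighted violin plot makes sense, I replaced the random data with a simple uniform series of numbers from 0 to 10. With this data, the violin plot of df should look quite uniform, while the violin plot of the weighted data (dfw) should get steadily wider toward the top of the violins. That is exactly what happens (see the violin plots below).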
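As a quick numeric check of that expectation, here is a sketch that counts how many expanded rows each value from 0 to 10 would contribute, assuming a row size of 0.1 MW (100 kW). The variable names are illustrative.

```python
import numpy as np

mw = np.arange(11, dtype=float)   # plant sizes 0, 1, ..., 10 MW
norm = 0.1                        # assumed row size: 0.1 MW = 100 kW

# Each plant contributes round(MW / norm) rows to the weighted data
counts = np.rint(mw / norm).astype(int)

# Fraction of all expanded rows that carry each MW value
share = counts / counts.sum()

for value, frac in zip(mw, share):
    print(f"{value:4.0f} MW -> {frac:.3f} of weighted rows")
```

The share grows linearly with the MW value, which is why the weighted violin should widen steadily toward the top while the unweighted one stays flat.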

# import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# generate fake data for each fuel type and collect the pieces in a list
# (DataFrame.append was removed in pandas 2.0, so the frame is built with pd.concat)
frames = []
for myfuel in ['Biomass','Coal','Hydro','Natural Gas','Oil','Solar','Wind','Other']:
    frames.append(
                   pd.DataFrame({'Fuel': myfuel,
                        # To make it easy to see that the violinplot of dfw (below)
                        # makes sense, here we'll just use a simple range
                        # from 0 to 10
                        'MW': np.arange(11)
                       })
                   )

# combine the per-fuel frames into one dataframe
df = pd.concat(frames, ignore_index=True)

# cast MW to float; I had to recast the data type to avoid an error when using violinplot below
df.MW = df.MW.astype(float)

# collect the expanded pieces in a list, then concatenate once at the end
# (DataFrame.append was removed in pandas 2.0)
pieces = []

# Define the MW size by which to normalize all of the units
# Careful: too big -> loss of fidelity in data for small plants
#          too small -> dfw will need to store an enormous amount of data
norm = 0.1 # this is in MW, so 0.1 MW = 100 kW

# loop through rows of df
for index, row in df.iterrows():

    # calculate the number of norm-sized units within the MW of each plant
    mynum = int(round(row['MW']/norm))

    # a plant that rounds to zero units contributes no rows
    if mynum == 0:
        continue

    # add mynum rows, each with Fuel = row['Fuel'] and MW = row['MW']
    pieces.append(
                   pd.DataFrame({'Fuel': row['Fuel'],
                                 'MW': np.full(mynum, row['MW'], dtype='float')
                                 })
                    )

# build the weighted dataframe in one step
dfw = pd.concat(pieces, ignore_index=True)
# since dfw will be huge, specify data types (in particular, use "category" for Fuel to limit dfw size)
dfw = dfw.astype(dtype={'Fuel':'category', 'MW':'float'})



# Set up figure and axes
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, sharey='row')

# make violinplot
sns.violinplot(x = 'Fuel',
               y = 'MW',
               data=df,
               inner=None,
               scale='area',
               cut=0,
               linewidth=0.5,
               ax = ax1
              )   

# make violinplot
sns.violinplot(x = 'Fuel',
               y = 'MW',
               data=dfw,
               inner=None,
               scale='area',
               cut=0,
               linewidth=0.5,
               ax = ax2
              ) 

# loop through the set of tick labels for both axes
# set tick label size and rotation
for item in (ax1.get_xticklabels() + ax2.get_xticklabels()): 
    item.set_fontsize(8)
    item.set_rotation(30)
    item.set_horizontalalignment('right')

plt.show()

Weighted violin plot
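One possible way to avoid the expanded dataframe altogether is to compute a weighted kernel density estimate directly and draw the violin by hand. The sketch below assumes SciPy is available and uses the `weights` argument of `scipy.stats.gaussian_kde`; it is not how seaborn builds its violins, the sample data is made up, and the bandwidth and width scaling will differ from the seaborn plots above.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

# Hypothetical stand-in for one fuel type's plant sizes (MW)
rnd = np.random.RandomState(7)
mw = rnd.lognormal(mean=3.0, sigma=0.8, size=1000)

# Evaluate both densities on a common grid that stops at the data limits
# (similar in spirit to cut=0 in the seaborn calls)
grid = np.linspace(mw.min(), mw.max(), 200)
by_plant = gaussian_kde(mw)(grid)             # every plant counts once
by_mw = gaussian_kde(mw, weights=mw)(grid)    # every plant counts in proportion to its MW

# Draw each density as a mirrored, filled curve (a hand-made violin)
fig, ax = plt.subplots()
for center, dens in [(0, by_plant), (1, by_mw)]:
    # scale every violin to the same maximum width (not equal area)
    half = 0.4 * dens / dens.max()
    ax.fill_betweenx(grid, center - half, center + half, linewidth=0.5)

ax.set_xticks([0, 1])
ax.set_xticklabels(["by plant", "by MW"])
ax.set_ylabel("MW")
```

Calling `plt.show()` afterwards displays the two violins side by side. For several fuel types, the same two `gaussian_kde` calls could be repeated per group, for example inside a `groupby('Fuel')` loop.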

0 Answers: