Question

I got the following code from a Python machine learning book:

copy_set.plot(kind = "scatter" , x = "longitude" , 
              y = "latitude" , alpha = 0.4 , 
              s = copy_set[ "population" ], 
              label = "population" , figsize=(10,7), 
              c = "median_house_value" , cmap = plt.get_cmap ( "jet" ) )

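median_house_value and population are two columns in the copy_set DataFrame. I don't understand why I have to use copy_set['population'] for the argument s, while for the argument c the column name median_house_value alone is enough. When I try to use only the column name for s, I get this error message: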

TypeError: ufunc 'sqrt' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

Answer 1

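Very good question. df.plot is a wrapper around several matplotlib plotting functions. For kind="scatter", matplotlib's scatter function is called. Most arguments to df.plot() are first converted to the data inside the Series taken from the DataFrame column of that name.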

For example,

df.plot(x="lon", y="lat")

will be converted to

ax.scatter(x=df["lon"].values, y=df["lat"].values)

The remaining arguments are passed on to scatter unchanged, so

df.plot(x="lon", y="lat", some_argument_pandas_doesnt_know=True)

will result in

ax.scatter(x=df["lon"].values, y=df["lat"].values, some_argument_pandas_doesnt_know=True)

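So while pandas converts the arguments x, y and c, it does not do so for s. s is therefore simply passed on to ax.scatter, and the matplotlib function does not know what a string like "population" means.
For arguments that are passed through to the matplotlib function, you need to stick to matplotlib's signature, which for s means supplying the data directly.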
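The pass-through behaviour can be illustrated with a toy model. This is only a sketch: wrapper_scatter and backend_scatter are hypothetical stand-ins for df.plot and ax.scatter, not the real pandas or matplotlib code, and a plain dict plays the role of the DataFrame.

```python
# Toy model of a plotting wrapper (hypothetical; not pandas' actual implementation).

def backend_scatter(x, y, c=None, s=None, **kwargs):
    """Stand-in for ax.scatter: it expects data, not column names."""
    if isinstance(s, str):
        raise TypeError(f"cannot interpret size {s!r} as numbers")
    return {"x": x, "y": y, "c": c, "s": s, **kwargs}

def wrapper_scatter(table, x, y, c=None, **kwargs):
    """Stand-in for df.plot(kind='scatter'): resolves only x, y and c by name."""
    resolved_c = table[c] if isinstance(c, str) and c in table else c
    # Everything else, including s, is forwarded untouched.
    return backend_scatter(table[x], table[y], c=resolved_c, **kwargs)

table = {"lon": [0.1, 0.2], "lat": [0.3, 0.4], "pop": [10, 20], "med": [1.0, 2.0]}

# Works: s is the data itself.
ok = wrapper_scatter(table, x="lon", y="lat", c="med", s=table["pop"])

# Fails: the string "pop" reaches the backend unresolved.
try:
    wrapper_scatter(table, x="lon", y="lat", c="med", s="pop")
    failed = False
except TypeError:
    failed = True
```

In this model c="med" is looked up by the wrapper, while s="pop" arrives at the backend as a bare string, which mirrors the error in the question.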

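Note, however, that matplotlib's scatter itself also accepts strings for its arguments. For that it needs to be told which dataset to take them from, which is done via the data argument. The following therefore works fine and is the matplotlib equivalent of the pandas call in the question: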

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np; np.random.seed(42)

df = pd.DataFrame(np.random.rand(20,2), columns=["lon", "lat"])
df["pop"] = np.random.randint(5,300,size=20)
df["med"] = np.random.rand(20)*1e5

fig, ax = plt.subplots(figsize=(10,7))
sc = ax.scatter(x = "lon", y = "lat", alpha = 0.4, 
                s = "pop", label = "population" , 
                c = "med" , cmap = "jet", data=df)
fig.colorbar(sc, label="med")
ax.set(xlabel="longitude", ylabel="latitude")

plt.show()

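Finally, you might now ask whether supplying the data to matplotlib via the data argument would also work through the pandas wrapper. Unfortunately not, because pandas uses data as an argument internally, so it is not passed on. You therefore have two options: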

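1. Use pandas as in the question, and supply the data itself via the s argument instead of the column name.
2. Use matplotlib as shown above, with column names for all arguments. (Or use the data itself, which is what you most commonly see in matplotlib code.)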
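As a rough side-by-side sketch of the two options, the snippet below builds a small random DataFrame (the column names lon, lat, pop and med are made up for illustration) and checks that both calls end up with the same marker sizes. The Agg backend is an assumption so that the code runs without a display. Newer pandas releases may also accept a column name for s, so it is worth checking the documentation for your version.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame(rng.random((20, 2)), columns=["lon", "lat"])
df["pop"] = rng.integers(5, 300, size=20)
df["med"] = rng.random(20) * 1e5

# Option 1: pandas wrapper. x, y and c are column names; s is the data itself.
ax = df.plot(kind="scatter", x="lon", y="lat", alpha=0.4,
             s=df["pop"], label="population", figsize=(10, 7),
             c="med", cmap="jet")
sizes_pandas = ax.collections[0].get_sizes()

# Option 2: plain matplotlib. Every argument can be a column name because data=df.
fig, ax2 = plt.subplots(figsize=(10, 7))
sc = ax2.scatter(x="lon", y="lat", s="pop", c="med",
                 cmap="jet", alpha=0.4, data=df)
sizes_mpl = sc.get_sizes()
```

Both scatter collections should carry the values of the pop column as marker sizes, so the choice between the two options is mostly a matter of which API you prefer to stay in.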
