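I am trying to visualize data with Seaborn. I created a DataFrame in pyspark using SQLContext, but calling lmplot on it raises an error. I'm not sure what I'm missing. My code is below (I'm using a Jupyter notebook):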
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.read.load('file:///home/cloudera/Downloads/WA_Sales_Products_2012-14.csv',
                          format='com.databricks.spark.csv',
                          header='true', inferSchema='true')
sns.lmplot(x='Quantity', y='Year', data=df)
Error trace:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-86-2a2b43993475> in <module>()
----> 2 sns.lmplot(x='Quantity', y='Year', data=df)
/home/cloudera/anaconda3/lib/python3.5/site-packages/seaborn/regression.py in lmplot(x, y, data, hue, col, row, palette, col_wrap, size, aspect, markers, sharex, sharey, hue_order, col_order, row_order, legend, legend_out, x_estimator, x_bins, x_ci, scatter, fit_reg, ci, n_boot, units, order, logistic, lowess, robust, logx, x_partial, y_partial, truncate, x_jitter, y_jitter, scatter_kws, line_kws)
557 hue_order=hue_order, size=size, aspect=aspect,
558 col_wrap=col_wrap, sharex=sharex, sharey=sharey,
--> 559 legend_out=legend_out)
560
561 # Add the markers here as FacetGrid has figured out how many levels of the
/home/cloudera/anaconda3/lib/python3.5/site-packages/seaborn/axisgrid.py in __init__(self, data, row, col, hue, col_wrap, sharex, sharey, size, aspect, palette, row_order, col_order, hue_order, hue_kws, dropna, legend_out, despine, margin_titles, xlim, ylim, subplot_kws, gridspec_kws)
255 # Make a boolean mask that is True anywhere there is an NA
256 # value in one of the faceting variables, but only if dropna is True
--> 257 none_na = np.zeros(len(data), np.bool)
258 if dropna:
259 row_na = none_na if row is None else data[row].isnull()
TypeError: object of type 'DataFrame' has no len()
Any help or pointers would be appreciated. Thanks in advance :-)
Answer 0 (score: 0)
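sqlContext.read.load(...) returns a Spark DataFrame, not a pandas DataFrame. Seaborn does not appear to convert one into the other automatically: the traceback shows it calling len(data), which a Spark DataFrame does not support.
Try: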
sns.lmplot(x='Quantity', y='Year', data=df.toPandas())
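df.toPandas() returns a pandas DataFrame built from the Spark DataFrame. Note that it collects all rows onto the driver, so it is only suitable when the data fits in memory there.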
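Since the plot only uses two columns, it may be worth selecting them before converting, so less data is pulled to the driver. A rough sketch of that idea follows; the helper name as_pandas is made up for illustration and is not part of Spark or seaborn, and it assumes the input is either a Spark DataFrame (has select and toPandas) or already a pandas DataFrame.

```python
def as_pandas(df, columns=None):
    """Return a pandas DataFrame that seaborn can plot.

    Hypothetical helper (not a Spark or seaborn API). Assumes `df` is either
    a Spark DataFrame or already a pandas DataFrame.
    """
    is_spark = hasattr(df, "toPandas")  # pandas DataFrames have no toPandas
    if columns is not None:
        # Keep only the columns needed for the plot before converting.
        df = df.select(*columns) if is_spark else df[list(columns)]
    # toPandas() collects the (now smaller) Spark DataFrame to the driver.
    return df.toPandas() if is_spark else df


# Usage with the DataFrame from the question (needs a running Spark session):
# plot_df = as_pandas(df, ['Quantity', 'Year'])
# sns.lmplot(x='Quantity', y='Year', data=plot_df)
```

If the file is still too large after selecting columns, sampling on the Spark side before converting is another option, though whether that is acceptable depends on what the plot is meant to show.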