Question

I have a single-column pyspark.sql.dataframe.DataFrame (comments) that looks like this:

+--------------------+
|             comment|
+--------------------+
|                 nan|
|                 nan|
|                 nan|
|So far it has per...|
|I purchased it fo...|
+--------------------+

I mapped a function directly over this DataFrame like this:

tokens_rdd = comments.select('comment').rdd.flatMap(lambda x: word_tokenizer(x))

After that, I converted the RDD back to a DataFrame like this:

tokens = sq.createDataFrame(tokens_rdd,comments.schema)

Then I tried to display the first five rows of the DataFrame, but got the following error:

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 3, localhost, executor driver): java.net.SocketException: Connection reset

I am using PySpark 2.4.0 locally, and the function I mapped is word_tokenizer.

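I tried several ways of converting the RDD to a DataFrame, but none of them let me display the data. Maybe someone can help me figure this out.

Thanks in advance.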

Answer 1

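There are several helper functions for converting between RDDs, DataFrames, and Datasets. I believe the one you tried is meant for converting a local list into a DataFrame.

If you already have an RDD, you should be able to use the .toDF() method.

Assuming nltk.word_tokenize(x) returns a list of individual token strings: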

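# In PySpark, toDF takes a list of column names, and each record must be row-like,
# so wrap every token string in a one-element tuple first.
tokens_df = tokens_rdd.map(lambda token: (token,)).toDF(["tokens"])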

should be all you need.
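As a rough sketch of how the pieces in the question could fit together: the lambda in the question receives a Row rather than a string, and the nan entries may arrive as None or a float, so it can help to unpack the column value and skip non-strings before tokenizing. The definition of word_tokenizer is not shown in the question, so simple_tokenizer below is a hypothetical stand-in, and the helper names tokenize_row and build_tokens_df are illustrative only.

```python
def simple_tokenizer(text):
    """Hypothetical stand-in for the question's word_tokenizer (assumption)."""
    return text.split()


def tokenize_row(row, tokenizer=simple_tokenizer):
    """Return the tokens for one single-column Row (or plain tuple)."""
    comment = row[0]  # a Row wraps the column value; unpack it before tokenizing
    if not isinstance(comment, str):
        return []  # skip None / float NaN instead of failing inside the Python worker
    return tokenizer(comment)


def build_tokens_df(comments, tokenizer=simple_tokenizer):
    """Sketch: `comments` is assumed to be a DataFrame with a 'comment' column."""
    tokens_rdd = (
        comments.select("comment")
        .rdd.flatMap(lambda row: tokenize_row(row, tokenizer))
        .map(lambda token: (token,))  # one-field tuples so toDF can build rows
    )
    # A list of column names gives the new one-column DataFrame its schema.
    return tokens_rdd.toDF(["tokens"])
```

With an existing comments DataFrame, build_tokens_df(comments).show(5) would then display the first five tokens. Reusing comments.schema for the token rows is avoided here because the records are bare token strings until they are wrapped in tuples.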

Cannot collect data after mapping a function over a PySpark RDD
