I am new to Spark and PySpark.
I am reading a small csv file (~40k) into a dataframe.
from pyspark.sql import functions as F
from pyspark.sql import Row
from pyspark.mllib.linalg import Vectors  # Spark 1.6: dense vectors for ML features
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('/tmp/sm.csv')
df = df.withColumn('verified', F.when(df['verified'] == 'Y', 1).otherwise(0))
df2 = df.map(lambda x: Row(label=float(x[0]), features=Vectors.dense(x[1:]))).toDF()
I get some strange errors. It does not happen every time, but it does happen quite often:
>>> df2.show(1)
+--------------------+---------+
| features| label|
+--------------------+---------+
|[0.0,0.0,0.0,0.0,...|4700734.0|
+--------------------+---------+
only showing top 1 row
>>> df2.count()
41999
>>> df2.show(1)
+--------------------+---------+
| features| label|
+--------------------+---------+
|[0.0,0.0,0.0,0.0,...|4700734.0|
+--------------------+---------+
only showing top 1 row
>>> df2.count()
41999
>>> df2.show(1)
Traceback (most recent call last):
File "spark-1.6.1/python/lib/pyspark.zip/pyspark/daemon.py", line 157, in manager
File "spark-1.6.1/python/lib/pyspark.zip/pyspark/daemon.py", line 61, in worker
File "spark-1.6.1/python/lib/pyspark.zip/pyspark/worker.py", line 136, in main
if read_int(infile) == SpecialLengths.END_OF_STREAM:
File "spark-1.6.1/python/lib/pyspark.zip/pyspark/serializers.py", line 545, in read_int
raise EOFError
EOFError
+--------------------+---------+
| features| label|
+--------------------+---------+
|[0.0,0.0,0.0,0.0,...|4700734.0|
+--------------------+---------+
only showing top 1 row
Once the EOFError has been raised, I don't see it again until I do something that requires interacting with the Spark server.
When I call df2.count(), it shows the [Stage xxx] prompt, which is what I mean by going to the Spark server. Anything that triggers that with df2 seems to eventually raise the EOFError again.
It does not seem to happen with df (vs. df2), so it looks like it must be something happening in the df.map() line.
Answer 0 (score: 0)
Could you try doing the map after converting the dataframe to an RDD? You are applying the map function on a dataframe and then creating a dataframe from it again. The syntax would be something like:
df.rdd.map().toDF()
Let me know if it works. Thanks.
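Applied to the asker's snippet, the suggestion would look roughly like this (a sketch only, assuming Spark 1.6, the mllib Vectors import, and that column 0 is the label while the remaining columns are numeric features):
# map over the underlying RDD instead of the DataFrame, then rebuild a DataFrame
from pyspark.sql import Row
from pyspark.mllib.linalg import Vectors
df2 = df.rdd.map(lambda x: Row(label=float(x[0]), features=Vectors.dense(x[1:]))).toDF()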
Answer 1 (score: 0)
I believe you are running Spark 2.x or above.
The code below should create the dataframe from the csv:
df = spark.read.format("csv").option("header", "true").load("csvfile.csv")
Then you can create df2 without Row and toDF().
Let me know whether this works, or whether you are using Spark 1.6... Thanks.
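For the "without Row and toDF()" part, one possible approach (my own sketch, not from the answer) is to use VectorAssembler, assuming the first csv column is the label and the remaining columns are numeric features:
# build features/label entirely with the DataFrame API (no Python map over an RDD)
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import functions as F
df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("csvfile.csv")
assembler = VectorAssembler(inputCols=df.columns[1:], outputCol="features")
df2 = (assembler.transform(df)
         .withColumn("label", F.col(df.columns[0]).cast("double"))
         .select("features", "label"))
Everything stays in the DataFrame API, so no lambda is shipped to the Python workers.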