Processing every sentence in each row of a Spark DataFrame

Time: 2018-03-18 21:38:16

Tags: python apache-spark pyspark spark-dataframe

I have a Spark DataFrame, and I want to process every sentence in each row (convert it to lowercase and remove punctuation).

More specifically:

+---------------------------------+
|text                             |
+---------------------------------+
| [this is text ][i want to split]|
+---------------------------------+

I want to get this DataFrame:

import org.apache.spark.h2o._

import org.apache.spark._
import org.apache.spark.SparkContext._

object App1 extends App{

         val conf = new SparkConf()
         conf.setAppName("Test")
         conf.setMaster("local[1]")
         conf.set("spark.executor.memory","1g");

         val sc = new SparkContext(conf)

         val rawData = sc.textFile("c:\\spark\\data.csv")        
         val data = rawData.map(line => line.split(',').map(_.toDouble))    
         val response: RDD[Int] = data.map(row => row(0).toInt)

         val h2oResponse: H2OFrame = response   // <-- this line throws the error
         sc.stop

}
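
For the cleanup described in the question (lowercasing each sentence and removing punctuation), a minimal sketch in plain Python might look like the following. It assumes each row's text is available as a list of sentence strings, which the question does not confirm. The names clean_sentence and clean_row and the sample row are hypothetical, and only ASCII punctuation is handled.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_sentence(sentence):
    """Lowercase one sentence, drop ASCII punctuation, and collapse whitespace."""
    no_punct = sentence.lower().translate(_PUNCT_TABLE)
    return " ".join(no_punct.split())


def clean_row(sentences):
    """Apply clean_sentence to every sentence of one row.

    Assumption: a row's text is a list of sentence strings.
    """
    return [clean_sentence(s) for s in sentences]


# Hypothetical sample row, loosely modelled on the example in the question.
row = ["This is text. ", "I want, to split!"]
print(clean_row(row))  # ['this is text', 'i want to split']
```

In PySpark, a function like clean_row could presumably be wrapped with pyspark.sql.functions.udf (with an ArrayType(StringType()) return type) and applied through withColumn. That step is not shown or tested here. The built-in lower and regexp_replace column functions may be a faster alternative when the column holds a single string.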

0 answers:

No answers