Best way to convert an online CSV into a DataFrame in Scala

Time: 2017-07-18 19:05:33

Tags: scala apache-spark

I am trying to find the most efficient way to load this online CSV file into a Scala DataFrame.

To save you the download, the CSV file used in the code looks like this:

"Symbol","Name","LastSale","MarketCap","ADR TSO","IPOyear","Sector","Industry","Summary Quote"
"DDD","3D Systems Corporation","18.09","2058834640.41","n/a","n/a","Technology","Computer Software: Prepackaged Software","http://www.nasdaq.com/symbol/ddd"
"MMM","3M Company","211.68","126423673447.68","n/a","n/a","Health Care","Medical/Dental Instruments","http://www.nasdaq.com/symbol/mmm"
....

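Based on my research so far, I start by downloading the CSV and putting it into a ListBuffer (a plain List will not work here because it is immutable):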

import org.apache.spark.SparkContext
import scala.collection.mutable.ListBuffer
import scala.io.Source

// conf is assumed to be a SparkConf defined elsewhere; in spark-shell, sc already exists.
val sc = new SparkContext(conf)

val stockInfoNYSE_ListBuffer = new ListBuffer[String]()

// The URL must stay on one line; a string literal cannot span lines.
val bufferedSource =
  Source.fromURL("http://www.nasdaq.com/screening/companies-by-industry.aspx?exchange=NYSE&render=download")

for (line <- bufferedSource.getLines) {
  // Naive split: a comma inside a quoted field would shift the columns.
  val cols = line.split(",").map(_.trim)

  stockInfoNYSE_ListBuffer += s"${cols(0)},${cols(1)},${cols(2)},${cols(3)},${cols(4)},${cols(5)},${cols(6)},${cols(7)},${cols(8)}"
}
bufferedSource.close()

val stockInfoNYSE_List = stockInfoNYSE_ListBuffer.toList

So now we have a list. You can get at each value like this:

// SYMBOL : stockInfoNYSE_List(1).split(",")(0)
// COMPANY NAME : stockInfoNYSE_List(1).split(",")(1)
// IPOYear : stockInfoNYSE_List(1).split(",")(5)
// Sector : stockInfoNYSE_List(1).split(",")(6)
// Industry : stockInfoNYSE_List(1).split(",")(7)
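
One caveat with the positional split(",") access above: a quoted field that itself contains a comma would be cut in two and shift every later index. A minimal sketch of a quote-aware splitter follows. CsvSplit and splitCsvLine are hypothetical names that do not appear in the original post, and the helper deliberately ignores escaped quotes ("") and fields that span several lines.

```scala
object CsvSplit {
  /** Splits one CSV line on commas that sit outside double quotes.
    * Hypothetical helper: escaped quotes ("") and multi-line fields are not handled.
    */
  def splitCsvLine(line: String): Vector[String] = {
    val fields = Vector.newBuilder[String]
    val current = new StringBuilder
    var inQuotes = false
    for (ch <- line) {
      ch match {
        case '"' =>
          inQuotes = !inQuotes            // toggle quoted state and drop the quote itself
        case ',' if !inQuotes =>
          fields += current.toString.trim // a comma outside quotes ends the field
          current.clear()
        case other =>
          current.append(other)
      }
    }
    fields += current.toString.trim       // last field has no trailing comma
    fields.result()
  }
}
```

It would have to be applied to the raw lines from bufferedSource.getLines, for example CsvSplit.splitCsvLine(line)(1) for the company name, rather than to strings that have already been split and re-joined.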

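This is where I am stuck: how do I turn this into a DataFrame? I suspect I have taken the wrong approach. For a simple test I have not included all of the values yet.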

case class StockMap(Symbol: String, Name: String)
// One hardcoded row: symbol and name taken from the first data line.
val caseClassDS = Seq(StockMap(stockInfoNYSE_List(1).split(",")(0),
  stockInfoNYSE_List(1).split(",")(1))).toDS()

caseClassDS.show()

The problem with the approach above: I can only work out how to add a single row to the Seq by hardcoding it. I want every line in the list.

My second failed attempt:

val sqlContext= new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
val test = stockInfoNYSE_List.toDF

This only gives me an array in which each row is one whole line; I want the values split into separate columns.

Array(["Symbol","Name","LastSale","MarketCap","ADR TSO","IPOyear","Sector","Industry","Summary Quote"], ["DDD","3D Systems Corporation","18.09","2058834640.41","n/a","n/a","Technology","Computer Software: Prepackaged Software","http://www.nasdaq.com/symbol/ddd"], ["MMM","3M Company","211.68","126423673447.68","n/a","n/a","Health Care","Medical/Dental Instruments","http://www.nasdaq.com/symbol/mmm"],....... 
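
Because toDF on a list of strings treats each line as one value, one alternative is to hand the downloaded lines to Spark's own CSV reader and let it do the splitting. The sketch below assumes Spark 2.2 or later, where DataFrameReader.csv accepts a Dataset[String]. The object name, the method name, the app name, and the local master setting are illustrative choices, not part of the original post.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import scala.io.Source

object OnlineCsvToDataFrame {
  /** Parses in-memory CSV lines with Spark's CSV reader (assumes Spark 2.2+). */
  def linesToDataFrame(spark: SparkSession, lines: Seq[String]): DataFrame = {
    import spark.implicits._ // supplies the Encoder[String] for createDataset
    val ds: Dataset[String] = spark.createDataset(lines)
    spark.read
      .option("header", "true") // first line holds the column names
      .csv(ds)
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; in spark-shell a session already exists.
    val spark = SparkSession.builder().appName("online-csv").master("local[*]").getOrCreate()
    val source = Source.fromURL(
      "http://www.nasdaq.com/screening/companies-by-industry.aspx?exchange=NYSE&render=download")
    // Read everything on the driver, then close the connection.
    val lines = try source.getLines().toList finally source.close()
    val df = linesToDataFrame(spark, lines)
    df.printSchema()
    df.show(5)
    spark.stop()
  }
}
```

With this route the quoting is handled by the reader, so no manual replace or split is needed. Every column still arrives as a string unless a schema or the inferSchema option is supplied.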

2 Answers:

Answer 0: (score: 0)

case class TestClass(Symbol: String, Name: String, LastSale: String, MarketCap: String, ADR_TSO: String,
                     IPOyear: String, Sector: String, Industry: String, Summary_Quote: String)

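// Drop the header row; the remaining lines are data.
val stockDF = stockInfoNYSE_ListBuffer.drop(1)

// Strip the double quotes, split on commas, and build one TestClass per line.
// This assumes that no field contains a comma of its own.
val demoDS = stockDF.map { line =>
  val fields = line.replace("\"", "").split(",")
  TestClass(fields(0), fields(1), fields(2), fields(3), fields(4), fields(5), fields(6), fields(7), fields(8))
}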

scala> demoDS.toDS.show

+------+--------------------+--------+---------------+-------------+-------+-----------------+--------------------+--------------------+
|Symbol|                Name|LastSale|      MarketCap|      ADR_TSO|IPOyear|           Sector|            Industry|       Summary_Quote|
+------+--------------------+--------+---------------+-------------+-------+-----------------+--------------------+--------------------+
|   DDD|3D Systems Corpor...|   18.09|  2058834640.41|          n/a|    n/a|       Technology|Computer Software...|http://www.nasdaq...|
|   MMM|          3M Company|  211.68|126423673447.68|          n/a|    n/a|      Health Care|Medical/Dental In...|http://www.nasdaq...|
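
The table above keeps every column as a String, including LastSale and MarketCap. If numeric columns are wanted, one possible approach is to convert them while building the rows and map unparseable values such as "n/a" to None. This is only a sketch: TypedStock, toDoubleOption, and toTyped are hypothetical names, and the case class is trimmed to four columns for brevity.

```scala
import scala.util.Try

object StockParsing {
  // Hypothetical typed variant of TestClass, trimmed to four columns.
  final case class TypedStock(
      symbol: String,
      name: String,
      lastSale: Option[Double],
      marketCap: Option[Double])

  /** Returns None when the text cannot be parsed as a Double, e.g. "n/a". */
  def toDoubleOption(raw: String): Option[Double] =
    Try(raw.trim.toDouble).toOption

  /** Builds a typed row from already-split, unquoted fields. */
  def toTyped(fields: Seq[String]): TypedStock =
    TypedStock(fields(0), fields(1), toDoubleOption(fields(2)), toDoubleOption(fields(3)))
}
```

Inside the map in the answer this would be used as StockParsing.toTyped(fields) in place of the TestClass constructor call.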

Answer 1: (score: 0)

In case anyone is trying to get this example working, here is the complete code using the solution above:

import scala.collection.mutable.ListBuffer
import scala.io.Source

// sc is assumed to be an existing SparkContext (for example the one in spark-shell).
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

val stockInfoNYSE_ListBuffer = new ListBuffer[String]()

val bufferedSource =
  Source.fromURL("http://www.nasdaq.com/screening/companies-by-industry.aspx?exchange=NYSE&render=download")

// Keep the first nine comma-separated columns of every line.
for (line <- bufferedSource.getLines) {
  val cols = line.split(",").map(_.trim)

  stockInfoNYSE_ListBuffer += s"${cols(0)},${cols(1)},${cols(2)},${cols(3)},${cols(4)},${cols(5)},${cols(6)},${cols(7)},${cols(8)}"
}
bufferedSource.close()

case class TestClass(Symbol: String, Name: String, LastSale: String, MarketCap: String, ADR_TSO: String,
                     IPOyear: String, Sector: String, Industry: String, Summary_Quote: String)

// Drop the header row, then turn each remaining line into a TestClass.
val stockDF = stockInfoNYSE_ListBuffer.drop(1)

val demoDS = stockDF.map { line =>
  val fields = line.replace("\"", "").split(",")
  TestClass(fields(0), fields(1), fields(2), fields(3), fields(4), fields(5), fields(6), fields(7), fields(8))
}

demoDS.toDF().show