Apache Spark: read a DataFrame from an HTTP source (e.g. CSV, ...)

Date: 2017-06-26 11:24:22

Tags: java scala hadoop apache-spark apache-spark-sql

I'm having trouble reading an Apache Spark DataFrame from an HTTP source (e.g. a CSV file, ...).

HDFS and local files work fine.

I also managed to read from AWS S3 by starting spark-shell with this command:

spark-shell --packages org.apache.hadoop:hadoop-core:1.2.1

and then updating the Hadoop configuration like this:

val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem") 
hadoopConf.set("fs.s3.awsAccessKeyId", "****") 
hadoopConf.set("fs.s3.awsSecretAccessKey", "****")

IMHO there should be fs.http.impl and fs.https.impl parameters, together with corresponding implementations of org.apache.hadoop.fs.FileSystem, but I haven't found any.
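By analogy with the S3 configuration above, such an implementation would presumably be registered the same way. As a hedged sketch only: the class name below is an assumption (later Hadoop releases are reported to ship an HTTP FileSystem, but it is not guaranteed to be on the classpath of the Spark version used here):

```scala
// Hypothetical: if an HTTP-backed FileSystem implementation were available,
// registering it would mirror the fs.s3.impl setting above.
// The class name is an assumption, not verified for this Spark/Hadoop version.
hadoopConf.set("fs.http.impl", "org.apache.hadoop.fs.http.HttpFileSystem")
```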

I find it hard to believe that HTTP(S) isn't supported, since reading from a URL works out of the box in Pandas and R.

Any idea what I'm missing? BTW, here is the failing code:

val df=spark.read.csv("http://raw.githubusercontent.com/romeokienzler/developerWorks/master/companies.csv")

which produces the following error:


17/06/26 13:21:51 WARN DataSource: Error while looking for metadata directory.
java.io.IOException: No FileSystem for scheme: http
  at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
  at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
  at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
  at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
  at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:372)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:370)
  at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
  at scala.collection.immutable.List.flatMap(List.scala:344)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:370)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
  at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:415)
  at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:352)
  ... 48 more

1 Answer:

Answer 0 (score: 2):

This is a duplicate of:

How to use Spark-Scala to download a CSV file from the web?

Just copying and pasting the answer from there:

// Fetch the file over HTTP on the driver, then parallelize the lines.
// In spark-shell, spark.implicits._ is already in scope; in a standalone
// application, add `import spark.implicits._` so that rdd.toDF resolves.
val content = scala.io.Source.fromURL("http://ichart.finance.yahoo.com/table.csv?s=FB").mkString

val list = content.split("\n").filter(_ != "")

val rdd = sc.parallelize(list)

val df = rdd.toDF
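For a CSV with a header row, the same split step can also recover the column names before building the DataFrame. A minimal plain-Scala sketch (no Spark required; the splitCsv helper is hypothetical and assumes simple comma-separated data with no quoted fields):

```scala
// Hypothetical helper: split fetched CSV text into a header and data rows.
// Assumes a simple format: no quoted fields, no embedded commas or newlines.
def splitCsv(content: String): (Array[String], Array[Array[String]]) = {
  val lines = content.split("\n").map(_.trim).filter(_.nonEmpty)
  val header = lines.head.split(",")
  // split with limit -1 keeps trailing empty fields
  val rows = lines.tail.map(_.split(",", -1))
  (header, rows)
}
```

The recovered header can then be applied as column names, e.g. something like `sc.parallelize(rows).toDF(header: _*)` in spark-shell (shape of the toDF call depends on the row type you build).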