I have a .csv file stored on Hadoop HDFS:
hadoop dfs -ls /afs
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
17/01/12 15:15:49 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 1 item
-rw-r--r-- 2 hduser supergroup 203572404 2017-01-10 12:04 /afs/Accounts.csv
I want to import this file into RStudio using SparkR.
I tried the following commands:
sc<-sparkR.session(master = "spark://MasterNode:7077",appName = "SparkR",sparkHome = "/opt/spark")
sContext<- sparkRSQL.init(sc)
library(data.table)
library(dplyr)
df<- read.df(sContext, "hdfs://MasterNode:54310/afs/Accounts.csv")
The following error occurred:
> df<- read.df(sContext, "hdfs://MasterNode:54310/afs/Accounts.csv")
Error in handleErrors(returnStatus, conn) :
No status is returned. Java SparkR backend might have failed.
In addition: Warning message:
In writeBin(requestMessage, conn) : problem writing to connection
Please help me import the Accounts.csv file into RStudio using SparkR.
Answer 0 (score: 1)
You can read from HDFS with the fread function of the data.table library. You have to specify the path to the hdfs executable on your system. For example, assuming the path to hdfs is /usr/bin/hdfs, you can try something like this:
your_table <- fread("/usr/bin/hdfs dfs -text /afs/Accounts.csv")
If your "Accounts.csv" is a directory, you can also use the wildcard /afs/Accounts.csv/*.
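For instance, a sketch of the wildcard variant, assuming the same /usr/bin/hdfs path as above:
# stream every part file under the directory through hdfs dfs -text and parse with fread
your_table <- fread("/usr/bin/hdfs dfs -text /afs/Accounts.csv/*")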
You can also specify the column classes. For example:
your_table <- fread("/usr/bin/hdfs dfs -text /afs/Accounts.csv", fill = TRUE, header = TRUE,
colClasses = c("numeric", "character", ...))
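If you prefer to stay within SparkR, a minimal sketch assuming a Spark 2.x installation (which your use of sparkR.session suggests) would be to drop the deprecated sparkRSQL.init call and pass the HDFS path straight to read.df; the master, sparkHome, and port below are simply taken from your question:
library(SparkR)
# start (or attach to) a Spark session
sparkR.session(master = "spark://MasterNode:7077", appName = "SparkR", sparkHome = "/opt/spark")
# in Spark 2.x, read.df takes the path directly -- no sqlContext argument
df <- read.df("hdfs://MasterNode:54310/afs/Accounts.csv", source = "csv", header = "true", inferSchema = "true")
head(df)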
I hope this helps.