How to handle dots in column names in SparkR

Time: 2016-08-24 13:53:07

Tags: r apache-spark apache-spark-sql sparkr

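I am trying to load a local CSV file into SparkR, and its column names contain dots. After reading the file, I tried to rename the columns, replacing "." with "_". Even so, I cannot run any operation on the resulting SparkDataFrame (SDF). Here is reproducible code: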

# write the iris dataset to a local CSV file
write.csv(iris, "iris.csv", row.names = FALSE)

# read it back with read.df
iris_sdf <- read.df("iris.csv", "csv", header = "true", inferSchema = "true")

# rename the columns, replacing "." with "_"
names(iris_sdf) <- c("Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width", "Species")

# select the required columns
head(select(iris_sdf, iris_sdf$Sepal_Length, iris_sdf$Sepal_Width))

Running this code gives me the following error:

16/08/24 13:51:24 ERROR RBackendHandler: dfToCols on org.apache.spark.sql.api.r.SQLUtils failed
Error in invokeJava(isStatic = TRUE, className, methodName, ...) : 
  org.apache.spark.sql.AnalysisException: Unable to resolve Sepal.Length given [Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species];
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:133)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:129)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.Iterator$class.foreach(Iterator.scala:893)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
    at scala.collection.IterableLike$cl

What should I do to make this work?

2 answers:

Answer 0: (score: 0)

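When a name contains a dot, the SQL syntax assumes you are referring to a table (the part before the dot is read as a qualifier). If the dot is actually part of a column name, you need to wrap the name in backticks: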
"`your.column`"

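As a minimal sketch of the backtick advice above: the helper `quote_col` is hypothetical (it is not part of SparkR) and only builds the quoted names. The commented-out `selectExpr` call assumes a running SparkR session and the `iris_sdf` from the question with its original dotted column names; whether it resolves correctly may depend on the Spark version, and it has not been verified against the asker's setup.

```r
# Hypothetical helper (not a SparkR function): wrap a column name in
# backticks so Spark SQL treats the dot as part of the name rather than
# as a table/struct qualifier.
quote_col <- function(name) paste0("`", name, "`")

quoted <- quote_col(c("Sepal.Length", "Sepal.Width"))
print(quoted)

# Assumption: a SparkR session is running and iris_sdf still has the
# dotted column names from the CSV header.
# head(selectExpr(iris_sdf, quoted[1], quoted[2]))
```
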
Answer 1: (score: 0)

The only workaround here is to supply a schema to the reader:

schema <- structType(
   structField("SepalLength", "double", FALSE),
   structField("SepalWidth",  "double", FALSE),
   structField("PetalLength", "double", FALSE),
   structField("PetalWidth",  "double", FALSE),
   structField("Species",     "string", FALSE))


head(read.df("iris.csv", "csv", header="true", schema=schema))
##   SepalLength SepalWidth PetalLength PetalWidth Species
## 1         5.1        3.5         1.4        0.2  setosa
## 2         4.9        3.0         1.4        0.2  setosa
## 3         4.7        3.2         1.3        0.2  setosa
## 4         4.6        3.1         1.5        0.2  setosa
## 5         5.0        3.6         1.4        0.2  setosa
## 6         5.4        3.9         1.7        0.4  setosa
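
When the CSV file is produced by your own R code, as in the question, the dotted names can also be avoided before Spark ever sees them. The sketch below is an assumption-laden illustration in plain R, not something taken from the answers: it renames the columns of a local copy of `iris` and writes that copy to a temporary file (the file name `iris_clean.csv` is made up for the example). It does not help when the CSV comes from elsewhere and cannot be rewritten.

```r
# Work on a local copy so the built-in iris dataset stays untouched.
clean <- iris

# Replace every literal "." in the column names with "_".
# fixed = TRUE makes gsub treat "." as a plain character, not a regex.
names(clean) <- gsub(".", "_", names(clean), fixed = TRUE)

# Write to a temporary location (hypothetical file name).
path <- file.path(tempdir(), "iris_clean.csv")
write.csv(clean, path, row.names = FALSE)

# Check the header that a CSV reader would now see.
header <- names(read.csv(path, nrows = 1))
print(header)

# Assumption: with a running SparkR session, the file could then be read
# the same way as in the question:
# iris_sdf <- read.df(path, "csv", header = "true", inferSchema = "true")
```
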