R: read CSV numbers that use a decimal comma, with package sparklyr

Date: 2018-12-27 13:39:05

Tags: r apache-spark sparklyr

I need to read a ".csv" file, in which the numeric values use a decimal comma, with the "sparklyr" library. The idea is to be able to read it directly with "spark_read_csv()".

I am using:

library(sparklyr)
library(dplyr)

f <- data.frame(DNI = c("22-e", "EE-4", "55-W"),
                DD = c("33,2", "33.2", "14,55"),
                CC = c("2", "44,4", "44,9"))

write.csv(f,"aff.csv")

sc <- spark_connect(master = "local", spark_home = "/home/tomas/spark-2.1.0-bin-hadoop2.7/", version = "2.1.0")

df <- spark_read_csv(sc, name = "data", path = "/home/tomas/Documentos/Clusterapp/aff.csv", header = TRUE, delimiter = ",")

tbl <- sdf_copy_to(sc = sc, x =df , overwrite = T)

The problem is that the numbers are read in as factors.

3 Answers:

Answer 0 (score: 2)

To manipulate strings in a Spark data frame you can use the regexp_replace function, as described here:

https://spark.rstudio.com/guides/textmining/

For your problem, the solution is as follows:


Check the results:

tbl <- sdf_copy_to(sc = sc, x =df, overwrite = T)

tbl0 <- tbl %>%
    mutate(DD = regexp_replace(DD, ",", "."),
           CC = regexp_replace(CC, ",", ".")) %>%
    mutate_at(vars(c("DD", "CC")), as.numeric)
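To confirm the conversion worked, you can pull a few rows back into R (a sketch that assumes the live Spark connection and the tbl0 defined above; not part of the original answer):

```r
library(dplyr)

# Bring a small sample back from Spark; DD and CC should
# now print as doubles rather than character columns.
tbl0 %>%
    head(3) %>%
    collect()
```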

Answer 1 (score: 0)

You could replace the "," in the numbers with "." and then convert them to numeric. For example:

df$DD<-as.numeric(gsub(pattern = ",",replacement = ".",x = df$DD))

Does that help?
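The same idea can be wrapped in a small helper and applied to every affected column before copying the data to Spark. A base-R sketch; the helper name to_num and the column selection are illustrative, not from the answer:

```r
# Hypothetical helper: turn decimal-comma strings into numerics
to_num <- function(x) as.numeric(gsub(",", ".", x, fixed = TRUE))

f <- data.frame(DD = c("33,2", "33.2", "14,55"),
                CC = c("2", "44,4", "44,9"),
                stringsAsFactors = FALSE)

# Convert both columns in one pass
f[c("DD", "CC")] <- lapply(f[c("DD", "CC")], to_num)
str(f)  # both columns are now numeric
```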

Answer 2 (score: 0)

If you don't want to replace the "," with ".", maybe you could try spark_read_csv itself.

Check the documentation: use the escape parameter to specify the character you want ignored.

In this case, try:

df <- spark_read_csv(sc, name = "data", path = "/home/tomas/Documentos/Clusterapp/aff.csv", header = TRUE, delimiter = ",", escape = "\,")