How to use the substr() function on a column in SparkR

Time: 2016-02-11 12:09:35

Tags: apache-spark sparkr

How can I use the substr() function on a data frame column in SparkR?
+----------+----------------+-----------+
|   cust_id|  tran_datetime |Total_trans|
+----------+----------------+-----------+
|CQ98901297|2015-06-06 09:00|          1|
|CQ98901297|2015-05-01 09:25|          1|
|CQ98901297|2015-05-02 10:45|          1|
|CQ98901297|2015-05-03 11:01|          1|

I need to cut the time off tran_datetime and keep only the date.

2 answers:

Answer 0: (score: 0)

# Call substr(column, start, stop) inside select() to keep the first 10 characters (the date)
df_new <- select(df, df$cust_id, substr(df$tran_datetime, 1, 10), df$Total_trans)
# The substr() column in df_new gets an auto-generated name, so use rename() to give it the desired name
df_new <- rename(df_new, date = df_new[[2]])

showDF(df_new)

+----------+----------+-----------+
|   cust_id|      date|Total_trans|
+----------+----------+-----------+
|CQ98901297|2015-06-06|          1|
|CQ98901297|2015-05-01|          1|
|CQ98901297|2015-05-02|          1|
|CQ98901297|2015-05-03|          1|
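
To see why positions 1 and 10 are the right bounds, the same call can be tried on a plain R character vector, with no Spark session needed. This is only a sketch: the vector `tran_datetime` and the result `dates` are illustrative names, and it assumes the SparkR column version treats these particular indices the same way, as the output above suggests.

```r
# Sample values copied from the tran_datetime column in the question.
tran_datetime <- c("2015-06-06 09:00", "2015-05-01 09:25",
                   "2015-05-02 10:45", "2015-05-03 11:01")

# Base R substr(x, start, stop) is 1-based and inclusive at both ends,
# so positions 1 to 10 cover the "YYYY-MM-DD" part of each string.
dates <- substr(tran_datetime, 1, 10)
print(dates)
```

Characters 11 to 16 (the space and the "HH:MM" part) are what gets dropped.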

Answer 1: (score: -1)

I think the best solution is to apply strsplit.

x <- data.frame(lin=c('+----------+----------------+-----------+',
                      '|   cust_id|  tran_datetime |Total_trans|',
                      '+----------+----------------+-----------+',
                      '|CQ98901297|2015-06-06 09:00|          1|',
                      '|CQ98901297|2015-05-01 09:25|          1|',
                      '|CQ98901297|2015-05-02 10:45|          1|'),
                id = 1:6,
                stringsAsFactors = FALSE)
# remove the border lines, which start with "+"
x <- x[substr(x$lin, 1, 1) != "+", ]
# split each remaining line into columns on the pipe separator
y <- strsplit(x$lin, split = "\\|")
# trim the whitespace left over from the split
library(stringr)
y <- lapply(y, function(cells) str_trim(cells, "both"))
# [, -1] because the first column is empty
y <- do.call(rbind, y)[, -1]
# the first row holds the column names
colnames(y) <- y[1, ]
y <- data.frame(y[-1, ], stringsAsFactors = FALSE)
y