How to convert GroupedData into a DataFrame in R

Date: 2016-04-05 09:17:28

Tags: r apache-spark dataframe apache-spark-sql sparkr

Suppose I have the following data frame:

AccountId,CloseDate
1,2015-05-07
2,2015-05-09
3,2015-05-01
4,2015-05-07
1,2015-05-09
1,2015-05-12
2,2015-05-12
3,2015-05-01
3,2015-05-01
3,2015-05-02
4,2015-05-17
1,2015-05-12

I want to group it by AccountId and then add another column, date_diff, containing the difference between the current row's CloseDate and the previous row's CloseDate. Note that I want this date_diff computed only across rows that share the same AccountId, so I need to group the data before adding the new column.
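
For reference, the intended result can be sketched in plain (non-Spark) R. This is only an illustration of the per-account difference I have in mind, not the SparkR code I need; the names local_df and date_diff are used just for this sketch:

# Plain-R illustration of the intended result (not SparkR).
local_df <- data.frame(
  AccountId = c(1, 2, 3, 4, 1, 1, 2, 3, 3, 3, 4, 1),
  CloseDate = as.Date(c("2015-05-07", "2015-05-09", "2015-05-01", "2015-05-07",
                        "2015-05-09", "2015-05-12", "2015-05-12", "2015-05-01",
                        "2015-05-01", "2015-05-02", "2015-05-17", "2015-05-12"))
)
# Order within each account, then take the day difference from the previous row
local_df <- local_df[order(local_df$AccountId, local_df$CloseDate), ]
local_df$date_diff <- ave(as.numeric(local_df$CloseDate), local_df$AccountId,
                          FUN = function(x) c(NA, diff(x)))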

Here is the R code I am using:

  df <- read.df(sqlContext, "/home/ubuntu/work/csv/sample.csv",
                source = "com.databricks.spark.csv",
                inferSchema = "true", header = "true")
  df$CloseDate <- to_date(df$CloseDate)
  groupedData <- SparkR::group_by(df, df$AccountId)
  # This is the line that raises the error below: mutate does not accept GroupedData
  SparkR::mutate(groupedData,
                 DiffCloseDt = as.numeric(SparkR::datediff(CloseDate,
                                                           SparkR::lag(CloseDate, 1))))

To add that column I am using mutate, but since group_by returns a GroupedData object I cannot call mutate on it. I get the following error:

 Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function ‘mutate’ for signature ‘"GroupedData"’

So how can I convert the GroupedData into a DataFrame so that I can add the column with mutate?

1 answer:

Answer 0 (score: 3):

What you want cannot be achieved using group_by. As has already been explained on SO several times, group_by on a DataFrame does not physically group the data, and the order of operations applied after group_by is non-deterministic.

To get the desired output you have to use window functions and provide an explicit ordering:

 
# Local data used to build the Spark DataFrame
df <- structure(list(AccountId = c(1L, 2L, 3L, 4L, 1L, 1L, 2L, 3L, 
  3L, 3L, 4L, 1L), CloseDate = structure(c(3L, 4L, 1L, 3L, 4L, 
  5L, 5L, 1L, 1L, 2L, 6L, 5L), .Label = c("2015-05-01", "2015-05-02", 
  "2015-05-07", "2015-05-09", "2015-05-12", "2015-05-17"), class = "factor")), 
  .Names = c("AccountId", "CloseDate"),
  class = "data.frame", row.names = c(NA, -12L))

# Window functions require a HiveContext in Spark 1.x
hiveContext <- sparkRHive.init(sc)
sdf <- createDataFrame(hiveContext, df)
registerTempTable(sdf, "df")

# LAG over a window partitioned by AccountId and ordered by CloseDate
query <- "SELECT *, LAG(CloseDate, 1) OVER (
  PARTITION BY AccountId ORDER BY CloseDate
) AS DateLag FROM df"

dfWithLag <- sql(hiveContext, query)

# Day difference between each CloseDate and the lagged one
dfWithDiff <- withColumn(dfWithLag, "diff",
                         datediff(dfWithLag$CloseDate, dfWithLag$DateLag))
head(dfWithDiff)

##   AccountId  CloseDate    DateLag diff
## 1         1 2015-05-07       <NA>   NA
## 2         1 2015-05-09 2015-05-07    2
## 3         1 2015-05-12 2015-05-09    3
## 4         1 2015-05-12 2015-05-12    0
## 5         2 2015-05-09       <NA>   NA
## 6         2 2015-05-12 2015-05-09    3
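
As a side note, newer SparkR releases (2.x and later) expose window functions directly in R, so the same lag can be computed without writing SQL. The following is only a sketch under that assumption; the API shown (windowPartitionBy, over, lag as a Column function) is not available in the SparkR 1.x setup used above, and in 2.x the Spark DataFrame would be created with createDataFrame(df) after sparkR.session() rather than from a HiveContext:

# Sketch assuming SparkR >= 2.0; sdf is the Spark DataFrame built from df above.
ws <- orderBy(windowPartitionBy("AccountId"), "CloseDate")
withLag <- withColumn(sdf, "DateLag", over(lag(sdf$CloseDate, 1), ws))
withDiff <- withColumn(withLag, "diff",
                       datediff(withLag$CloseDate, withLag$DateLag))
head(arrange(withDiff, "AccountId", "CloseDate"))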