How to add a column to a SparkDataFrame using a set of SQL expressions?

Time: 2016-10-30 14:24:05

Tags: apache-spark sparkr

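I am using SparkR, and I would like to add a column to a SparkDataFrame that is derived by string manipulation from an existing column. Consider the following SparkDataFrame: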

head(df)
  id                                                 address
  1   street_X, postal_code_X, neighborhood_X, county_name_X
  2                            neighborhood_Y, county_name_Y
  3             postal_code_Z, neighborhood_Z, county_name_Z

I need to add a column that contains only the neighborhood. I managed to extract this column into a new SparkDataFrame:

new_df <- selectExpr(df, "SUBSTRING_INDEX(address, ',', -2) AS neighborhood")
new_df <- selectExpr(new_df, "SUBSTRING_INDEX(neighborhood, ',', 1) AS neighborhood")

head(new_df)

neighborhood
neighborhood_X
neighborhood_Y
neighborhood_Z

But how can I add this neighborhood column to the original df (the equivalent of cbind in R)? I looked at withColumn, but did not manage to combine it with selectExpr.

1 answer:

Answer 0: (score: 2)

Try something like this. Simply select the other columns as well, alongside the nested expression:

new_df <- selectExpr(df, "id", "address", 
  "SUBSTRING_INDEX(SUBSTRING_INDEX(address, ',', -2), ',', 1) AS neighborhood")

This might work as well, using "*" to keep all existing columns:

new_df <- selectExpr(df, "*", 
  "SUBSTRING_INDEX(SUBSTRING_INDEX(address, ',', -2), ',', 1) AS neighborhood")
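Since the question mentions withColumn, here is a minimal sketch of that route. It assumes Spark 2.0 or later (for sparkR.session) and uses SparkR's expr() to turn the SQL string into a Column. The sample data is rebuilt locally from the table in the question. The TRIM call is my own addition: the addresses separate fields with ", ", so without it the extracted value keeps a leading space.

```r
library(SparkR)

# Assumption: Spark >= 2.0, where sparkR.session() starts or reuses a session.
sparkR.session()

# Rebuild the sample data from the question as a local data.frame.
local_df <- data.frame(
  id = 1:3,
  address = c("street_X, postal_code_X, neighborhood_X, county_name_X",
              "neighborhood_Y, county_name_Y",
              "postal_code_Z, neighborhood_Z, county_name_Z"),
  stringsAsFactors = FALSE
)
df <- createDataFrame(local_df)

# expr() parses a SQL expression string into a Column, so it can be passed
# to withColumn(). The inner call keeps the last two comma-separated fields,
# the outer call keeps the first of those, and TRIM drops the leading space.
df <- withColumn(
  df, "neighborhood",
  expr("TRIM(SUBSTRING_INDEX(SUBSTRING_INDEX(address, ',', -2), ',', 1))")
)

head(df)
```

The result should match the selectExpr(df, "*", ...) version above. Which one to use is mostly a matter of taste: withColumn reads more naturally when adding a single derived column, while selectExpr is handier when several expressions are added at once.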