Data transformation in Scala / Spark

Time: 2018-03-18 07:45:21

Tags: scala apache-spark

brand,month,price
abc,jan, - \n 
abc,feb, 29  \n
abc,mar, -   \n
abc,apr, 45.23  \n
bb-c,jan, 34  \n
bb-c,feb,-35  \n
bb-c,mar, - \n

sum(price) group by (brand)

Challenges

1) csv file available in xl sheet
2) trim the extra spaces in price
3) replace non-numeric (" -   ") with zero
4) sum the price grouped by brand

- read the csv file into df1
- changed the price data type from string to double; created a registered temp table on df1
- but still facing issues with trimming the values and with replacing the non-numeric entries with zero
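The trimming/replacement step can be tried in isolation on plain Scala strings before involving Spark. A minimal sketch (the character class used for cleaning is one reasonable choice, not the only one):

```scala
// Strip whitespace, letters, '+', '-', and ':' from a price string;
// anything that ends up empty is treated as zero.
def cleanPrice(raw: String): Double = {
  val cleaned = raw.replaceAll("[\\s+a-zA-Z- :]", "")
  if (cleaned.isEmpty) 0.0 else cleaned.toDouble
}

println(cleanPrice(" -   "))    // 0.0
println(cleanPrice(" 45.23  ")) // 45.23
println(cleanPrice("-35"))      // 35.0 (note: the minus sign is stripped too)
```

Note that this pattern also removes a leading minus sign, so a genuinely negative price like "-35" becomes positive; if negatives must be preserved, the "-" needs to be dropped from the character class and the placeholder " - " handled separately.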

Can someone help me solve this problem?

1 answer:

Answer 0: (score: 0)

Theoretical explanation:

Simply use sqlContext to read the csv file, the regexp_replace built-in function to clean the price string (followed by a cast to double), and a groupBy with a sum aggregation; that should give you the output you want.

Programmatic explanation:

//1)csv file available in xl sheet
val df = sqlContext
  .read
  .format("com.databricks.spark.csv")
  .option("header", true)
  .load("path to the csv file")

df.show(false)
  //+-----+-----+------+
  //|brand|month|price |
  //+-----+-----+------+
  //|abc  |jan  | -    |
  //|abc  |feb  | 29   |
  //|abc  |mar  | -    |
  //|abc  |apr  | 45.23|
  //|bb-c |jan  | 34   |
  //|bb-c |feb  |-35   |
  //|bb-c |mar  | -    |
  //+-----+-----+------+  

import org.apache.spark.sql.functions._
//2)trim the extra spaces in price
//3)replace non-numeric(" -   ") with zero
//  (note: this regex also strips the minus sign, so "-35" becomes 35;
//   the empty string then casts to null, which sum() skips like a zero)
df.withColumn("price", regexp_replace(col("price"), "[\\s+a-zA-Z- :]", "").cast("double"))
//4)sum the price group by brand    
    .groupBy("brand")
    .agg(sum("price").as("price_sum"))
    .show(false)
//+-----+-----------------+
//|brand|price_sum        |
//+-----+-----------------+
//|abc  |74.22999999999999|
//|bb-c |69.0             |
//+-----+-----------------+
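The same clean-then-aggregate logic can be sanity-checked without a Spark session by running it over the sample rows with plain Scala collections (a minimal sketch; the cleaning regex is the one from the answer, so the minus sign in "-35" is stripped, matching the 69.0 result):

```scala
// Sample rows from the question: (brand, month, price-as-string)
val rows = Seq(
  ("abc", "jan", " - "), ("abc", "feb", " 29  "), ("abc", "mar", " -   "),
  ("abc", "apr", " 45.23  "), ("bb-c", "jan", " 34  "),
  ("bb-c", "feb", "-35  "), ("bb-c", "mar", " - ")
)

// Clean each price the same way as regexp_replace + cast, treating an
// empty result as zero, then group by brand and sum.
val sums: Map[String, Double] = rows
  .map { case (brand, _, price) =>
    val cleaned = price.replaceAll("[\\s+a-zA-Z- :]", "")
    (brand, if (cleaned.isEmpty) 0.0 else cleaned.toDouble)
  }
  .groupBy(_._1)
  .map { case (brand, pairs) => brand -> pairs.map(_._2).sum }

println(sums("abc"))   // 74.22999999999999
println(sums("bb-c"))  // 69.0
```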

I hope the answer is helpful.