brand,month,price
abc,jan, - \n
abc,feb, 29 \n
abc,mar, - \n
abc,apr, 45.23 \n
bb-c,jan, 34 \n
bb-c,feb,-35 \n
bb-c,mar, - \n
Desired output: sum(price) grouped by brand.
Challenges:
1) the csv file is available in an Excel (xl) sheet
2) trim the extra spaces in price
3) replace the non-numeric values (" - ") with zero
4) sum the price grouped by brand
- read the csv file into df1
- cast the price data type from string to double
- created a registered temp table on df1
- but still facing issues with trimming the spaces and
- replacing the non-numeric values with zero
Can someone help me with this? A rough sketch of what I have so far is below.
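(A minimal sketch of the steps above; it assumes Spark 1.x with sqlContext and the spark-csv package, and the file path and names like df1/prices are placeholders.)
// read the csv file into df1 (path is a placeholder)
val df1 = sqlContext
  .read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("path to the csv file")
// cast price from string to double -- " - " and the padded values become null here,
// which is where I get stuck
val df2 = df1.withColumn("price", df1("price").cast("double"))
// register a temp table on df1 and aggregate
df2.registerTempTable("prices")
sqlContext.sql("select brand, sum(price) as price_sum from prices group by brand").show()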
Answer 0 (score: 0):
Explanation:
Simply read the csv file with sqlContext, use the regexp_replace built-in function to clean up the price string and cast it to double, and then a groupBy with a sum aggregation should give you the desired output.
Programmatically:
//1)csv file available in xl sheet
val df = sqlContext
.read
.format("com.databricks.spark.csv")
  .option("header", "true")
.load("path to the csv file")
df.show(false)
//+-----+-----+------+
//|brand|month|price |
//+-----+-----+------+
//|abc |jan | - |
//|abc |feb | 29 |
//|abc |mar | - |
//|abc |apr | 45.23|
//|bb-c |jan | 34 |
//|bb-c |feb |-35 |
//|bb-c |mar | - |
//+-----+-----+------+
import org.apache.spark.sql.functions._
//2)trim the extra spaces in price
//3)replace non-numeric(" - ") with zero
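//   The character class below strips whitespace, '+', letters, '-', ' ' and ':',
//   so " - " becomes an empty string, which casts to null and is skipped by sum.
//   Note that it also strips the leading minus from "-35", which is why bb-c sums to 69.0 below.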
df.withColumn("price", regexp_replace(col("price"), "[\\s+a-zA-Z- :]", "").cast("double"))
//4)sum the price group by brand
.groupBy("brand")
.agg(sum("price").as("price_sum"))
.show(false)
//+-----+-----------------+
//|brand|price_sum |
//+-----+-----------------+
//|abc |74.22999999999999|
//|bb-c |69.0 |
//+-----+-----------------+
I hope this answer is helpful.
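For reference, on Spark 2.x the same thing can be sketched with SparkSession and the built-in CSV reader; this is only a rough alternative (the spark session, the path and the trim/coalesce cleanup are my assumptions, not part of the original answer), and it keeps the sign of "-35", so bb-c would sum to -1.0 instead of 69.0:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
val spark = SparkSession.builder().appName("brand-price-sum").getOrCreate()
// Spark 2.x ships a CSV reader, so the com.databricks.spark.csv package is not needed
val df = spark.read
  .option("header", "true")
  .csv("path to the csv file")
// trim the spaces, cast to double (invalid values such as " - " become null), then default to 0
df.withColumn("price", coalesce(trim(col("price")).cast("double"), lit(0.0)))
  .groupBy("brand")
  .agg(sum("price").as("price_sum"))
  .show(false)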