My dataframe looks like this:
scala> products_df.show(5)
+--------------------+
| value|
+--------------------+
|1009|45|Diamond F...|
|1010|46|DBX Vecto...|
|1011|46|Old Town ...|
|1012|46|Pelican T...|
|1013|46|Perceptio...|
+--------------------+
I need to split it into individual columns.
I'm using the query below, which works with every other delimiter, but not here:
products_df.selectExpr(
  "cast(split(value, '|')[0] as int) as product_id",
  "cast(split(value, '|')[1] as int) as product_category_id",
  "cast(split(value, '|')[2] as string) as product_name",
  "cast(split(value, '|')[3] as string) as description",
  "cast(split(value, '|')[4] as float) as product_price",
  "cast(split(value, '|')[5] as string) as product_image"
).show
It returns:
+----------+-------------------+------------+-----------+-------------+-------------+
|product_id|product_category_id|product_name|description|product_price|product_image|
+----------+-------------------+------------+-----------+-------------+-------------+
| 1| 0| 0| 9| null| 4|
| 1| 0| 1| 0| null| 4|
| 1| 0| 1| 1| null| 4|
| 1| 0| 1| 2| null| 4|
| 1| 0| 1| 3| null| 4|
| 1| 0| 1| 4| null| 4|
| 1| 0| 1| 5| null| 4|
It works fine when the file is delimited by a comma (,) or a colon (:). Only with the pipe (|) does it return the values above, when it should instead be:
+----------+-------------------+--------------------+-----------+-------------+--------------------+
|product_id|product_category_id|        product_name|description|product_price|       product_image|
+----------+-------------------+--------------------+-----------+-------------+--------------------+
| 1009| 45|Quest Q64 10 FT. ...| | 59.98|http://images.acm...|
| 1010| 46|Under Armour Men'...| | 129.99|http://images.acm...|
| 1011| 47|Under Armour Men'...| | 89.99|http://images.acm...|
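The shape of the bad output is the clue: split's second argument is treated as a Java regular expression, and an unescaped | is the regex alternation operator, which matches the empty string at every position, so each line is split into single characters. A minimal sketch in plain Scala (no Spark needed) reproduces the behavior:

```scala
object PipeSplitDemo extends App {
  val row = "1009|45|Diamond"

  // Unescaped: the regex "|" matches the empty string at every position,
  // so split returns one element per character (pipes included).
  println(row.split("|").take(4).mkString(","))  // 1,0,0,9

  // Escaped: the regex "\|" matches the literal pipe character.
  println(row.split("\\|").mkString(","))        // 1009,45,Diamond
}
```

This is exactly why the first columns come back as 1, 0, 0, 9: those are the first four characters of "1009|45|...".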
Answer 0 (score: 0)
Thanks for the suggestions, everyone. The selectExpr query appeared not to work when the file is pipe-delimited (|) because the pipe was not escaped in the split pattern. An alternative is to use withColumn with an escaped pattern:
val products_df = spark.read.textFile("/user/code/products")
  .withColumn("product_id", split($"value", "\\|")(0).cast("int"))
  .withColumn("product_cat_id", split($"value", "\\|")(1).cast("int"))
  .withColumn("product_name", split($"value", "\\|")(2).cast("string"))
  .withColumn("product_description", split($"value", "\\|")(3).cast("string"))
  .withColumn("product_price", split($"value", "\\|")(4).cast("float"))
  .withColumn("product_image", split($"value", "\\|")(5).cast("string"))
  .select("product_id", "product_cat_id", "product_name",
          "product_description", "product_price", "product_image")
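For what it's worth, the original selectExpr approach should also work once the pipe is escaped inside the SQL expression. Note the double level of escaping: the Scala literal "\\\\|" becomes the SQL string '\\|', which the SQL parser turns into the regex \|, a literal pipe. A hedged sketch, assuming the same products_df:

```scala
// Sketch: escaping the pipe in the SQL expression fixes the original query.
// Scala "\\\\|"  ->  SQL string '\\|'  ->  regex \|  ->  literal pipe.
val fixed = products_df.selectExpr(
  "cast(split(value, '\\\\|')[0] as int) as product_id",
  "cast(split(value, '\\\\|')[1] as int) as product_category_id",
  "split(value, '\\\\|')[2] as product_name",
  "split(value, '\\\\|')[3] as description",
  "cast(split(value, '\\\\|')[4] as float) as product_price",
  "split(value, '\\\\|')[5] as product_image"
)
fixed.show(5)
```

So the choice between selectExpr and withColumn is a matter of style; the escaping is what matters.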
Answer 1 (score: 0)
On Spark 2.4.3, just adding a concise and clear version:
scala> var df =Seq(("1009|45|Diamond F"),("1010|46|DBX Vecto")).toDF("value")
scala> df.show
+-----------------+
| value|
+-----------------+
|1009|45|Diamond F|
|1010|46|DBX Vecto|
+-----------------+
val splitedViewsDF = df
  .withColumn("product_id", split($"value", "\\|").getItem(0))
  .withColumn("product_cat_id", split($"value", "\\|").getItem(1))
  .withColumn("product_name", split($"value", "\\|").getItem(2))
  .drop($"value")
scala> splitedViewsDF.show
+----------+--------------+------------+
|product_id|product_cat_id|product_name|
+----------+--------------+------------+
| 1009| 45| Diamond F|
| 1010| 46| DBX Vecto|
+----------+--------------+------------+
Here you can use getItem to fetch the data. Happy Hadooping!