Spark: splitting a DataFrame on a pipe delimiter does not return the correct values

Asked: 2019-09-10 06:22:53

Tags: dataframe apache-spark split

My DataFrame looks like this:

scala> products_df.show(5)
+--------------------+
|               value|
+--------------------+
|1009|45|Diamond F...|
|1010|46|DBX Vecto...|
|1011|46|Old Town ...|
|1012|46|Pelican T...|
|1013|46|Perceptio...|
+--------------------+

I need to split it into separate columns.

I use the query below, which works with every other delimiter, but not here:

products_df.selectExpr(
  "cast(split(value, '|')[0] as int) as product_id",
  "cast(split(value, '|')[1] as int) as product_category_id",
  "cast(split(value, '|')[2] as string) as product_name",
  "cast(split(value, '|')[3] as string) as description",
  "cast(split(value, '|')[4] as float) as product_price",
  "cast(split(value, '|')[5] as string) as product_image").show

It returns:

|product_id|product_category_id|product_name|description|product_price|product_image|
+----------+-------------------+------------+-----------+-------------+-------------+
|         1|                  0|           0|          9|         null|            4|
|         1|                  0|           1|          0|         null|            4|
|         1|                  0|           1|          1|         null|            4|
|         1|                  0|           1|          2|         null|            4|
|         1|                  0|           1|          3|         null|            4|
|         1|                  0|           1|          4|         null|            4|
|         1|                  0|           1|          5|         null|            4|

It works fine when the file is delimited by a comma (,) or a colon (:); only with the pipe (|) does it return the values above, when the expected output is:

|product_id|product_category_id|        product_name|description|product_price|       product_image|
+----------+-------------------+--------------------+-----------+-------------+--------------------+
|      1009|                 45|Quest Q64 10 FT. ...|           |        59.98|http://images.acm...|
|      1010|                 46|Under Armour Men'...|           |       129.99|http://images.acm...|
|      1011|                 47|Under Armour Men'...|           |        89.99|http://images.acm...|
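The root cause (not stated in the question) is that Spark's `split`, like Java's `String.split`, treats its second argument as a regular expression. A bare `|` is regex alternation between two empty patterns, which matches the empty string between every character, so the value is chopped into single characters; that is why the "columns" above contain individual digits. A plain-Scala sketch of the behaviour:

```scala
object SplitDemo extends App {
  // A bare "|" is regex alternation of two empty patterns,
  // so it matches between every character:
  println("1009|45".split("|").mkString(","))   // 1,0,0,9,|,4,5
  // Escaping the pipe splits on the literal character:
  println("1009|45".split("\\|").mkString(","))  // 1009,45
}
```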

2 Answers:

Answer 0 (score: 0)

Thanks for the suggestions, everyone. It seems selectExpr does not work when the file is pipe (|) delimited, so an alternative approach is to use withColumn.

val products_df = spark.read.textFile("/user/code/products")
  .withColumn("product_id", split($"value", "\\|")(0).cast("int"))
  .withColumn("product_cat_id", split($"value", "\\|")(1).cast("int"))
  .withColumn("product_name", split($"value", "\\|")(2).cast("string"))
  .withColumn("product_description", split($"value", "\\|")(3).cast("string"))
  .withColumn("product_price", split($"value", "\\|")(4).cast("float"))
  .withColumn("product_image", split($"value", "\\|")(5).cast("string"))
  .select("product_id", "product_cat_id", "product_name",
    "product_description", "product_price", "product_image")
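For what it's worth, selectExpr itself is not the problem: since `split` takes a regex, escaping the pipe, or wrapping it in a character class such as `[|]` (which avoids escaping altogether), should make the original query work too. A sketch, not run against a live cluster, reusing the column names from the question:

```scala
// Sketch: the character class [|] matches a literal pipe, no escaping needed.
products_df.selectExpr(
  "cast(split(value, '[|]')[0] as int) as product_id",
  "cast(split(value, '[|]')[1] as int) as product_category_id",
  "cast(split(value, '[|]')[2] as string) as product_name",
  "cast(split(value, '[|]')[3] as string) as description",
  "cast(split(value, '[|]')[4] as float) as product_price",
  "cast(split(value, '[|]')[5] as string) as product_image").show
```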

Answer 1 (score: 0)

For Spark 2.4.3, just adding concise and clear code:

scala> var df =Seq(("1009|45|Diamond F"),("1010|46|DBX Vecto")).toDF("value")

scala> df.show
+-----------------+
|            value|
+-----------------+
|1009|45|Diamond F|
|1010|46|DBX Vecto|
+-----------------+
val splitedViewsDF = df
  .withColumn("product_id", split($"value", "\\|").getItem(0))
  .withColumn("product_cat_id", split($"value", "\\|").getItem(1))
  .withColumn("product_name", split($"value", "\\|").getItem(2))
  .drop($"value")

scala> splitedViewsDF.show
+----------+--------------+------------+
|product_id|product_cat_id|product_name|
+----------+--------------+------------+
|      1009|            45|   Diamond F|
|      1010|            46|   DBX Vecto|
+----------+--------------+------------+

Here you can use getItem to fetch the data. Happy Hadooping!
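If typed columns are wanted as well, getItem composes with cast. A hypothetical extension of the answer above (column names borrowed from the question; not run against a live cluster):

```scala
// Sketch: chain .cast onto .getItem to get typed columns in one pass.
val typedDF = df
  .withColumn("product_id", split($"value", "\\|").getItem(0).cast("int"))
  .withColumn("product_cat_id", split($"value", "\\|").getItem(1).cast("int"))
  .withColumn("product_name", split($"value", "\\|").getItem(2))
  .drop("value")
```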