How to clean a CSV file so that text inside double quotes is read as a single column

Asked: 2018-01-03 22:35:27

Tags: bash scala csv apache-spark

I am working with the Chicago crimes dataset using Scala and Apache Spark.

Several rows contain multiple comma-separated values enclosed in double quotes. Is there a way to clean the data so that the text inside the double quotes is read as a single column?

A sample row is below; the column in bold is the one I want read as a single column:

10366565,HZ102660,01/03/2016 01:50:00 PM,020XX S WABASH AVE,1310,CRIMINAL DAMAGE,TO PROPERTY,**"SCHOOL, PRIVATE, BUILDING"**,false,false,0131,001,3,33,14,1177070,1890608,2016,01/10/2016 08:46:55 AM,41.855167994,-87.625552607,"(41.855167994, -87.625552607)"

The desired output would look like the following, so that the quoted text is read as a single string by replacing the embedded commas:

10366565,HZ102660,01/03/2016 01:50:00 PM,020XX S WABASH AVE,1310,CRIMINAL DAMAGE,TO PROPERTY,**"SCHOOL|PRIVATE|BUILDING"**,false,false,0131,001,3,33,14,1177070,1890608,2016,01/10/2016 08:46:55 AM,41.855167994,-87.625552607,**"(41.855167994|-87.625552607)"**

Is there a way to do this in Scala, or with a shell script, to convert it into a new file?

1 Answer:

Answer 0 (score: 0)

By default, Spark reads a quoted string in a CSV file as a single column, whether or not it contains commas (the quote character defaults to " and can be changed with .option("quote", ...)). So the quoted content can simply be processed after loading the file into a DataFrame:

Sample CSV data:

10366565,01/03/2016 01:50:00 PM,"SCHOOL, PRIVATE, BUILDING"
10366700,01/04/2016 12:30:00 PM,"SCHOOL, PRIVATE, BUILDING"

Sample code:

val df = spark.read.csv("/path/to/csvfile")
df.show(truncate = false)

+--------+----------------------+-------------------------+
|_c0     |_c1                   |_c2                      |
+--------+----------------------+-------------------------+
|10366565|01/03/2016 01:50:00 PM|SCHOOL, PRIVATE, BUILDING|
|10366700|01/04/2016 12:30:00 PM|SCHOOL, PRIVATE, BUILDING|
+--------+----------------------+-------------------------+

import org.apache.spark.sql.functions.udf
import spark.implicits._  // enables the $"colname" column syntax

// A UDF that replaces each comma plus any trailing whitespace (",\s*") with "|"
def commaToPipe = udf( (s: String) =>
  """,\s*""".r.replaceAllIn(s, "|")
)

df.select($"_c0", commaToPipe($"_c2")).show(truncate=false)
+--------+-----------------------+
|_c0     |UDF(_c2)               |
+--------+-----------------------+
|10366565|SCHOOL|PRIVATE|BUILDING|
|10366700|SCHOOL|PRIVATE|BUILDING|
+--------+-----------------------+
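As a side note, the generated column header UDF(_c2) can be given a readable name with an alias. A small variant of the call above (the column name location_desc is made up for illustration):

df.select($"_c0", commaToPipe($"_c2").alias("location_desc")).show(truncate = false)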

[UPDATE]

As a commenter pointed out, using regexp_replace eliminates the need for a UDF:

df.select($"_c0", regexp_replace($"_c2", """,\s*""", "|"))