I am working with a dataset of Chicago crime data, using Scala and Apache Spark.
Several rows contain multiple values separated by commas and wrapped in double quotes. Is there a way to clean the data so that the text inside the double quotes is read as a single column?
The text is below; the bold column is the one I want read as a single column:
10366565,HZ102660,01/03/2016 01:50:00 PM,020XX S WABASH AVE,1310,CRIMINAL DAMAGE,TO PROPERTY,**"SCHOOL, PRIVATE, BUILDING"**,false,false,0131,001,3,33,14,1177070,1890608,2016,01/10/2016 08:46:55 AM,41.855167994,-87.625552607,"(41.855167994, -87.625552607)"
The desired output would look like the following, so that the quoted text is read as a single string by replacing the commas:
10366565,HZ102660,01/03/2016 01:50:00 PM,020XX S WABASH AVE,1310,CRIMINAL DAMAGE,TO PROPERTY,**"SCHOOL|PRIVATE|BUILDING"**,false,false,0131,001,3,33,14,1177070,1890608,2016,01/10/2016 08:46:55 AM,41.855167994,-87.625552607,**"(41.855167994|-87.625552607)"**
Is there a way to do this in Scala, or to convert the data into a new file with a shell script?
Answer 0 (score: 0)
By default, Spark reads a quoted string in a CSV file, with or without embedded commas, as a single column, so you can process the quoted content after loading it into a DataFrame:
Sample CSV data:
10366565,01/03/2016 01:50:00 PM,"SCHOOL, PRIVATE, BUILDING"
10366700,01/04/2016 12:30:00 PM,"SCHOOL, PRIVATE, BUILDING"
Sample code:
// spark is the active SparkSession (e.g. in spark-shell)
val df = spark.read.csv("/path/to/csvfile")
df.show(truncate = false)
+--------+----------------------+-------------------------+
|_c0 |_c1 |_c2 |
+--------+----------------------+-------------------------+
|10366565|01/03/2016 01:50:00 PM|SCHOOL, PRIVATE, BUILDING|
|10366700|01/04/2016 12:30:00 PM|SCHOOL, PRIVATE, BUILDING|
+--------+----------------------+-------------------------+
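If the parser needs explicit guidance, for example when a quoted field contains doubled quotes, the CSV reader takes options. A minimal sketch, using Spark's built-in "quote" and "escape" reader options and a placeholder path:
// "quote" and "escape" are standard options of Spark's CSV reader;
// setting escape to '"' handles RFC 4180-style doubled quotes
// inside a quoted field. The path is a placeholder.
val df = spark.read
  .option("quote", "\"")
  .option("escape", "\"")
  .csv("/path/to/csvfile")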
import org.apache.spark.sql.functions.udf
import spark.implicits._ // enables the $"colName" column syntax

// A UDF that replaces ",\s*" (a comma plus any trailing whitespace) with "|"
val commaToPipe = udf( (s: String) =>
  """,\s*""".r.replaceAllIn(s, "|")
)
df.select($"_c0", commaToPipe($"_c2")).show(truncate=false)
+--------+-----------------------+
|_c0 |UDF(_c2) |
+--------+-----------------------+
|10366565|SCHOOL|PRIVATE|BUILDING|
|10366700|SCHOOL|PRIVATE|BUILDING|
+--------+-----------------------+
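Since the question also asks about producing a new file, the cleaned result can be written back out. A minimal sketch against the two-column sample above; the output path is a placeholder, and Spark writes a directory of part files rather than a single CSV:
// Keep the other columns as-is, rename the cleaned column,
// and write a new CSV dataset. The target directory must not exist yet.
df.select($"_c0", $"_c1", commaToPipe($"_c2").as("_c2"))
  .write
  .csv("/path/to/output")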
[UPDATE]
As a commenter pointed out, using regexp_replace removes the need for a UDF:
df.select($"_c0", regexp_replace($"_c2", """,\s*""", "|"))