Question

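My source files are unloaded from an Amazon Redshift database. I extracted the data with the UNLOAD command. One column in my data contains free-form text with Windows line breaks (\r\n) and double quotes (") as well.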

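But Redshift only offers the ADDQUOTES option, with no way to choose the quote character, and the same is true of ESCAPE. Its implementation adds an escape character (\) before each of the following characters:

Linefeed: \n
Carriage return: \r
The delimiter character specified for the unloaded data.
The escape character: \
A quote character: " or ' (if both ESCAPE and ADDQUOTES are specified in the UNLOAD command).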

More info: https://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html

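So the unloaded data has an escape character before each character of a Windows line break, like "\\r\\n" (one backslash before \r and another before \n).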
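To make the escaping rules listed above concrete, here is a minimal Python sketch that mimics them on a single field. The function name `redshift_escape` and its parameter are hypothetical, and this is only an emulation based on the documentation quoted above, not Redshift's actual implementation.

```python
def redshift_escape(value, delimiter=","):
    """Roughly emulate UNLOAD ... ESCAPE ADDQUOTES for one field (illustrative only)."""
    # Characters that, per the documentation, get a leading backslash.
    special = {"\n", "\r", "\\", '"', "'", delimiter}
    escaped = "".join("\\" + ch if ch in special else ch for ch in value)
    # ADDQUOTES wraps the field in double quotes.
    return '"' + escaped + '"'


field = 'this is \r\n li"ne2'
print(repr(redshift_escape(field)))
```

The printed value shows a backslash in front of \r, \n and the embedded quote, which is the shape of the sample records further down.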

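When I try to read this file using spark.read.csv() with the escape='\\' option, it does not remove the escape (\) characters added in front of \r and \n.

As I understand it, Spark only considers the escape character when the chosen quote character appears as part of the quoted data string.

I can remove the escapes after reading the file into a dataframe. But is there a way to get rid of the additional escape (\) characters in the data while reading it into the dataframe?

Appreciate your help!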

Sample records (with escape characters before \r\n):

1,"this is \^M\
 line1"
2,"this is \^M\
 li\"ne2"
3,"this is \^M\
 line3"

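This is how the Redshift unload inserts escape characters: in front of the quote character when it is part of the data, and before \r and \n respectively.

When I read this file into a dataframe, Spark correctly removes the escape characters before \n and the quote ("), but keeps the one before \r.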

>>> df2 = spark.read.csv("file:///tmp/sample_modified.csv",header=False,quote='"',sep=',',escape='\\',multiLine=True,inferSchema=False)
>>> df2.show(5,False)
+---+-------------------+
|_c0|_c1                |
+---+-------------------+
|1  |this is \
 line1 |
|2  |this is \
 li"ne2|
|3  |this is \
 line3 |
+---+-------------------+

Expected result (without the escape character "\"):

+---+----------------+
|_c0|_c1             |
+---+----------------+
|1  |this is 
 line1|
|2  |this is 
li"ne2|
|3  |this is 
line3 |
+---+----------------+
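One possible way to reach the expected result above is to normalize the escapes before Spark parses the file. The sketch below is plain Python with a hypothetical helper, `normalize_redshift_escapes`: it turns \" into a doubled quote ("") and drops the backslash in front of any other character, then checks the result with Python's own csv module. It assumes the data fits in memory as a string and that every backslash in the file is an escape added by UNLOAD. Whether Spark then reads the cleaned file as expected (for example with escape='"') is an assumption that has not been verified here.

```python
import csv
import io
import re


def normalize_redshift_escapes(text, quote='"'):
    """Rewrite backslash escapes into plain CSV with doubled quotes."""
    def replace(match):
        ch = match.group(1)
        # An escaped quote becomes a doubled quote; any other escaped
        # character (CR, LF, the delimiter, a backslash) is kept
        # without its leading backslash.
        return quote * 2 if ch == quote else ch

    # DOTALL lets "." also match the line feed that follows a backslash.
    return re.sub(r"\\(.)", replace, text, flags=re.DOTALL)


# Two sample records in the unloaded format (backslash before CR, LF and ").
raw = '1,"this is \\\r\\\n line1"\n2,"this is \\\r\\\n li\\"ne2"\n'
cleaned = normalize_redshift_escapes(raw)

# newline="" keeps the CR LF inside quoted fields untouched while parsing.
rows = list(csv.reader(io.StringIO(cleaned, newline="")))
print(rows)
```

Doubling the quote instead of simply deleting the backslash matters: a bare " inside a quoted field would otherwise end the field early.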

PS - Since this looks like a limitation, I have opened a JIRA issue in the Apache Spark project: https://issues.apache.org/jira/browse/SPARK-26786

Answer 1

Try this.

This is how the file looks in cygwin:

$ cat -vT vishsnu.csv
"ID","Desc"
1001,"this ^M
 is line1"
1002,"this ^M
 is line2"
1003,"this ^M
 is line3"
$

Spark code:

val df = spark.read.format("csv")
  .option("wholeFile", "true")   // older name of the multiLine option
  .option("multiLine", "true")
  .option("inferSchema", "true")
  .option("header", "true")
  // .option("escape", """\""")  // deliberately left commented out
  .load("in_201901/vishsnu.csv")

df.show(false)
df.select("desc").show(false)
println("Count of dataframe records " + df.count)

Result:

+----+---------------+
|ID  |Desc           |
+----+---------------+
|1001|this 
 is line1|
|1002|this 
 is line2|
|1003|this 
 is line3|
+----+---------------+

+---------------+
|desc           |
+---------------+
|this 
 is line1|
|this 
 is line2|
|this 
 is line3|
+---------------+

Count of dataframe records 3

The regexp_replace function did not help to remove the \r\n characters, but the translate function did. See below:

  df.withColumn("desc2",translate(translate('desc,"\r",""),"\n", "")).select('id,'desc2).show(false)

Result:

+----+--------------+
|id  |desc2         |
+----+--------------+
|1001|this  is line1|
|1002|this  is line2|
|1003|this  is line3|
+----+--------------+

Answer 2

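The code below works fine.

from pyspark.sql.functions import col, regexp_replace

# In every column, replace an escaped line break (\ before \r and \ before \n) with a plain \r\n.
df=df2.select(*(regexp_replace(col(c),"\\\\\r\\\\\n","\r\n").alias(c) for c in df2.columns))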

>>> df.show()
+---+-----------------+
|_c0|              _c1|
+---+-----------------+
|  1| this is line1   |
|  2| this is li"ne2  |
|  3| this is li\ne3  |
+---+-----------------+
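The pattern above expects a backslash before both \r and \n. Since the question reports that Spark already strips the one before \n when the escape option is set, a pattern that makes the second backslash optional may be more robust. The sketch below checks such a pattern on plain strings with Python's re module; the helper name `strip_crlf_escapes` is hypothetical, and the assumption that Spark's regexp_replace (which takes a Java regex) treats this simple pattern the same way has not been verified here.

```python
import re

# Backslash, CR, then an optional backslash before LF.
PATTERN = r"\\\r\\?\n"


def strip_crlf_escapes(value):
    """Replace an escaped Windows line break with a plain CR LF."""
    return re.sub(PATTERN, "\r\n", value)


both_escaped = "this is \\\r\\\n line1"   # as written by UNLOAD
only_cr_escaped = "this is \\\r\n line1"  # as reported after spark.read.csv
print(repr(strip_crlf_escapes(both_escaped)))
print(repr(strip_crlf_escapes(only_cr_escaped)))
```

Both inputs come out with the same plain line break. In PySpark, the equivalent call would presumably be regexp_replace(col(c), "\\\\\r\\\\?\n", "\r\n").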

Handling escapes for \r\n in Spark CSV
