How to remove double quotes from a pyspark DataFrame column using a regular expression

Time: 2021-06-20 14:03:33

Tags: json dataframe apache-spark pyspark

I am creating a DataFrame (df) from a text file that contains JSON data. After creation, the DataFrame looks like this:

+------------------------------------------------------------------------------------+
|data                                                                                |
+------------------------------------------------------------------------------------+
|"{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PARC PARQUE","State":"PR"}"        |
|"{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PASEO COSTA DEL SUR","State":"PR"}"|
+------------------------------------------------------------------------------------+

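I want to remove the double quotes at the beginning and end of the values in the data column, so the final DataFrame should look like this: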

+------------------------------------------------------------------------------------+
|data                                                                                |
+------------------------------------------------------------------------------------+
|{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PARC PARQUE","State":"PR"}          |
|{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PASEO COSTA DEL SUR","State":"PR"}  |
+------------------------------------------------------------------------------------+

Below is the code I wrote to remove the double quote from the beginning:

df = df.withColumn('data1', F.regexp_replace("data",'^\"{\"','{\"'))

But I get this error:

^"{" ^ at java.util.regex.Pattern.error(Pattern.java:1957)

Can you help me solve this?

1 Answer:

Answer 0: (score: 2)

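You just need to tweak your regex slightly. The quotes do not need to be escaped, but the curly brace does, because `{` is a special character in Java regular expressions: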

df2 = df.withColumn('data1', F.regexp_replace("data", r'^"\{"', '{"'))

df2.show(truncate=False)
+------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------+
|data                                                                                |data1                                                                              |
+------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------+
|"{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PARC PARQUE","State":"PR"}"        |{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PARC PARQUE","State":"PR"}"        |
|"{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PASEO COSTA DEL SUR","State":"PR"}"|{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PASEO COSTA DEL SUR","State":"PR"}"|
+------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------+
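
Note that the output above still has the trailing quote in data1, while the question asks for both ends to be cleaned. One possible way to handle both in a single pass is sketched below. It assumes every value is wrapped in exactly one double quote on each side with no surrounding whitespace. The names EDGE_QUOTES and strip_edge_quotes are made up for this illustration and are not part of PySpark. The pattern contains no curly brace, so there is nothing to escape.

```python
import re

# Matches a double quote at the very start OR the very end of the string.
# No curly braces appear in the pattern, so no escaping is needed.
EDGE_QUOTES = r'^"|"$'


def strip_edge_quotes(df, column="data"):
    """Hypothetical helper: return df with the wrapping quotes removed from `column`."""
    # Imported here so the pattern check below can run without pyspark installed.
    from pyspark.sql import functions as F

    return df.withColumn(column, F.regexp_replace(column, EDGE_QUOTES, ""))


# Local sanity check of the pattern using Python's re module (not Spark itself).
sample = '"{"Zipcode":704,"State":"PR"}"'
cleaned = re.sub(EDGE_QUOTES, "", sample)
print(cleaned)
```

With the DataFrame from the question, this would be called as df2 = strip_edge_quotes(df, "data"). The quotes inside the JSON are untouched because the pattern is anchored to the start and end of the value.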