How to remove a soft hyphen or em dash from a pyspark dataframe column

Time: 2020-08-19 07:18:02

Tags: python apache-spark pyspark apache-spark-sql

I am unable to split the Experience_datesEmployeed column on the em dash. How can I split the string, or how can I remove the em dash from the column values?

I tried reading the file with UTF-8 encoding:

df_final=spark.read.options(header="True",inferSchema='True',delimiter=',').option("encoding", "UTF-8").csv("/path/csv")

I also tried splitting on the Unicode code points, e.g. 8212, 8211, 2014:

from pyspark.sql.functions import split

df_final.withColumn('Splitted', split(df_final['Experience_datesEmployeed'], u'\u2014')[0]).show()
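The dates in the sample rows below appear to be separated by an en dash (U+2013, "–") rather than an em dash (U+2014), which would explain why splitting on u'\u2014' leaves the value unsplit. A minimal sketch that splits on any of the common dash variants (hyphen-minus, en dash, em dash) could look like the following; df_final and the column name come from the snippets above, while start_date and end_date are just illustrative names.

from pyspark.sql.functions import col, split

# Regex matching a hyphen-minus, en dash (U+2013) or em dash (U+2014),
# together with any surrounding whitespace.
dash_pattern = "\\s*[\u2013\u2014-]\\s*"

df_split = (
    df_final
    .withColumn("parts", split(col("Experience_datesEmployeed"), dash_pattern))
    .withColumn("start_date", col("parts").getItem(0))
    .withColumn("end_date", col("parts").getItem(1))
    .drop("parts")
)
df_split.select("fullName", "start_date", "end_date").show(truncate=False)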

Sample CSV file

fullName,Experience_datesEmployeed,Experience_expcompany,Experience_expduraation, Experience_position
David,Feb 1999 – Sep 2001, Foothill,2 yrs 8 mos, Marketing Assoicate
David,1994 – 1997, abc,3 yrs,Senior Auditor
David,Jun 2020 – Present,   Fellows INC,3 mos,Director Board
David,2017 – Jun 2019,     Fellows INC ,2 yrs,Fellow - Class 22
David,Sep 2001 – Present, The John D.,19 yrs, Manager
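If the goal is only to remove the dash from the column values rather than split on it, regexp_replace with the same character class is one option. A minimal sketch under the same assumptions (en dash, em dash, or a plain hyphen as the separator):

from pyspark.sql.functions import col, regexp_replace

# Replace any dash variant plus surrounding spaces with a single space,
# e.g. "Feb 1999 – Sep 2001" -> "Feb 1999 Sep 2001".
df_clean = df_final.withColumn(
    "Experience_datesEmployeed",
    regexp_replace(col("Experience_datesEmployeed"), "\\s*[\u2013\u2014-]\\s*", " ")
)
df_clean.show(truncate=False)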

0 Answers:

No answers yet