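I am trying to split strings in a PySpark DataFrame column. The names and titles are separated by different delimiters, and the format varies from row to row. For example: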
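+---+------------------------------------------------------------------------------------+
| id| Text                                                                               |
+---+------------------------------------------------------------------------------------+
|  0| first name last name (title), first name last name (title), and first name (title) |
|  1| title: first name last name title: first name last name                            |
|  2| first name last name: title. first name last name: title.                          |
+---+------------------------------------------------------------------------------------+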
How can I extract just the names into a column?
Expected output:
+---+----------------------+
| id| Text                 |
+---+----------------------+
|  0| first name last name |
|  1| first name last name |
|  2| first name last name |
+---+----------------------+
I tried the following, but it did not work:
test = table.withColumn('column_new', f.split(f.regexp_replace('column', '[(:;)]', r"$1,"), ","))
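One thing that stands out in the attempt above is that the pattern `[(:;)]` contains no capture group, so `$1` in the replacement has nothing to refer to. To make the goal concrete, here is a rough sketch of the per-row logic in plain Python rather than Spark. The function name `extract_names` and the `known_titles` parameter are hypothetical, and the sketch assumes the three layouts in the sample table are the only ones. For the second layout (title before name, no delimiter between entries) it further assumes the set of possible titles is known up front, since nothing else marks where a name ends.

```python
import re


def extract_names(text, known_titles=()):
    """Return the names found in one Text value (a sketch, not a general parser).

    known_titles is a hypothetical parameter: the title strings to look for
    when the layout is "title: name title: name".
    """
    if text is None:
        return []

    # Pattern matching any known title followed by a colon, or None if no titles given.
    title_prefix = None
    if known_titles:
        alternatives = "|".join(re.escape(t) for t in known_titles)
        title_prefix = re.compile(r"\b(?:" + alternatives + r")\s*:")

    if "(" in text:
        # Layout "name (title), name (title), and name (title)":
        # drop the parenthesised titles, split on commas, strip a leading "and".
        without_titles = re.sub(r"\s*\([^)]*\)", "", text)
        parts = [re.sub(r"^\s*and\s+", "", p) for p in without_titles.split(",")]
    elif title_prefix is not None and title_prefix.search(text):
        # Layout "title: name title: name": split on each "<known title>:".
        parts = title_prefix.split(text)
    else:
        # Layout "name: title. name: title.": one entry per sentence,
        # the name is whatever precedes the colon.
        parts = [entry.split(":")[0] for entry in text.split(".")]

    # Trim whitespace and discard empty fragments left over from splitting.
    return [p.strip() for p in parts if p.strip()]
```

If this matches the intent, one option would be to wrap it as a UDF, for example `f.udf(extract_names, ArrayType(StringType()))` applied to the `Text` column (assuming the column is really named `Text` as in the sample table, not `column`). A Python UDF is usually slower than built-in column functions, so this is only a starting point.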
Thanks in advance.