I am trying to remove all special characters from all the columns. I am using the following commands:
df_spark = spark_df.select([F.col(col).alias(col.replace(' ', '_')) for col in spark_df.columns])
df_spark1 = df_spark.select([F.col(col).alias(col.replace('%', '_')) for col in df_spark.columns])
df_spark = df_spark1.select([F.col(col).alias(col.replace(',', '_')) for col in df_spark1.columns])
df_spark1 = df_spark.select([F.col(col).alias(col.replace('(', '_')) for col in df_spark.columns])
df_spark2 = df_spark1.select([F.col(col).alias(col.replace(')', '_')) for col in df_spark1.columns])
Is there a simpler way to replace all the special characters (not just the five above) in a single command? I am using PySpark on Databricks.
Thanks!
Answer 0 (score: 0)
Use Python's re (regular expression) module together with a list comprehension.
Example:
df=spark.createDataFrame([('a b','ac','ac','ac','ab')],["i d","id,","i(d","i)k","i%j"])
df.columns
#['i d', 'id,', 'i(d', 'i)k', 'i%j']
import re
# Replace all the special characters using a list comprehension.
# Note: inside a character class, '|' is literal, so the original '[\)|\(|\s|,|%]' would also strip pipes.
[re.sub(r'[()\s,%]', '', x) for x in df.columns]
#['id', 'id', 'id', 'ik', 'ij']
df.toDF(*[re.sub(r'[()\s,%]', '', x) for x in df.columns])
#DataFrame[id: string, id: string, id: string, ik: string, ij: string]
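Note that stripping the characters outright can leave duplicate column names, as above, where three columns end up named id and become ambiguous to reference later; replacing with an underscore, as in the answers below, avoids most of these collisions.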
Answer 1 (score: 0)
You can remove any character other than letters and digits (this particular pattern also keeps $):
from pyspark.sql import functions as F
import re
df = df.select([F.col(col).alias(re.sub("[^0-9a-zA-Z$]+", "", col)) for col in df.columns])
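For context, a minimal runnable sketch of that one-liner, with made-up column names for illustration:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
import re

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([('a', 'b')], ['col 1%', '(col,2)'])
# Rename every column by dropping characters outside [0-9a-zA-Z$].
df = df.select([F.col(col).alias(re.sub("[^0-9a-zA-Z$]+", "", col)) for col in df.columns])
df.columns
# ['col1', 'col2']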
Answer 2 (score: 0)
Remove punctuation and replace whitespace with an underscore _ using re.sub(r'[^\w]', '_', c).
Test result:
from pyspark.sql import SparkSession
import re

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 2, 3, 4)], [' 1', '%2', ',3', '(4)'])
df = df.toDF(*[re.sub(r'[^\w]', '_', c) for c in df.columns])
df.show()
# +---+---+---+---+
# | _1| _2| _3|_4_|
# +---+---+---+---+
# |  1|  2|  3|  4|
# +---+---+---+---+
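Side note: [^\w] matches anything outside the word-character set [A-Za-z0-9_], so digits and any underscores already present in the names are preserved.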
Answer 3 (score: -1)
Perhaps this is helpful:
// [^0-9a-zA-Z]+ => this will replace every run of special characters with "_"
import org.apache.spark.sql.functions.{lit, regexp_replace}
import spark.implicits._  // for the $"str" column syntax

spark.range(2).withColumn("str", lit("abc%xyz_12$q"))
  .withColumn("replace", regexp_replace($"str", "[^0-9a-zA-Z]+", "_"))
  .show(false)
/**
* +---+------------+------------+
* |id |str |replace |
* +---+------------+------------+
* |0 |abc%xyz_12$q|abc_xyz_12_q|
* |1 |abc%xyz_12$q|abc_xyz_12_q|
* +---+------------+------------+
*/
// If you don't want certain special characters such as $ replaced, keep them in the class: [^0-9a-zA-Z$]+
spark.range(2).withColumn("str", lit("abc%xyz_12$q"))
  .withColumn("replace", regexp_replace($"str", "[^0-9a-zA-Z$]+", "_"))
  .show(false)
/**
* +---+------------+------------+
* |id |str |replace |
* +---+------------+------------+
* |0 |abc%xyz_12$q|abc_xyz_12$q|
* |1 |abc%xyz_12$q|abc_xyz_12$q|
* +---+------------+------------+
*/
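Note that regexp_replace rewrites the values of a column rather than the column names the question asks about. Since the question uses PySpark, here is a hedged PySpark translation of the same snippet (a sketch under that assumption, not the answerer's code):
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Replace every run of non-alphanumeric characters in the column values with '_'.
df = spark.range(2).withColumn("str", F.lit("abc%xyz_12$q"))
df = df.withColumn("replace", F.regexp_replace(F.col("str"), "[^0-9a-zA-Z]+", "_"))
df.show(truncate=False)
# +---+------------+------------+
# |id |str         |replace     |
# +---+------------+------------+
# |0  |abc%xyz_12$q|abc_xyz_12_q|
# |1  |abc%xyz_12$q|abc_xyz_12_q|
# +---+------------+------------+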