I have an input Spark DataFrame, named df:
+---------------+----------------+-----------------------+
|Main_CustomerID|126+ Concentrate|2.5 Ethylhexyl_Acrylate|
+---------------+----------------+-----------------------+
| 725153| 3.0| 2.0|
| 873008| 4.0| 1.0|
| 625109| 1.0| 0.0|
+---------------+----------------+-----------------------+
I need to remove the special characters from the column names of df, as follows:
remove +
replace space with underscore
replace dot with underscore
So my df should look like:
+---------------+---------------+-----------------------+
|Main_CustomerID|126_Concentrate|2_5_Ethylhexyl_Acrylate|
+---------------+---------------+-----------------------+
| 725153| 3.0| 2.0|
| 873008| 4.0| 1.0|
| 625109| 1.0| 0.0|
+---------------+---------------+-----------------------+
Using Scala, I have done it like this:
var tableWithColumnsRenamed = df
for (field <- tableWithColumnsRenamed.columns) {
tableWithColumnsRenamed = tableWithColumnsRenamed
.withColumnRenamed(field, field.replaceAll("\\.", "_"))
}
for (field <- tableWithColumnsRenamed.columns) {
tableWithColumnsRenamed = tableWithColumnsRenamed
.withColumnRenamed(field, field.replaceAll("\\+", ""))
}
for (field <- tableWithColumnsRenamed.columns) {
tableWithColumnsRenamed = tableWithColumnsRenamed
.withColumnRenamed(field, field.replaceAll(" ", "_"))
}
df = tableWithColumnsRenamed
But when I combine the renames into a single loop,
for (field <- tableWithColumnsRenamed.columns) {
tableWithColumnsRenamed = tableWithColumnsRenamed
.withColumnRenamed(field, field.replaceAll("\\.", "_"))
.withColumnRenamed(field, field.replaceAll("\\+", ""))
.withColumnRenamed(field, field.replaceAll(" ", "_"))
}
the name of my column comes out as 126 Concentrate instead of 126_Concentrate. I don't want three separate for loops for this, though. Is there a better solution?
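The combined loop fails because all three withColumnRenamed calls use the original field as the source name: once the first effective rename fires, the original name no longer exists, so the later calls silently match nothing (for 126+ Concentrate only the plus-removal lands, leaving the space behind). A minimal sketch of the fix, chaining the replacements on the name and renaming each column exactly once (the sample header is taken from the question):

```scala
// Build the fully cleaned name first, then rename once per column.
val field = "126+ Concentrate"       // sample header from the question
val cleaned = field
  .replaceAll("\\.", "_")            // dot   -> underscore
  .replaceAll("\\+", "")             // drop the plus sign
  .replaceAll(" ", "_")              // space -> underscore
println(cleaned)                     // 126_Concentrate

// With the DataFrame from the question in scope, the single loop becomes:
// for (field <- tableWithColumnsRenamed.columns)
//   tableWithColumnsRenamed = tableWithColumnsRenamed.withColumnRenamed(
//     field,
//     field.replaceAll("\\.", "_").replaceAll("\\+", "").replaceAll(" ", "_"))
```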
Answer 0 (score: 4)
You can use withColumnRenamed with regex replaceAllIn and foldLeft as follows:
val columns = df.columns
val regex = """[+._, ]+"""
val replacingColumns = columns.map(regex.r.replaceAllIn(_, "_"))
val resultDF = replacingColumns.zip(columns).foldLeft(df) {
  (tempdf, name) => tempdf.withColumnRenamed(name._2, name._1)
}
resultDF.show(false)
which should give you
+---------------+---------------+-----------------------+
|Main_CustomerID|126_Concentrate|2_5_Ethylhexyl_Acrylate|
+---------------+---------------+-----------------------+
|725153 |3.0 |2.0 |
|873008 |4.0 |1.0 |
|625109 |1.0 |0.0 |
+---------------+---------------+-----------------------+
I hope the answer is helpful.
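A side note on the answer above (my own addition, not part of the original post): since replacingColumns keeps the original column order, the zip/foldLeft pair can also be collapsed into a single Dataset.toDF call, which renames every column at once. A sketch, using the same regex on the question's headers:

```scala
// Compute all cleaned names up front with the answer's regex...
val regex = """[+._, ]+""".r
val oldNames = Array("Main_CustomerID", "126+ Concentrate", "2.5 Ethylhexyl_Acrylate")
val newNames = oldNames.map(regex.replaceAllIn(_, "_"))
println(newNames.mkString(", "))
// Main_CustomerID, 126_Concentrate, 2_5_Ethylhexyl_Acrylate

// ...then, with Spark in scope, rename everything in one call:
// val resultDF = df.toDF(newNames: _*)
```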
Answer 1 (score: 3)
df
.columns
.foldLeft(df){(newdf, colname) =>
newdf.withColumnRenamed(colname, colname.replace("+", "").replace(" ", "_").replace(".", "_"))
}
.show
Answer 2 (score: 0)
In Java, you can iterate over the column names with df.columns(), fix each header string with String.replaceAll(regexPattern, replacement), and then rename the df headers with withColumnRenamed(headerName, correctedHeaderName).
For example:
for (String headerName : dataset.columns()) {
String correctedHeaderName = headerName.replaceAll("\\.", "_").replaceAll("\\+", "").replaceAll(" ", "_");
dataset = dataset.withColumnRenamed(headerName, correctedHeaderName);
}
dataset.show();
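One caveat worth spelling out for the Java snippet above (it applies to the Scala answers too, since both use the java.lang.String API): replaceAll treats its first argument as a regular expression, so a bare "+" is a dangling quantifier and throws PatternSyntaxException at runtime. It must be escaped, or the literal replace method used instead. A small Scala sketch:

```scala
import java.util.regex.PatternSyntaxException

// A bare "+" is an invalid regex: replaceAll throws at runtime.
val threw =
  try { "126+ Concentrate".replaceAll("+", "_"); false }
  catch { case _: PatternSyntaxException => true }

// Escaping the metacharacter, or using the literal replace, both work:
val escaped = "126+ Concentrate".replaceAll("\\+", "")
val literal = "126+ Concentrate".replace("+", "")
println(s"$threw / $escaped / $literal")   // true / 126 Concentrate / 126 Concentrate
```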
Answer 3 (score: 0)
Piggybacking on Ramesh's answer, here is a reusable function that uses the currying syntax with the .transform() method and also lowercases the columns (the original code for this answer was garbled, so this is a reconstruction of the described idea; the exact character class is an assumption):

def normalizeColumnNames(replacement: String)(df: DataFrame): DataFrame =
  df.columns.foldLeft(df) { (tempDf, name) =>
    // assumed character class; lowercase the result as described above
    tempDf.withColumnRenamed(name, """[+._ ]+""".r.replaceAllIn(name, replacement).toLowerCase)
  }

val resultDF = df.transform(normalizeColumnNames("_"))
Answer 4 (score: 0)
We can strip out all the special characters by mapping each column name to a new name, using replaceAll to substitute the individual characters. This line of code was tried and tested with Spark Scala:
df.select(
df.columns
    .map(colName => col(s"`${colName}`").as(colName.replaceAll("\\.", "_").replaceAll("\\+", "").replaceAll(" ", "_"))): _*
).show(false)