Escaping double-quoted column names with VectorAssembler in PySpark

Asked: 2019-12-12 11:46:01

Tags: apache-spark pyspark pyspark-sql pyspark-dataframes

I'm working on a basic credit-card fraud detection algorithm using data I downloaded from Kaggle.

The column names appear to include literal double quotes, e.g.: "Time", "V1", "V2", "V3", "V4", "V5", "V6", "V7", "V8", "V9", "V10", "V11", "V12", "V13", "V14", "V15", "V16", "V17", "V18", "V19", "V20", "V21", "V22" ...
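For reference, I loaded the data roughly like this (the path and read options are placeholders on my part), and printing repr() of the names shows whether the quotes are literally part of them:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder read of the Kaggle CSV.
df = spark.read.csv("creditcard.csv", header=True, inferSchema=True)

# repr() exposes quotes embedded in a name, e.g. '"Time"' vs. 'Time'.
print([repr(c) for c in df.columns])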

What I want to do is use VectorAssembler to combine all of the columns into a single features column for MLlib, as follows:

from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(
    inputCols=["Time","V1","V2","V3","V4","V5","V6","V7","V8","V9","V10","V11","V12","V13","V14","V15","V16","V17","V18","V19","V20","V21","V22","V23","V24","V25","V26","V27","V28","Amount"],
    outputCol="features")

output = assembler.transform(df)

but I get this error:

IllegalArgumentException: 'Field "Time" does not exist.\nAvailable fields: "Time","V1","V2","V3","

I realize this is caused by the double quotes in the column names, because I tried changing a single column name with:

 df1 = df.selectExpr("""'Time' as test""")

and that worked. However, given that there are 30 columns in this example, and possibly many more in the next one, hard-coding a select over every column doesn't seem feasible.
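What I'd really like is to strip the quotes from every name in one pass, roughly like this (an untested sketch on my part; it assumes the quotes really are part of the names, and that the label column is called "Class" as in the Kaggle file):

# Strip literal double quotes from every column name in one pass.
df_clean = df.toDF(*[c.strip('"') for c in df.columns])

# Build the input columns programmatically instead of hard-coding them,
# leaving out the "Class" label column.
assembler = VectorAssembler(
    inputCols=[c for c in df_clean.columns if c != "Class"],
    outputCol="features")
output = assembler.transform(df_clean)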

I have tried every escaping syntax I could think of, i.e.:

 inputCols=['"Time"']
 inputCols=["'Time'"]
 inputCols=[""Time""]
 inputCols=["""Time"""]
 inputCols=["´Time´"]

but they all give the same error. Is there any solution, or should I just hard-code a select statement to rename every column?
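One more idea I haven't been able to verify: Spark SQL quotes identifiers with backticks, and a backtick-quoted identifier may contain double quotes, so something like this might reference a column literally named "Time" (sketch only):

# Backticks are Spark SQL's identifier quotes; a column whose name literally
# includes double quotes can (presumably) be referenced as `"Time"`.
df1 = df.selectExpr('`"Time"` as Time')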

0 Answers:

No answers.