I need to run a linear regression in PySpark, and I am simply trying to follow the steps from this link:
https://towardsdatascience.com/building-a-linear-regression-with-pyspark-and-mllib-d065c3ba246a
In my case, I just imported the data into Databricks with the following code:
## Data import
# File location and type
file_location = "/FileStore/tables/Spark.csv"
file_type = "csv"
# CSV options
infer_schema = "false"
first_row_is_header = "True"
delimiter = ","
# The applied options are for CSV files. For other file types, these will be ignored.
df = spark.read.format(file_type) \
    .option("inferSchema", infer_schema) \
    .option("header", first_row_is_header) \
    .option("sep", delimiter) \
    .load(file_location)
Next, I need to assemble a feature vector containing the values of the explanatory variables, excluding the target, ID, and _c0 columns:
ignore = ['ID', 'target', '_c0']
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(
    inputCols=[x for x in df.columns if x not in ignore],
    outputCol='features')
My problem is that when I run this command:
new_df = assembler.transform(df)
I get this error:
IllegalArgumentException: u'Data type StringType of column is not supported
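I am a beginner with Spark and have searched a lot, but I really cannot understand this problem. As far as I know, VectorAssembler's transform should simply add a new single vector column, built from the selected variables, to the original DataFrame. Isn't that right? Please help!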
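A possible explanation, offered as a sketch rather than a confirmed diagnosis: when the CSV is read with inferSchema set to "false", Spark loads every column as StringType, and VectorAssembler only accepts numeric, boolean, and vector columns. The helpers below are hypothetical names (not part of PySpark); they assume every feature column holds values that can be parsed as numbers, and they do not handle column names containing dots.

```python
def columns_to_cast(dtypes, ignore):
    """Return the feature columns whose Spark type is 'string'.

    dtypes: list of (column_name, type_name) pairs, as returned by df.dtypes.
    ignore: column names to leave out of the feature vector.
    """
    return [name for name, dtype in dtypes
            if name not in ignore and dtype == "string"]


def cast_features_to_double(df, ignore):
    """Cast every string-typed feature column of a Spark DataFrame to double.

    Assumption: the values are numeric text; anything that cannot be parsed
    becomes null after the cast.
    """
    # Imported here so that columns_to_cast can be used without Spark installed.
    from pyspark.sql.functions import col

    for name in columns_to_cast(df.dtypes, ignore):
        df = df.withColumn(name, col(name).cast("double"))
    return df
```

With the names from the question, this would be used as `df = cast_features_to_double(df, ignore)` before calling transform. Setting infer_schema = "true" when reading the file may be enough on its own, provided Spark can infer numeric types for the feature columns.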