Question

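I have a PySpark DataFrame 'data3' with many columns. I am trying to run k-means on every column except the first two, but when I run the code the job always fails with TypeError: float() argument must be a string or a number, not 'NoneType'. What am I doing wrong?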

def f(x):
    rel = {}
    # Build the feature vector from columns 2 through 49 (the first two columns are skipped)
    rel['features'] = Vectors.dense([float(x[i]) for i in range(2, 50)])
    return rel

data= data3.rdd.map(lambda p: Row(**f(p))).toDF()
kmeansmodel = KMeans().setK(7).setFeaturesCol('features').setPredictionCol('prediction').fit(data)

TypeError: float() argument must be a string or a number, not 'NoneType'

Answer 1

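The error comes from converting the elements of x to float: some of your values are probably missing (null), and a null arrives in Python as None, which float() cannot convert.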

rel['features'] = Vectors.dense([float(x[i]) for i in range(2, 50)])
return rel
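The failure can be reproduced in plain Python, without Spark. This is only a minimal sketch: the tuple `row` and the helper `first_conversion_error` are made up for illustration, with the None standing in for a null column value in data3.

```python
# Hypothetical stand-in for one record of data3; None models a null column.
row = ("id-1", "label", 1.5, None, "3")


def first_conversion_error(values):
    """Return the message of the first TypeError raised by float(), or None."""
    for v in values:
        try:
            float(v)
        except TypeError as exc:
            return str(exc)
    return None


# Skip the first two columns, as in the question.
message = first_conversion_error(row[2:])
print(message)
```

Numbers and numeric strings convert without trouble; only the None triggers the TypeError.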

You can add a check so that each value is converted to float only when it is not missing. For example:

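list_of_Xs = [x[i] for i in range(2, 50)]  # x[2] through x[49]
floats = []
for v in list_of_Xs:
    # Convert only when the value is present; 0.0 is an assumed placeholder for missing values
    floats.append(float(v) if v is not None else 0.0)
rel['features'] = Vectors.dense(floats)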
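Or drop the rows that contain missing values before the conversion. Note that rel is a plain dict and has no dropna(); call it on the DataFrame instead, e.g. data3.dropna().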
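The two options above (substitute a value for each None, or discard incomplete rows) can be sketched in plain Python as follows. The helper names `to_floats` and `has_missing`, the sample `rows`, and the fill value 0.0 are all assumptions made for illustration; whether filling or dropping is appropriate depends on the data.

```python
def to_floats(row, start=2, fill=0.0):
    """Convert row[start:] to floats, substituting `fill` for None."""
    return [fill if v is None else float(v) for v in row[start:]]


def has_missing(row, start=2):
    """Return True if any value in row[start:] is None."""
    return any(v is None for v in row[start:])


# Hypothetical rows: two leading columns to skip, then the feature values.
rows = [
    ("a", "b", 1, "2.5", None),
    ("c", "d", 4, 5, 6),
]

# Option 1: keep every row and fill the gaps.
filled = [to_floats(r) for r in rows]

# Option 2: keep only rows with no missing feature values.
complete = [to_floats(r) for r in rows if not has_missing(r)]

print(filled)
print(complete)
```

Since a PySpark Row is a tuple subclass, the same slicing should work inside the question's f, for instance rel['features'] = Vectors.dense(to_floats(x)).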
