Question

In PySpark 1.6 DataFrames, there is currently no built-in Spark function to convert from string to float/double.

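Suppose we have an RDD of ('house_name', 'price') pairs, where both values are strings. You want to convert price from string to float. In PySpark, we can apply map with Python's float function to do this.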

New_RDD = RawDataRDD.map(lambda row: (row[0], float(row[1])))    # it works

In a PySpark 1.6 DataFrame, it does not work:

New_DF = rawdataDF.select('house name', float('price')) #did not work
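The reason the line above fails can be reproduced without Spark. The following is a minimal sketch in plain Python (the variable name `message` is only illustrative): `float('price')` is evaluated by Python on the literal string before `select` is ever called, so it never sees the column's values.

```python
# Plain Python, no Spark needed: the argument is the literal text 'price',
# so float() tries to parse the word itself and raises before select() runs.
try:
    float('price')
    message = None
except ValueError as err:
    message = str(err)

print(message)
```

A column therefore has to be converted with a column expression or a UDF, which Spark evaluates row by row.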

Until a built-in PySpark function is available, how can this conversion be done with a UDF? I developed the conversion UDF as follows:

from pyspark.sql.functions import udf

from pyspark.sql.types import FloatType

def string_to_float(x):
    return float(x)

udfstring_to_float = udf(string_to_float, FloatType())

rawdata.withColumn("price", udfstring_to_float("price"))

Is there a better, simpler way to achieve the same thing?

Answer 1

According to the documentation, you can use the cast function on a column, as follows:

from pyspark.sql.types import DoubleType

rawdata.withColumn("price", rawdata["price"].cast(DoubleType()))

Answer 2

The answer should be as follows:

>>> rawdata.printSchema()
root
 |-- house name: string (nullable = true)
 |-- price: string (nullable = true)

>>> rawdata=rawdata.withColumn('price',rawdata['price'].cast("float").alias('price'))

>>> rawdata.printSchema()
root
 |-- house name: string (nullable = true)
 |-- price: float (nullable = true)

This is the shortest one-liner, and it does not use any user-defined function. You can use the printSchema() function to check that it worked.
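If a UDF is still needed (for example, for custom parsing), note that a bare float() raises on None or malformed strings, which would fail the job, whereas cast, as far as I know, yields null for strings it cannot parse under Spark's default non-ANSI settings. Below is a minimal sketch of a more defensive helper; the None-on-failure behaviour is my assumption about what is wanted, not something stated in the question.

```python
def string_to_float(value):
    """Return value parsed as a float, or None when it cannot be parsed."""
    if value is None:
        return None
    try:
        return float(value)
    except (TypeError, ValueError):
        # Assumption: unparseable input should become a null, not an error.
        return None

# Assumed usage with the question's DataFrame (needs a running Spark session):
#   from pyspark.sql.functions import udf
#   from pyspark.sql.types import FloatType
#   udf_string_to_float = udf(string_to_float, FloatType())
#   rawdata = rawdata.withColumn("price", udf_string_to_float(rawdata["price"]))
```

For a plain type conversion, cast remains the simpler choice, since it avoids shipping each row through Python.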

PySpark 1.6: DataFrame: Converting one column from string to float/double
