Spark for Python - can't cast a string column to decimal / double

Asked: 2017-10-25 10:02:07

Tags: apache-spark pyspark apache-spark-sql pyspark-sql

Among all the questions posted about this operation, I could not find anything useful.

I am trying several versions, and in all of them I start with this DataFrame:

dataFrame = spark.read.format("com.mongodb.spark.sql").load()

The printout of dataFrame.printSchema():

root
 |-- SensorId: string (nullable = true)
 |-- _id: struct (nullable = true)
 |    |-- oid: string (nullable = true)
 |-- _type: string (nullable = true)
 |-- device: string (nullable = true)
 |-- deviceType: string (nullable = true)
 |-- event_id: string (nullable = true)
 |-- gen_val: string (nullable = true)
 |-- lane_id: string (nullable = true)
 |-- system_id: string (nullable = true)
 |-- time: string (nullable = true)

After creating the DataFrame, I want to cast the 'gen_val' column (whose name is stored in the variable results.inputColumns) from String to Double. The different versions produce different errors.

Version #1

Code:

dataFrame = dataFrame.withColumn(results.inputColumns, dataFrame[results.inputColumns].cast('double'))

Using cast(DoubleType()) instead produces the same error.

Error:

AttributeError: 'DataFrame' object has no attribute 'cast'
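
A likely explanation (an assumption based on the traceback, since the content of results.inputColumns is not shown): cast is a method of Column, not DataFrame, and indexing a pyspark DataFrame with a list of names selects those columns and returns another DataFrame, while indexing with a single string returns a Column. The classes below are hypothetical plain-Python stand-ins sketching that distinction, not Spark itself:

```python
# Minimal stand-ins mimicking pyspark's indexing behaviour:
# df[name] returns a Column (which has .cast), while df[[names]] selects
# columns and returns another DataFrame (which does not).
class Column:
    def __init__(self, name):
        self.name = name

    def cast(self, to_type):
        # Stand-in for pyspark's Column.cast
        return f"CAST({self.name} AS {to_type})"

class DataFrame:
    def __getitem__(self, key):
        if isinstance(key, list):   # a list selects columns -> DataFrame
            return DataFrame()
        return Column(key)          # a single string -> Column

df = DataFrame()
assert df["gen_val"].cast("double") == "CAST(gen_val AS double)"
assert not hasattr(df[["gen_val"]], "cast")   # the source of the AttributeError
```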

Version #2

Code:

dataFrame = dataFrame.withColumn(results.inputColumns, dataFrame['gen_val'].cast('double'))

Even though this option is not really relevant, since the parameter cannot be hard-coded...

Error:

dataFrame = dataFrame.withColumn(results.inputColumns, dataFrame['gen_val'].cast('double'))
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 1502, in withColumn
  File "/usr/local/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/usr/local/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 323, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling o31.withColumn. Trace:
py4j.Py4JException: Method withColumn([class java.util.ArrayList, class org.apache.spark.sql.Column]) does not exist
        at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
        at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
        at py4j.Gateway.invoke(Gateway.java:272)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:214)
        at java.lang.Thread.run(Thread.java:748)
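
The Py4J message above ("Method withColumn([class java.util.ArrayList, ...]) does not exist") suggests that withColumn received a Python list where a single column-name string was expected. A minimal sketch of the corresponding fix, where input_columns is a hypothetical stand-in for what results.inputColumns appears to hold (plain Python, no Spark required):

```python
# Hypothetical stand-in for results.inputColumns, which the Py4J trace
# implies is a list rather than a string:
input_columns = ["gen_val"]

# withColumn's first argument must be one column-name string, so unwrap it:
name = input_columns[0]
assert isinstance(name, str)

# With a real Spark DataFrame, the cast would then be:
# dataFrame = dataFrame.withColumn(name, dataFrame[name].cast('double'))
```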

Thanks for your help.

2 answers:

Answer 0 (score: 1)

It is not very clear what you are trying to do; the first argument of withColumn should be a dataframe column name, either an existing one (to be modified) or a new one (to be created), while (at least in your Version #1) you use results.inputColums as if it were already a column (which it is not).

In any case, casting a string to a double is straightforward; here is a toy example:

spark.version
# u'2.2.0'

from pyspark.sql.types import DoubleType

df = spark.createDataFrame([("foo", '1'), ("bar", '2')], schema=['A', 'B'])
df
# DataFrame[A: string, B: string]
df.show()
# +---+---+ 
# |  A|  B|
# +---+---+
# |foo|  1| 
# |bar|  2|
# +---+---+

df2 = df.withColumn('B', df['B'].cast('double'))
df2.show()
# +---+---+ 
# |  A|  B|
# +---+---+
# |foo|1.0| 
# |bar|2.0|
# +---+---+
df2
# DataFrame[A: string, B: double]

In your case, this should do the job:

dataFrame = dataFrame.withColumn('gen_val', dataFrame['gen_val'].cast('double'))

Answer 1 (score: 0)

I tried something else and it worked - instead of changing the input column's data, I created a cast/converted column. I think it is less efficient, but that is what I have for now.

from pyspark.ml.feature import VectorAssembler

dataFrame = spark.read.format("com.mongodb.spark.sql").load()
col = dataFrame.gen_val.cast('double')
dataFrame = dataFrame.withColumn('doubled', col)  # 'col' is already a double; a second .cast('double') is redundant
assembler = VectorAssembler(inputCols=["doubled"], outputCol="features")
output = assembler.transform(dataFrame)

Zhang Tong: this is the printout of dataFrame.printSchema():

root
 |-- SensorId: string (nullable = true)
 |-- _id: struct (nullable = true)
 |    |-- oid: string (nullable = true)
 |-- _type: string (nullable = true)
 |-- device: string (nullable = true)
 |-- deviceType: string (nullable = true)
 |-- event_id: string (nullable = true)
 |-- gen_val: string (nullable = true)
 |-- lane_id: string (nullable = true)
 |-- system_id: string (nullable = true)
 |-- time: string (nullable = true)

Anyway, this is a very basic transformation, and in the (near) future I will need to do more complex ones. If any of you know good examples, explanations, or documentation about DataFrame transformations with Spark and Python, I would appreciate it.
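
As one pointer in that direction: a common pattern is to fold withColumn over a list of column names to cast several string columns in one pass. The helper below is a sketch (cast_columns is a hypothetical name, not part of Spark); it assumes only the DataFrame.withColumn / Column.cast interface already used in the answers above:

```python
from functools import reduce

def cast_columns(df, names, to_type="double"):
    """Return a dataframe with each column in `names` cast to `to_type`.

    Works on any object exposing Spark's DataFrame interface:
    df[name] yields a Column, and df.withColumn(name, column)
    returns a dataframe with that column replaced.
    """
    return reduce(lambda d, c: d.withColumn(c, d[c].cast(to_type)), names, df)

# With a real Spark DataFrame this would be used as, e.g.:
# dataFrame = cast_columns(dataFrame, ["gen_val", "lane_id"], "double")
```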