I am trying to merge two columns of different data types. In the snippet below, for simplicity, I select the columns from the same data frame.
from pyspark.sql import SQLContext, Row
from pyspark.sql.types import *
from datetime import datetime
a = sc.parallelize([
    ('ship1', datetime(2015, 1, 1),  2,   3.,  4.),
    ('ship1', datetime(2015, 1, 2),  4,   8.,  9.),
    ('ship1', datetime(2015, 1, 3),  5,  39., 49.),
    ('ship2', datetime(2015, 1, 4),  2,   3.,  4.),
    ('ship2', datetime(2015, 1, 5),  4,   4.,  6.),
    ('ship3', datetime(2015, 1, 15), 33, 56.,  6.),
    ('ship3', datetime(2015, 1, 12), 3, 566., 64.),
    ('ship4', datetime(2015, 1, 5),  3,   3., None),
])
schemaString = "name time ROT SOG COG"
strtype = [StringType(), TimestampType(), IntegerType(), FloatType(), FloatType()]
fields = [StructField(name, strtype[i], True)
          for i, name in enumerate(schemaString.split())]
schema = StructType(fields)
df = sqlContext.createDataFrame(a, schema)
df.show()
+-----+--------------------+---+-----+----+
| name| time|ROT| SOG| COG|
+-----+--------------------+---+-----+----+
|ship1|2015-01-01 00:00:...| 2| 3.0| 4.0|
|ship1|2015-01-02 00:00:...| 4| 8.0| 9.0|
|ship1|2015-01-03 00:00:...| 5| 39.0|49.0|
|ship2|2015-01-04 00:00:...| 2| 3.0| 4.0|
|ship2|2015-01-05 00:00:...| 4| 4.0| 6.0|
|ship3|2015-01-15 00:00:...| 33| 56.0| 6.0|
|ship3|2015-01-12 00:00:...| 3|566.0|64.0|
|ship4|2015-01-05 00:00:...| 3| 3.0|null|
+-----+--------------------+---+-----+----+
When I extract two columns from df into new DataFrames and try to merge them with df.withColumn():
b = df.select("time")
c = df.select("SOG")
d = b.withColumn("SOG", c.SOG)
I get:
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<ipython-input-48-4845b5dc1c80> in <module>()
4 c=aa.select("SOG")
5
----> 6 d=b.withColumn("SOG",c.SOG)
/hadoop-dist/spark/python/pyspark/sql/dataframe.pyc in withColumn(self, colName, col)
1166 [Row(age=2, name=u'Alice', age2=4), Row(age=5, name=u'Bob', age2=7)]
1167 """
-> 1168 return self.select('*', col.alias(colName))
1169
1170 @ignore_unicode_prefix
/hadoop-dist/spark/python/pyspark/sql/dataframe.pyc in select(self, *cols)
719 [Row(name=u'Alice', age=12), Row(name=u'Bob', age=15)]
720 """
--> 721 jdf = self._jdf.select(self._jcols(*cols))
722 return DataFrame(jdf, self.sql_ctx)
723
/hadoop-dist/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
536 answer = self.gateway_client.send_command(command)
537 return_value = get_return_value(answer, self.gateway_client,
--> 538 self.target_id, self.name)
539
540 for temp_arg in temp_args:
/hadoop-dist/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
298 raise Py4JJavaError(
299 'An error occurred while calling {0}{1}{2}.\n'.
--> 300 format(target_id, '.', name), value)
301 else:
302 raise Py4JError(
Py4JJavaError: An error occurred while calling o231.select.
: org.apache.spark.sql.AnalysisException: resolved attribute(s) SOG#3 missing from time#1 in operator !Project [time#1,SOG#3 AS SOG#34];
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:42)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:121)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:98)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:50)
at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:42)
at org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:931)
at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:131)
at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154)
at org.apache.spark.sql.DataFrame.select(DataFrame.scala:595)
at sun.reflect.GeneratedMethodAccessor12.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
Am I doing something silly, or is this some kind of bug?
(I am able to merge columns of similar data types.)
Answer 0 (score: 7)
What you are trying to achieve here is simply not supported. With DataFrame.withColumn, a column expression can reference only columns of the given data frame. There is no way to add a column based on data from another table.
If you want to combine columns from multiple data frames, you have to use join. That means you need either a natural key that can be used to match the corresponding rows, or a join performed on row numbers. The latter is possible only when you can guarantee that the order of rows in both data frames is exactly the same.