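When I join two DataFrames where one key is a decimal and the other is a string, I get duplicated rows. It looks like Spark converts the decimal to a string, which produces a scientific-notation representation, yet it then displays the original value in decimal form just fine. I found a workaround by casting directly to string, but this seems dangerous because the duplicates are created without any warning. Is this a bug? How can I detect when it happens?
Here is a pyspark demo on Spark 2.4: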
>>> from pyspark.sql.functions import *
>>> from pyspark.sql.types import *
>>> df1 = spark.createDataFrame([('a', 9223372034559809871), ('b', 9223372034559809771)], ['group', 'id_int'])
>>> df1=df1.withColumn('id',col('id_int').cast(DecimalType(38,0)))
>>>
>>> df1.show()
+-----+-------------------+-------------------+
|group|             id_int|                 id|
+-----+-------------------+-------------------+
|    a|9223372034559809871|9223372034559809871|
|    b|9223372034559809771|9223372034559809771|
+-----+-------------------+-------------------+
>>>
>>> df2= spark.createDataFrame([(1, '9223372034559809871'), (2, '9223372034559809771')], ['value', 'id'])
>>> df2.show()
+-----+-------------------+
|value|                 id|
+-----+-------------------+
|    1|9223372034559809871|
|    2|9223372034559809771|
+-----+-------------------+
>>>
>>> df1.join(df2, ["id"]).show()
+-------------------+-----+-------------------+-----+
|                 id|group|             id_int|value|
+-------------------+-----+-------------------+-----+
|9223372034559809871|    a|9223372034559809871|    1|
|9223372034559809871|    a|9223372034559809871|    2|
|9223372034559809771|    b|9223372034559809771|    1|
|9223372034559809771|    b|9223372034559809771|    2|
+-------------------+-----+-------------------+-----+
>>> df1.dtypes
[('group', 'string'), ('id_int', 'bigint'), ('id', 'decimal(38,0)')]
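Answer 0 (score: 0)
This happens because the values in the join key are very large, and the two keys in the question differ by only 100.
I changed the values used in the join so that the two keys are much further apart (9223372034559809871 and 9123372034559809771), and the join gave me the correct result: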
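from pyspark.sql.functions import col
from pyspark.sql.types import DecimalType

# Same setup as in the question, but the second key now differs from the first by a wide margin
df1 = spark.createDataFrame([('a', 9223372034559809871), ('b', 9123372034559809771)],
                            ['group', 'id_int'])
df1 = df1.withColumn('id', col('id_int').cast(DecimalType(38, 0)))
df2 = spark.createDataFrame([(1, '9223372034559809871'), (2, '9123372034559809771')],
                            ['value', 'id'])
# Inner join on the decimal id column and the string id column
df1.join(df2, df1["id"] == df2["id"], "inner").show()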
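One plausible explanation for why the size of the values matters, offered here as an assumption rather than something confirmed in this thread: if Spark compares the decimal key and the string key as 64-bit doubles, then integers this large can no longer be told apart when they are close together. The plain-Python sketch below does not use Spark at all. It only shows that the two original keys map to the same float while the adjusted keys do not. The helper name `colliding_keys` is made up for illustration.

```python
# Sketch only: shows that nearby large integers can share one 64-bit float value.
# Assumption (not confirmed by the post): the join compares both keys as doubles.

def colliding_keys(keys):
    """Group integer keys by their float value; return groups holding more than one key."""
    groups = {}
    for key in keys:
        groups.setdefault(float(key), []).append(key)
    return [sorted(group) for group in groups.values() if len(group) > 1]

# Keys from the question: they differ by only 100
original = [9223372034559809871, 9223372034559809771]
# Keys from the answer: they differ by 10**17
adjusted = [9223372034559809871, 9123372034559809771]

print(float(original[0]) == float(original[1]))  # True: both round to the same double
print(colliding_keys(original))                  # one group containing both keys
print(colliding_keys(adjusted))                  # [] (no collisions)
```

A check like this could be run on a sample of join keys collected to the driver to flag values that would be indistinguishable as doubles. If the assumption holds, casting both sides to the same exact type before joining (as the question's string-cast workaround does) avoids the lossy comparison.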