Question

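I am using Spark with Python. After loading a CSV file, I need to parse a column whose values are 22-digit numbers. To parse that column I used LongType(), and I used the map() function to define the columns. Here are my commands in pyspark.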

>>> test=sc.textFile("test.csv")
>>> header=test.first()
>>> schemaString = header.replace('"','')
>>> testfields = [StructField(field_name, StringType(), True) for field_name in schemaString.split(',')]
>>> testfields[5].dataType = LongType()
>>> testschema = StructType(testfields)
>>> testHeader = test.filter(lambda l: "test_date" in l)
>>> testNoHeader = test.subtract(testHeader)
>>> test_temp = testNoHeader.map(lambda k: k.split(",")).map(lambda
p:(p[0],p[1],p[2],p[3],p[4],float(p[5].strip('"')),p[6],p[7]))
>>> test_temp.top(2)

Note: I also tried 'long' and 'bigint' in place of 'float' in my test_temp variable, but the error in Spark was 'keyword not found'. The output is as follows:

[('2012-03-14', '7', '1698.00', 'XYZ02abc008793060653', 'II93', 8.27370028700801e+21, 'W0W0000000000007', '879870080088815007'), ('2002-03-14', '1', '999.00', 'ABC02E000050086941', 'II93', 8.37670028702205e+21, 'A0B0080000012523', '870870080000012421')]

The corresponding values in my CSV file are as follows: 8.27370028700801e+21 is 8273700287008010012345, and 8.37670028702205e+21 is 8376700287022050054321.

When I create a DataFrame from it and then query it,

>>> test_df = sqlContext.createDataFrame(test_temp, testschema)
>>> test_df.registerTempTable("test")
>>> sqlContext.sql("SELECT test_column FROM test").show()

test_column comes back as 'null' for every record.

So how can I solve this problem of parsing large numbers in Spark? I would really appreciate your help.

Answer 1

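Well, types matter. Since you convert your data to float, you cannot use LongType in the DataFrame. It doesn't blow up only because PySpark is relatively forgiving when it comes to types.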

Also, 8273700287008010012345 is too large to be represented as LongType, which can only represent values between -9223372036854775808 and 9223372036854775807.
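The range limit can be checked in plain Python before any Spark schema is involved. This is only a minimal sketch; the helper name `fits_in_long` and the two constants are made up for illustration and are not part of PySpark.

```python
# Signed 64-bit bounds, i.e. the range quoted above for LongType.
LONG_MIN = -(2 ** 63)
LONG_MAX = 2 ** 63 - 1

def fits_in_long(text):
    """Return True if the decimal string parses to a value inside the signed 64-bit range."""
    return LONG_MIN <= int(text) <= LONG_MAX

value = "8273700287008010012345"            # value from the question's CSV
print(len(value))                           # number of digits in the value
print(fits_in_long(value))                  # the 22-digit value is out of range
print(fits_in_long("9223372036854775807"))  # the largest value that still fits
```

Python integers are arbitrary precision, so `int(text)` itself never overflows; the comparison against the bounds is what shows the value cannot be stored as a 64-bit long.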

If you want your data to go into a DataFrame, you'll have to use DoubleType:

from pyspark.sql.types import *

rdd = sc.parallelize([(8.27370028700801e+21, )])
schema = StructType([StructField("x", DoubleType(), False)])
rdd.toDF(schema).show()

## +-------------------+
## |                  x|
## +-------------------+
## |8.27370028700801E21|
## +-------------------+

Generally speaking, it is better to handle this with DataFrames directly:

from pyspark.sql.functions import col

str_df = sc.parallelize([("8273700287008010012345", )]).toDF(["x"])
str_df.select(col("x").cast("double")).show()

## +-------------------+
## |                  x|
## +-------------------+
## |8.27370028700801E21|
## +-------------------+
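One caveat with a double is that it does not keep all 22 digits. The sketch below uses only the Python standard library (no Spark) to compare the float conversion with an exact decimal one; the variable names are illustrative.

```python
from decimal import Decimal

text = "8273700287008010012345"   # value from the question's CSV
exact = int(text)

as_float = float(text)        # what float(p[5]) or a cast to double stores
as_decimal = Decimal(text)    # exact base-10 representation

# Distance between the stored double and the original integer.
float_error = abs(int(as_float) - exact)

print(float_error != 0)       # the double is not equal to the original value
print(as_decimal == exact)    # the Decimal keeps every digit
```

A double carries a 53-bit significand, so at this magnitude neighbouring representable values are 2**20 apart and the trailing digits cannot all be preserved.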

If you don't want to use Double, you can cast to Decimal with a specified precision:

str_df.select(col("x").cast(DecimalType(38))).show(1, False)

## +----------------------+
## |x                     |
## +----------------------+
## |8273700287008010012345|
## +----------------------+
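If the rows are still parsed by hand as in the question, the same idea can be applied at the parsing step by building a `decimal.Decimal` instead of a float. The following is a hedged sketch in plain Python: `parse_row`, its `big_col` parameter, and the sample line are hypothetical, and pairing the resulting tuples with a `DecimalType(38)` field in the schema is an assumption that is not exercised here.

```python
import csv
from decimal import Decimal

def parse_row(line, big_col=5):
    """Split one CSV line and convert the column at big_col to Decimal.

    All other fields are left as strings. csv.reader removes the
    surrounding double quotes, so no manual strip('"') is needed.
    """
    fields = next(csv.reader([line]))
    fields[big_col] = Decimal(fields[big_col])
    return tuple(fields)

# Hypothetical sample line modeled on the rows shown in the question.
sample = ('2012-03-14,7,1698.00,XYZ02abc008793060653,II93,'
          '"8273700287008010012345",W0W0000000000007,879870080088815007')

row = parse_row(sample)
print(row[5])   # the full 22-digit value, with no exponent notation
```

Using `csv.reader` rather than `split(",")` is a design choice of this sketch: it also copes with quoted fields that contain commas.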
