Question

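I'm working in PySpark and I have a variable LATITUDE that has a lot of decimal places. I need to create two new variables from it, one rounded and one truncated, both to three decimal places.

What is the simplest way to truncate a value?

For rounding, I did: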

raw_data = raw_data.withColumn("LATITUDE_ROUND", round(raw_data.LATITUDE, 3))

That seems to work, but if there is a better way, let me know.

Answer 1

Spark 1.5.2

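You can simply use the format_number(col, d) function, which rounds the numeric input to d decimal places and returns it as a string. In your case, that means passing the LATITUDE column with d = 3.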

Answer 2

Try:

>>> from pyspark.sql.functions import col, lit, pow
>>> from pyspark.sql.types import LongType
>>>
>>> num_places = 3
>>> m = pow(lit(10), num_places).cast(LongType())
>>> df = sc.parallelize([(0.6643, ), (0.6446, )]).toDF(["x"])
>>> df.withColumn("trunc", (col("x") * m).cast(LongType()) / m)

Answer 3

You can use the floor() function (untested): scale the value up by a power of ten, apply floor(), and scale it back down.

But watch out for negative values, see e.g. https://math.stackexchange.com/questions/344815/how-do-the-floor-and-ceiling-functions-work-on-negative-numbers
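To make the warning about negative values concrete, here is a plain-Python sketch that uses math.floor as a stand-in for Spark's floor(), which likewise rounds toward negative infinity. The helper names floor_to and trunc_to are hypothetical.

```python
import math

def floor_to(x, places):
    """Scale, floor (round toward negative infinity), and scale back."""
    factor = 10 ** places
    return math.floor(x * factor) / factor

def trunc_to(x, places):
    """Scale, truncate (round toward zero), and scale back."""
    factor = 10 ** places
    return math.trunc(x * factor) / factor

# For positive input, floor and truncation agree.
pos_floor, pos_trunc = floor_to(0.6643, 3), trunc_to(0.6643, 3)

# For negative input they differ: floor moves away from zero.
neg_floor, neg_trunc = floor_to(-0.6643, 3), trunc_to(-0.6643, 3)
```

So for a column that can hold negative latitudes, a floor-based expression gives -0.665 where true truncation would give -0.664.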

Answer 4

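Here is a solution using a simple UDF:

import math
import pyspark.sql.functions as F

@F.udf("double")  # declare the return type; the default would be a string
def trunc_float(num, precision):
    # Scale up, drop the fractional part, then scale back down.
    factor = 10 ** precision
    return math.trunc(num * factor) / factor

Usage example:

df.select(trunc_float(F.lit(0.0017),F.lit(3)))

This yields 0.001.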
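One caveat with truncating binary floats by scaling: some values that look exact in decimal sit just below the intended number once multiplied, so a digit can be lost. The sketch below is plain Python and only mirrors the arithmetic of the UDF body, with no Spark involved.

```python
import math

def trunc_float(num, precision):
    """Plain-Python version of the scale / truncate / unscale arithmetic."""
    factor = 10 ** precision
    return math.trunc(num * factor) / factor

# The usage example from the answer behaves as expected.
expected = trunc_float(0.0017, 3)

# 0.29 * 100 evaluates to 28.999999999999996 in binary floating point,
# so truncating to two places drops to 0.28 instead of staying at 0.29.
surprising = trunc_float(0.29, 2)
```

If that matters for your data, a decimal-based approach avoids the problem.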

Answer 5

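Use a UDF with Python's Decimal type. This is also useful if you already have a UDF that returns a Decimal but need to avoid overflow, since Python's Decimal can be larger than PySpark's DecimalType allows (precision is capped at 38; 38,18 is used here):

import decimal as D
import pyspark.sql.functions as F
import pyspark.sql.types as T

@F.udf(T.DecimalType(38, 18))
def trunc_precision(val: D.Decimal, precision: int):
    # 10 ** -precision is the quantum (0.001 for precision 3);
    # ROUND_DOWN truncates toward zero instead of rounding.
    return val.quantize(D.Decimal(10) ** -precision, rounding=D.ROUND_DOWN)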
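A plain-Python sketch of the quantize step on its own, assuming the exponent is meant to be -precision and that truncation (ROUND_DOWN) rather than quantize's default rounding is wanted. The helper name trunc_decimal is hypothetical.

```python
from decimal import Decimal, ROUND_DOWN

def trunc_decimal(val, precision):
    """Truncate a Decimal toward zero at the given number of decimal places."""
    quantum = Decimal(10) ** -precision  # e.g. Decimal('0.001') for precision 3
    return val.quantize(quantum, rounding=ROUND_DOWN)

truncated = trunc_decimal(Decimal("0.6646"), 3)

# Without an explicit rounding mode, quantize rounds instead of truncating.
rounded = Decimal("0.6646").quantize(Decimal("0.001"))

# ROUND_DOWN truncates toward zero, so negative values keep their digits too.
negative = trunc_decimal(Decimal("-0.6646"), 3)
```

Passing rounding=ROUND_DOWN explicitly is what makes this a truncation; leaving it out would round 0.6646 up to 0.665.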

PySpark: truncating a decimal
