Question

I have a DataFrame (input_dataframe) that looks like this:

id        test_column
1           0.25
2           1.1
3           12
4           test
5           1.3334
6           12.0

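I want to add a column named result that is set to 1 if test_column holds a decimal value and to 0 if test_column holds any other value. The data type of test_column is string. Here is the expected output: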

id        test_column      result
1           0.25              1
2           1.1               1
3           12                0
4           test              0
5           1.3334            1
6           12.0              1

I have the following code to do this:

import decimal
from pyspark.sql.functions import expr
from pyspark.sql.types import IntegerType

def is_valid_decimal(s):
    try:
        return 0 if decimal.Decimal(s)._isinteger() else 1
    except decimal.InvalidOperation:
        return 0

# register the UDF for usage
sqlContext.udf.register("is_valid_decimal", is_valid_decimal, IntegerType())

# Using the UDF: call the registered function by name in a SQL expression
df.withColumn("result", expr("is_valid_decimal(test_column)"))

However, this code does not work when the decimal value looks like 12.0, 12.00, or 12.000. Is there a way to achieve this in PySpark?
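A likely reason the code above returns 0 for 12.0: decimal.Decimal("12.0") is numerically an integer, so the private _isinteger() method reports True no matter how many trailing zeros were written. A minimal sketch of one way around this, assuming that "has a decimal value" means "was written with digits after the point", is to inspect the exponent of the parsed value instead. The function name has_fractional_digits is hypothetical. Note that this check also counts scientific notation such as 1e-3 as decimal.

```python
import decimal

def has_fractional_digits(s):
    """Return 1 if s parses as a number written with digits after the point, else 0."""
    try:
        exponent = decimal.Decimal(s).as_tuple().exponent
    except (decimal.InvalidOperation, TypeError, ValueError):
        # Not a number at all (for example "test" or None)
        return 0
    # NaN and Infinity report a string exponent, so require an int here.
    # "12.0" parses with exponent -1, while "12" parses with exponent 0.
    return 1 if isinstance(exponent, int) and exponent < 0 else 0

for value in ["0.25", "1.1", "12", "test", "1.3334", "12.0"]:
    print(value, has_fractional_digits(value))
```

The function can be registered as a UDF in the same way as is_valid_decimal.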

Answer 1

You mentioned it is a string column, so I tried using a regular expression. Hope it helps.

>>> from pyspark.sql import functions as F
>>> from pyspark.sql.types import IntegerType
>>> import re
>>> df = spark.createDataFrame([(1,'0.25'),(2,'1.1'),(3,'12'),(4,'test'),(5,'1.3334'),(6,'12.0')],['id','test_col'])
>>> df.show()
+---+--------+
| id|test_col|
+---+--------+
|  1|    0.25|
|  2|     1.1|
|  3|      12|
|  4|    test|
|  5|  1.3334|
|  6|    12.0|
+---+--------+
>>> udf1 = F.udf(lambda x : 1 if re.match(r'^\d*[.]\d*$',x) else 0,IntegerType())
>>> df = df.withColumn('result',udf1(df.test_col))
>>> df.show()
+---+--------+------+
| id|test_col|result|
+---+--------+------+
|  1|    0.25|     1|
|  2|     1.1|     1|
|  3|      12|     0|
|  4|    test|     0|
|  5|  1.3334|     1|
|  6|    12.0|     1|
+---+--------+------+
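
As a possible alternative to the Python UDF, the same idea can be sketched with the built-in Column.rlike, which evaluates the regular expression inside Spark. The names DECIMAL_PATTERN, has_decimal_point, and add_result_column are hypothetical, and the pattern is an assumption that is stricter than the one above: it requires at least one digit on each side of the point (so "." and "12." give 0) and allows an optional sign.

```python
import re

# Hypothetical pattern: optional sign, digits, a literal point, digits.
DECIMAL_PATTERN = r'^[+-]?[0-9]+\.[0-9]+$'

def has_decimal_point(s):
    """Pure-Python version of the check, handy for trying values locally."""
    return 1 if s is not None and re.match(DECIMAL_PATTERN, s) else 0

def add_result_column(df, column='test_col'):
    """Add the result column with Column.rlike instead of a Python UDF."""
    # Imported here so has_decimal_point can be tried without Spark installed
    from pyspark.sql import functions as F
    matches = F.col(column).rlike(DECIMAL_PATTERN)
    # rlike yields null for null input, so fall back to 0 in that case
    return df.withColumn('result', F.coalesce(matches.cast('int'), F.lit(0)))
```

With the DataFrame from this answer, add_result_column(df).show() would display the new column. Built-in column functions usually avoid the serialization overhead of moving each row through a Python UDF.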
