Question

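I want to use the Spark SQL substring function to take a substring of the string in each row of the first column, using the length of the string in the same row of the second column as a parameter.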

How can I do this?

The following setup is reproducible.

import pyspark
from pyspark.sql import functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([('prefix body suffix','suffix',)], ['a', 'b',])

The problematic call below raises TypeError: 'Column' object is not callable:

df = df.withColumn('noSuffix',
        F.substring(
            str = F.col('a'),
            pos = 1,
            len = F.length('a') - F.length('b')))

The following works, but I still cannot use the resulting integer in the substring function.

df = df.withColumn('length', F.length('a') - F.length('b'))
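
One possible way around this, sketched here under assumptions rather than as a confirmed answer: write the whole computation as a SQL expression string and pass it to F.expr, so that the length argument can refer to columns. The helper names no_suffix_expr and add_no_suffix are hypothetical, and the sketch assumes the column names are plain identifiers that need no backtick quoting.

```python
def no_suffix_expr(text_col: str, suffix_col: str) -> str:
    """Build a Spark SQL expression that keeps text_col minus its last length(suffix_col) characters."""
    # Assumes both names are simple identifiers (no spaces or special characters).
    return f"substring({text_col}, 1, length({text_col}) - length({suffix_col}))"


def add_no_suffix(df, text_col: str = "a", suffix_col: str = "b", out_col: str = "noSuffix"):
    """Return df (expected to be a pyspark.sql.DataFrame) with out_col added."""
    # Imported here so the string helper above can be used without Spark installed.
    from pyspark.sql import functions as F

    # F.expr parses the string as SQL, so every argument may be a column expression.
    return df.withColumn(out_col, F.expr(no_suffix_expr(text_col, suffix_col)))
```

With the df from the setup, add_no_suffix(df) would evaluate substring(a, 1, 18 - 6) for the sample row, i.e. the first 12 characters of 'prefix body suffix', which still include the space before the suffix.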

I also run into a problem when using the substring_index function.

df = df.withColumn('outCol',
        F.substring_index( 
            F.col('a'),
            F.col('b'),
            1))
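
Assuming the problem is that the Python wrapper F.substring_index expects a literal string as its delimiter rather than a Column, the same SQL-expression route may help here too, since the SQL form of substring_index takes expressions for its arguments. This is only a sketch; before_delimiter_expr and add_out_col are hypothetical names, and the column names are again assumed to be plain identifiers.

```python
def before_delimiter_expr(text_col: str, delim_col: str, count: int = 1) -> str:
    """Build a SQL expression for the part of text_col before the count-th occurrence of delim_col."""
    return f"substring_index({text_col}, {delim_col}, {count})"


def add_out_col(df, text_col: str = "a", delim_col: str = "b", out_col: str = "outCol"):
    """Return df (expected to be a pyspark.sql.DataFrame) with out_col appended."""
    # selectExpr accepts SQL strings directly; "*" keeps all existing columns.
    return df.selectExpr("*", f"{before_delimiter_expr(text_col, delim_col)} AS {out_col}")
```

Note that this variant cuts at the first place the value of b occurs in a, which differs from trimming by length if that value also appears earlier in the string.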

Is there a way to do this without writing an RDD function and then calling df.rdd.map(rddFunction).toDF()?

Using string length as a parameter in the pyspark.sql substring function

0 answers: