Conditionally populate a new column in a Spark dataframe with content extracted from another column using a regular expression

Date: 2019-01-24 16:34:34

Tags: python apache-spark pyspark

I have a Spark dataframe containing the contents of a JSON file. I need to create a new column that is populated conditionally based on the contents of another column.

Say I have a column containing some numbers, and my new column should be populated depending on their values (for example: if the number in the first column is lower than 5, the new column is filled with the string 'Lower than five'; if the value is greater than 5, the new column is filled with 'Greater than five').

I know I can do something like that with the when function:

file.withColumn('newcolumn', \
                F.when(file.oldColumn < 5, 'Lower than five') \
                .when(file.oldColumn > 5, 'Greater than five')).show()

But what if 'oldColumn' is not just an integer, but a string from which I need to extract the integer?

For example 'PT5M', from which I need to extract the 5, and I also need to account for strings like 'PTM' that contain no digits at all.

So far I have managed to extract the number from the first column using regexp_extract, but I am struggling to default the null values to 0.
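
Roughly what I have so far (a sketch; 'file' and 'oldColumn' stand in for my real dataframe and column names):

from pyspark.sql import functions as F

# Extract the first run of digits; for a string like 'PTM' there is no match,
# so regexp_extract returns '' and the cast to int yields null rather than 0.
file = file.withColumn('minutes',
                       F.regexp_extract(file.oldColumn, r'(\d+)', 1).cast('int'))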

Example, where 1 is the original column and 2 is the new column:

+-------+-------------------+
|1      |  2                |
+-------+-------------------+
|PT5M   |  Lower than five  |   
|PT10M  |  Greater than five|    
|PT11M  |  Greater than five|        
+-------+-------------------+

Thanks for your help!

3 answers:

Answer 0 (score: 1)

Replace the non-digit characters with an empty string using regexp_replace, then set the column value with when:

file.withColumn('newcolumn', \
                F.when(F.regexp_replace(file.oldColumn, '[^0-9]', '') == '', 'Lower than five') \
                .when(F.regexp_replace(file.oldColumn, '[^0-9]', '').cast('int') < 5, 'Lower than five') \
                .otherwise('Greater than five')).show()

Answer 1 (score: 0)

There are many ways to do this.

scala> val df = Seq("PT5M","PT10M","PT11M").toDF("a")
df: org.apache.spark.sql.DataFrame = [a: string]

scala> df.show(false)
+-----+
|a    |
+-----+
|PT5M |
|PT10M|
|PT11M|
+-----+

scala> df.withColumn("b",regexp_extract('a,"""\D*(\d+)\D*""",1)).show(false)
+-----+---+
|a    |b  |
+-----+---+
|PT5M |5  |
|PT10M|10 |
|PT11M|11 |
+-----+---+


scala> df.withColumn("b",regexp_extract('a,"""\D*(\d+)\D*""",1)).withColumn("c", when('b.cast("int") < 5, "Lower than five").when('b.cast("int") > 5, "Greater than five").otherwise("null")).show(false)
+-----+---+-----------------+
|a    |b  |c                |
+-----+---+-----------------+
|PT5M |5  |null             |
|PT10M|10 |Greater than five|
|PT11M|11 |Greater than five|
+-----+---+-----------------+


scala>

If a value contains no digits and you want it to default to 0, you can use coalesce():

scala> val df = Seq("PT5M","PT10M","PT11M", "XXM").toDF("a")
df: org.apache.spark.sql.DataFrame = [a: string]

scala> df.show
+-----+
|    a|
+-----+
| PT5M|
|PT10M|
|PT11M|
|  XXM|
+-----+


scala> df.withColumn("b",coalesce(regexp_extract('a,"""\D*(\d+)\D*""",1).cast("int"),lit(0))).withColumn("c", when('b < 5, "Lower than five").when('b > 5, "Greater than five").otherwise("null")).show(false)
+-----+---+-----------------+
|a    |b  |c                |
+-----+---+-----------------+
|PT5M |5  |null             |
|PT10M|10 |Greater than five|
|PT11M|11 |Greater than five|
|XXM  |0  |Lower than five  |
+-----+---+-----------------+


scala>
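
For reference, a PySpark sketch of the same coalesce() idea (assuming a SparkSession named spark; the data and column names simply follow the example above):

from pyspark.sql import functions as F

df = spark.createDataFrame([('PT5M',), ('PT10M',), ('PT11M',), ('XXM',)], ['a'])

# coalesce() turns the failed-extraction null into 0 before comparing.
df = df.withColumn('b', F.coalesce(
        F.regexp_extract('a', r'\D*(\d+)\D*', 1).cast('int'), F.lit(0)))
df.withColumn('c', F.when(df['b'] < 5, 'Lower than five')
                    .when(df['b'] > 5, 'Greater than five')
                    .otherwise('null')).show()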

Answer 2 (score: 0)

Extract the digits into an interim column, branch on it, then drop the helper column:

from pyspark.sql.functions import regexp_extract, when
myValues = [('PT5M',),('PT10M',),('PT11M',),('PT',)]
df = sqlContext.createDataFrame(myValues,['1'])
df.show()
+-----+
|    1|
+-----+
| PT5M|
|PT10M|
|PT11M|
|   PT|
+-----+

df = df.withColumn('interim', regexp_extract(df['1'], r'\d+', 0))
df = df.withColumn('2', when(df['interim'] < 5, 'Lower than five')
                        .when(df['interim'] > 5, 'Greater than five')
                        .when(df['interim'] == '', 'Lower than five')).drop('interim')
df.show()
+-----+-----------------+
|    1|                2|
+-----+-----------------+
| PT5M|             null|
|PT10M|Greater than five|
|PT11M|Greater than five|
|   PT|  Lower than five|
+-----+-----------------+
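
Note that PT5M comes out as null above, since 5 satisfies neither condition. If the boundary value should get its own label, one possible tweak (re-using the interim column before it is dropped; the 'Exactly five' label is purely illustrative):

df = df.withColumn('2', when(df['interim'] == '', 'Lower than five')
                        .when(df['interim'].cast('int') < 5, 'Lower than five')
                        .when(df['interim'].cast('int') > 5, 'Greater than five')
                        .otherwise('Exactly five')).drop('interim')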