I have a Spark dataframe containing the contents of a JSON file. I need to create a new column that is populated conditionally based on the contents of another column.
Say I have a column containing some numbers, and my new column will be populated depending on the value of that number (e.g. if the number in the first column is lower than 5, my new column is filled with the string 'Lower than five'; if the value is greater than 5, the new column is filled with 'Greater than five').
I know I can do something like that using the when function:
file.withColumn('newcolumn', \
F.when(file.oldColumn < 5, 'Lower than five') \
.when(file.oldColumn > 5, 'Greater than five')).show()
But what if 'oldColumn' doesn't contain just integers, but strings I need to extract an integer from?
For example 'PT5M', from which I need to extract the 5, and I also need to account for strings like 'PTM' that contain no number at all.
So far I've managed to extract the number from the first column using regexp_extract, but I'm struggling to set the null values to 0.
Example, where 1 is the original column and 2 is the new column:
+-------+-------------------+
|1 | 2 |
+-------+-------------------+
|PT5M | Lower than five |
|PT10M | Greater than five|
|PT11M | Greater than five|
+-------+-------------------+
Thanks for your help!
Answer 0 (score: 1)
Replace the non-digit characters with an empty string, then use when to set the column value:
file.withColumn('newcolumn', \
F.when(F.regexp_replace(file.oldColumn, '[^0-9]', '') == '', 'Lower than five')\
.when(F.regexp_replace(file.oldColumn, '[^0-9]', '').cast('int') < 5, 'Lower than five') \
.otherwise('Greater than five')).show()
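For completeness, a minimal self-contained sketch of this approach, assuming a toy dataframe built on the fly (the SparkSession setup and the sample data are illustrative, not part of the original answer):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
# 'PTM' has no digits at all, mirroring the edge case in the question
file = spark.createDataFrame([('PT5M',), ('PT10M',), ('PT11M',), ('PTM',)], ['oldColumn'])

digits = F.regexp_replace(file.oldColumn, '[^0-9]', '')  # strip everything but digits
file.withColumn('newcolumn',
                F.when(digits == '', 'Lower than five')            # no digits -> treated as 0
                 .when(digits.cast('int') < 5, 'Lower than five')
                 .otherwise('Greater than five')).show()

Note that a value of exactly 5 falls into otherwise here, so 'PT5M' comes out as 'Greater than five'; if 5 should count as 'Lower than five', as in the question's example table, use <= 5 in the second branch.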
Answer 1 (score: 0)
There are many ways to do this.
scala> val df = Seq("PT5M","PT10M","PT11M").toDF("a")
df: org.apache.spark.sql.DataFrame = [a: string]
scala> df.show(false)
+-----+
|a |
+-----+
|PT5M |
|PT10M|
|PT11M|
+-----+
scala> df.withColumn("b",regexp_extract('a,"""\D*(\d+)\D*""",1)).show(false)
+-----+---+
|a |b |
+-----+---+
|PT5M |5 |
|PT10M|10 |
|PT11M|11 |
+-----+---+
scala> df.withColumn("b",regexp_extract('a,"""\D*(\d+)\D*""",1)).withColumn("c", when('b.cast("int") < 5, "Lower than five").when('b.cast("int") > 5, "Greater than five").otherwise("null")).show(false)
+-----+---+-----------------+
|a |b |c |
+-----+---+-----------------+
|PT5M |5 |null |
|PT10M|10 |Greater than five|
|PT11M|11 |Greater than five|
+-----+---+-----------------+
scala>
If the value contains no digits and you want to default it to 0, you can use coalesce():
scala> val df = Seq("PT5M","PT10M","PT11M", "XXM").toDF("a")
df: org.apache.spark.sql.DataFrame = [a: string]
scala> df.show
+-----+
| a|
+-----+
| PT5M|
|PT10M|
|PT11M|
| XXM|
+-----+
scala> df.withColumn("b",coalesce(regexp_extract('a,"""\D*(\d+)\D*""",1).cast("int"),lit(0))).withColumn("c", when('b < 5, "Lower than five").when('b > 5, "Greater than five").otherwise("null")).show(false)
+-----+---+-----------------+
|a |b |c |
+-----+---+-----------------+
|PT5M |5 |null |
|PT10M|10 |Greater than five|
|PT11M|11 |Greater than five|
|XXM |0 |Lower than five |
+-----+---+-----------------+
scala>
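Since the question is in PySpark, here is a sketch of the same coalesce() idea translated into Python (the toy dataframe and the column names a, b, c are illustrative, mirroring the Scala session above):

import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([('PT5M',), ('PT10M',), ('PT11M',), ('XXM',)], ['a'])

# regexp_extract returns '' when nothing matches; casting '' to int yields null,
# and coalesce() then falls back to 0
b = F.coalesce(F.regexp_extract('a', r'\D*(\d+)\D*', 1).cast('int'), F.lit(0))
df.withColumn('b', b) \
  .withColumn('c', F.when(F.col('b') < 5, 'Lower than five')
                    .when(F.col('b') > 5, 'Greater than five')
                    .otherwise('null')) \
  .show(truncate=False)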
Answer 2 (score: 0)
from pyspark.sql.functions import regexp_extract, when
myValues = [('PT5M',),('PT10M',),('PT11M',),('PT',)]
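# sqlContext here is the legacy SQLContext (e.g. as predefined in the PySpark shell)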
df = sqlContext.createDataFrame(myValues,['1'])
df.show()
+-----+
| 1|
+-----+
| PT5M|
|PT10M|
|PT11M|
| PT|
+-----+
df = df.withColumn('interim', regexp_extract(df['1'], r'\d+', 0))
df = df.withColumn('2', when(df['interim'] < 5, 'Lower than five')
                        .when(df['interim'] > 5, 'Greater than five')
                        .when(df['interim'] == '', 'Lower than five')).drop('interim')
df.show()
+-----+-----------------+
| 1| 2|
+-----+-----------------+
| PT5M| null|
|PT10M|Greater than five|
|PT11M|Greater than five|
| PT| Lower than five|
+-----+-----------------+
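A side note on the output above: interim is a string, and after the implicit cast 5 is neither < 5 nor > 5, so 'PT5M' falls through every when branch and ends up null. A variant that casts explicitly and defaults rows without digits to 0 (the <= 5 boundary is an assumption matching the question's example table):

from pyspark.sql.functions import coalesce, lit, regexp_extract, when

# '' casts to null, so coalesce() defaults rows without digits to 0
num = coalesce(regexp_extract(df['1'], r'\d+', 0).cast('int'), lit(0))
df.withColumn('2', when(num <= 5, 'Lower than five')   # 5 counts as lower, per the question
                   .otherwise('Greater than five')).show()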