Add a new column named Download_Type, with these conditions:
If size < 100,000, then Download_Type = "Small"
If size > 100,000 and size < 1,000,000, then Download_Type = "Medium"
Otherwise Download_Type = "Large"
Input data: log_file.txt
Sample data:
"date","time","size","r_version","r_arch","r_os","package","version","country","ip_id"
"2012-10-01","00:30:13",35165,"2.15.1","i686","linux-gnu","quadprog","1.5-4","AU",1
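The sample row quotes its string fields, so a plain split on commas would leave the quote characters inside each value. The following is a minimal sketch of one way to strip them; `parseLine` is a hypothetical helper, not part of the question, and it assumes no field contains an embedded comma.

```scala
// Hypothetical helper: split one CSV line and drop surrounding double quotes.
// Assumption: no field contains an embedded comma.
def parseLine(line: String): Array[String] =
  line.split(",").map(_.trim.stripPrefix("\"").stripSuffix("\""))

// The sample row from the question, written as a Scala string literal.
val sample =
  "\"2012-10-01\",\"00:30:13\",35165,\"2.15.1\",\"i686\",\"linux-gnu\",\"quadprog\",\"1.5-4\",\"AU\",1"

val fields = parseLine(sample)
val sampleSize = fields(2).toDouble // numeric fields are unquoted in the sample
val country = fields(8)             // surrounding quotes removed
```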
I created a DataFrame using the following steps:
val file1 = sc.textFile("log_file.txt")
// Drop the header row before parsing
val header = file1.first
val logdata = file1.filter(x => x != header)
case class Log(date: String, time: String, size: Double, r_version: String, r_arch: String, r_os: String, packagee: String, version: String, country: String, ipr: Int)
val logfiledata = logdata.map(_.split(",")).map(p => Log(p(0), p(1), p(2).toDouble, p(3), p(4), p(5), p(6), p(7), p(8), p(9).toInt))
val logfiledf = logfiledata.toDF()
I isolated the size column and converted it to an array:
val size = logfiledf.select($"size")
val sizearr = size.collect.map(row => row.getDouble(0))
I wrote a function to populate the newly added column, and tried to fill the Download_Type column with it:
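def exp1(size: Array[Double]) = {
  // Note: this appends every label to a single String instead of producing one label per row
  var result = ""
  for (i <- 0 to (size.length - 1)) {
    if (size(i) < 100000) result += "small"
    else if (size(i) >= 100000 && size(i) < 1000000) result += "medium"
    else result += "large"
  }
  result
}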
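The new column needs one label per row, whereas accumulating into a single String yields one value for the whole array. A minimal sketch of a per-element version follows; `downloadType` is a hypothetical name, and treating exactly 100,000 as "medium" is an assumption taken from the `>=` test in the function above.

```scala
// Hypothetical helper: map one size to its download-type label.
// Assumption: exactly 100,000 counts as "medium", matching the `>=` test above.
def downloadType(size: Double): String =
  if (size < 100000) "small"
  else if (size < 1000000) "medium"
  else "large"

// Illustrative sizes, one per bucket.
val sizes = Array(35165.0, 150000.0, 2500000.0)

// One label per input element, in the same order.
val labels = sizes.map(downloadType)
```

This still works on a collected array on the driver, so it is only useful for small data; it does not by itself attach the labels to the DataFrame.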
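How can I populate a new column named Download_Type using the following conditions:
If size < 100,000, then Download_Type = "Small"
If size > 100,000 and size < 1,000,000, then Download_Type = "Medium"
Otherwise Download_Type = "Large"?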
Answer 0 (score: 3)
You can simply apply when/otherwise to your loaded DataFrame logfiledf using withColumn, as shown below:
import org.apache.spark.sql.functions._
import spark.implicits._
val logfiledf = Seq(
("2012-10-01","00:30:13",35165.0,"2.15.1","i686","linux-gnu","quadprog","1.5-4","AU",1),
("2012-10-02","00:40:14",150000.0,"2.15.1","i686","linux-gnu","quadprog","1.5-4","US",2)
).toDF("date","time","size","r_version","r_arch","r_os","package","version","country","ip_id")
logfiledf.withColumn("download_type", when($"size" < 100000, "Small").otherwise(
    when($"size" < 1000000, "Medium").otherwise("Large")
  )
).show
// +----------+--------+--------+---------+------+---------+--------+-------+-------+-----+-------------+
// |      date|    time|    size|r_version|r_arch|     r_os| package|version|country|ip_id|download_type|
// +----------+--------+--------+---------+------+---------+--------+-------+-------+-----+-------------+
// |2012-10-01|00:30:13| 35165.0|   2.15.1|  i686|linux-gnu|quadprog|  1.5-4|     AU|    1|        Small|
// |2012-10-02|00:40:14|150000.0|   2.15.1|  i686|linux-gnu|quadprog|  1.5-4|     US|    2|       Medium|
// +----------+--------+--------+---------+------+---------+--------+-------+-------+-----+-------------+
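The nested when/otherwise can also be written as a chain, since a Column built by when accepts further .when calls before the final .otherwise. The sketch below is an equivalent variant, not the answer's own code; the local SparkSession, the app name, and the three-row sample with a single size column are assumptions for illustration (in spark-shell, `spark` already exists).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.when

// Assumption: a local session for illustration only.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("download-type-sketch")
  .getOrCreate()
import spark.implicits._

// Illustrative data: one size per bucket.
val df = Seq(35165.0, 150000.0, 2500000.0).toDF("size")

// Conditions are checked in order; the first one that matches decides the label.
val labeled = df.withColumn(
  "download_type",
  when($"size" < 100000, "Small")
    .when($"size" < 1000000, "Medium")
    .otherwise("Large")
)

labeled.show()
```

Because the conditions are evaluated in order, the second branch does not need to repeat the lower bound.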