Restating the question with more detail.
I have a DataFrame, dailyshow, whose schema is:
scala> dailyshow.printSchema
root
|-- year: integer (nullable = true)
|-- occupation: string (nullable = true)
|-- showdate: string (nullable = true)
|-- group: string (nullable = true)
|-- guest: string (nullable = true)
Sample data:
scala> dailyshow.show(5)
+----+------------------+---------+------+----------------+
|year| occupation| showdate| group| guest|
+----+------------------+---------+------+----------------+
|1999| actor|1/11/1999|Acting| Michael J. Fox|
|1999| Comedian|1/12/1999|Comedy| Sandra Bernhard|
|1999|television actress|1/13/1999|Acting| Tracey Ullman|
|1999| film actress|1/14/1999|Acting|Gillian Anderson|
|1999| actor|1/18/1999|Acting|David Alan Grier|
+----+------------------+---------+------+----------------+
The following code does the conversion and returns the top 5 occupations between 01/11/1999 and 06/11/1999:
scala> dailyshow.
withColumn("showdate",to_date(unix_timestamp(col("showdate"),"MM/dd/yyyy").
cast("timestamp"))).
where((col("showdate") >= "1999-01-11") and (col("showdate") <= "1999-06-11")).
groupBy(col("occupation")).agg(count("*").alias("count")).
orderBy(desc("count")).
limit(5).show
+------------------+-----+
| occupation|count|
+------------------+-----+
| actor| 29|
| actress| 20|
| comedian| 4|
|television actress| 3|
| stand-up comedian| 2|
+------------------+-----+
My question is: how do I write the equivalent code using RDDs and get the same result?
scala> dailyshow.first
res12: org.apache.spark.sql.Row = [1999,actor,1/11/1999,Acting,Michael J. Fox]
I use SimpleDateFormat to parse the date strings in the DataFrame. Here is the code:
val format = new java.text.SimpleDateFormat("MM/dd/yyyy")
dailyshow.
map(x => x.mkString(",")).
map(x => x.split(",")).
map(x => format.parse(x(2))).first // returns Mon Jan 11 00:00:00 PST 1999
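For what it's worth, a minimal sketch (assuming dailyshow is the DataFrame shown above; the val name parsed is just a placeholder): the fields can be read off each Row directly, which avoids the mkString/split round trip:
val format = new java.text.SimpleDateFormat("MM/dd/yyyy")
// read each field straight from the Row instead of joining and re-splitting a string
val parsed = dailyshow.rdd.map { r =>
  (r.getInt(0), r.getString(1), format.parse(r.getString(2)), r.getString(3), r.getString(4))
}
parsed.first // same first record, with showdate parsed into a java.util.Date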
Answer 0 (score: 0)
If I were you, I would use Spark's built-in date functions defined in org.apache.spark.sql.functions instead of doing it by hand with SimpleDateFormat and map. Using the DataFrame functions is simpler, more idiomatic, less error-prone, and performs better.
Let's assume you have a DataFrame df with a column named dateString that holds date strings in the format MM/dd/yyyy.
Let's also assume you want to convert it to a date, extract the year from it, and then display it in the format yyyy.MMMMM.dd.
What you can do is:
val dfWithDate = df.withColumn("date", to_date(unix_timestamp($"dateString", "MM/dd/yyyy").cast("timestamp")))
// (on Spark 2.2+ this can be written as to_date($"dateString", "MM/dd/yyyy"))
val dfWithYear = dfWithDate.withColumn("year", year($"date"))
val dfWithOutput = dfWithYear.withColumn("dateOutput", date_format($"date", "yyyy.MMMMM.dd"))
Now the year column will contain the year, and the dateOutput column will contain the string representation in your format.
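A self-contained sketch to try this out (assumptions: Spark 2.x with a SparkSession named spark; df, dateString and the derived column names are just the placeholders from above):
import org.apache.spark.sql.functions._
import spark.implicits._

// a throwaway DataFrame holding the assumed MM/dd/yyyy strings
val df = Seq("01/11/1999", "06/11/1999").toDF("dateString")

val dfWithOutput = df
  .withColumn("date", to_date(unix_timestamp($"dateString", "MM/dd/yyyy").cast("timestamp")))
  .withColumn("year", year($"date"))
  .withColumn("dateOutput", date_format($"date", "yyyy.MMMMM.dd"))

dfWithOutput.show() // date is a DateType, year an integer, dateOutput the formatted string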
Answer 1 (score: 0)
There were a lot of deprecation warnings while writing this :D
So we have this data in an RDD:
val rdd = sc.parallelize(Array(
  Array("1999", "actor", "1/11/1999", "Acting", " Michael J. Fox"),
  Array("1999", "Comedian", "1/12/1999", "Comedy", " Sandra Bernhard"),
  Array("1999", "television actress", "1/13/1999", "Acting", "Tracey Ullman"),
  Array("1999", "film actress", "1/14/1999", "Acting", "Gillian Anderson"),
  Array("1999", "actor", "1/18/1999", "Acting", "David Alan Grier")))
Then, per your question, we filter on the date:
// `format` is the SimpleDateFormat("MM/dd/yyyy") defined in the question
// (the deprecated java.util.Date(String) constructor is where the warnings come from)
val filtered = rdd.filter { x =>
  format.parse(x(2)).after(new java.util.Date("01/10/1999")) &&
  format.parse(x(2)).before(new java.util.Date("01/14/1999"))
}
Then we get this:
Array[Array[String]] = Array(
Array(1999, actor, 1/11/1999, Acting, " Michael J. Fox"),
Array(1999, Comedian, 1/12/1999, Comedy, " Sandra Bernhard"),
Array(1999, television actress, 1/13/1999, Acting, Tracey Ullman))
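If the deprecation warnings bother you, the bounds can be parsed with the same MM/dd/yyyy format object instead of the deprecated java.util.Date(String) constructor. A sketch (lower and upper are placeholder names) that produces the same filtered RDD:
val lower = format.parse("01/10/1999")
val upper = format.parse("01/14/1999")
val filtered = rdd.filter { x =>
  val d = format.parse(x(2))
  d.after(lower) && d.before(upper)
}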
Then we key on the second element (the occupation) and count the occurrences:
// key each row by its occupation, give each row a count of 1, and sum the counts
filtered.keyBy(x => x(1)).mapValues(_ => 1).reduceByKey(_ + _).collect
If everything went well, you should get:
Array[(String, Int)] = Array((television actress,1), (Comedian,1), (actor,1))
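To get all the way to the result in the question (the top 5 occupations between 01/11/1999 and 06/11/1999, ordered by count), a sketch under the same assumptions — dailyshow as the DataFrame from the question, a MM/dd/yyyy SimpleDateFormat, and top5/lower/upper as placeholder names — might look like this:
val format = new java.text.SimpleDateFormat("MM/dd/yyyy")
val lower = format.parse("01/11/1999")
val upper = format.parse("06/11/1999")

val top5 = dailyshow.rdd
  .map(r => (r.getString(1), format.parse(r.getString(2))))      // (occupation, showdate)
  .filter { case (_, d) => !d.before(lower) && !d.after(upper) } // inclusive bounds, as in the DataFrame version
  .map { case (occupation, _) => (occupation, 1) }
  .reduceByKey(_ + _)                                            // count per occupation
  .sortBy(_._2, ascending = false)                               // highest counts first
  .take(5)                                                       // should give something like Array((actor,29), (actress,20), ...)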