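I am new to R and Spark, but I am trying to build a scalable R application that detects which user queries are increasing or decreasing in frequency.
I have a Spark DataFrame with data in the following format: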
+-------+------------------------+-------------------------+
| user | query | query_time |
+-------+------------------------+-------------------------+
| user1 | Hp tablet | 2011-08-21T11:07:57.346 |
| user2 | Hp tablet | 2011-08-21T22:22:32.599 |
| user3 | Hp tablet | 2011-08-22T19:08:57.412 |
| user4 | hp laptop | 2011-09-05T15:33:31.489 |
| user5 | Samsung LCD 550 | 2011-09-01T10:28:33.547 |
| user6 | memory stick | 2011-09-06T17:15:42.852 |
| user7 | Castle | 2011-08-28T22:06:37.618 |
+-------+------------------------+-------------------------+
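The dataset has hundreds of thousands of rows. I need some way to visualize that, for example, "Hp tablet" is trending upward.
I have looked at a few libraries that might help me do this (for example Breakout Detection, Anomaly Detection, and this question), but I do not know whether they work well with Spark. If they do, I could not find any example of how to program it.
I am using R version 3.4.0 and SparkR version 2.1.0, running in a Zeppelin notebook.
Does anyone have any ideas? I am also open to any other approach. Thanks!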
Answer 0 (score: 0)
%r
# Create a SparkR DataFrame from a few sample rows
df_query <- createDataFrame(sqlContext, data.frame(
  query = c("Hp tablet", "Hp tablet", "Hp tablet", "hp laptop", "Samsung LCD 550 "),
  query_time = c("2011-08-21T11:07:57.346", "2011-08-21T22:22:32.599", "2011-08-22T19:08:57.412",
                 "2011-09-05T15:33:31.489", "2011-09-01T10:28:33.547"),
  stringsAsFactors = FALSE))
# Replace the "T" separator with a space so the string matches the timestamp format "yyyy-MM-dd HH:mm:ss"
df_query_1 <- select(df_query, df_query$query, regexp_replace(df_query$query_time, '(T)', ' '))
# Print the result (long values are truncated in the output)
showDF(df_query_1)
+----------------+--------------------------------+
| query|regexp_replace(query_time,(T), )|
+----------------+--------------------------------+
| Hp tablet| 2011-08-21 11:07:...|
| Hp tablet| 2011-08-21 22:22:...|
| Hp tablet| 2011-08-22 19:08:...|
| hp laptop| 2011-09-05 15:33:...|
|Samsung LCD 550 | 2011-09-01 10:28:...|
+----------------+--------------------------------+
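# Give the derived column a readable name
df_query_1 <- rename(df_query_1, query_time = df_query_1[[2]])
# Register a temporary view (createOrReplaceTempView replaces the deprecated registerTempTable in SparkR 2.x)
createOrReplaceTempView(df_query_1, "temp_query")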
Visualize from the temporary table created above:
%sql
select * from temp_query
Answer 1 (score: 0)
To use library(AnomalyDetection), the data should be in this format:
head(raw_data)
timestamp count
14393 1980-10-05 13:53:00 149.801
14394 1980-10-05 13:54:00 151.492
14395 1980-10-05 13:55:00 151.724
14396 1980-10-05 13:56:00 153.776
14397 1980-10-05 13:57:00 150.481
14398 1980-10-05 13:58:00 146.638
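If your query_time is the X axis, how will you define the Y axis, which has to be numeric? In 2011-08-21T11:07:57.346, the part after the T is the time of day, 11:07:57.346. More clarification is needed.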
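One possible reading of the Y-axis question, offered only as a sketch: take the number of times a query was issued per day as the numeric value. The base R example below uses the sample rows from the question; the names queries, query_norm, daily_counts, and hp_tablet are illustrative, and daily bucketing and lower-casing the query text are assumptions rather than anything the question specifies. On the full dataset the grouping would presumably be done in Spark first, with only the small aggregated result collected into R.

```r
# Sample rows copied from the question (user column omitted).
queries <- data.frame(
  query = c("Hp tablet", "Hp tablet", "Hp tablet", "hp laptop",
            "Samsung LCD 550", "memory stick", "Castle"),
  query_time = c("2011-08-21T11:07:57.346", "2011-08-21T22:22:32.599",
                 "2011-08-22T19:08:57.412", "2011-09-05T15:33:31.489",
                 "2011-09-01T10:28:33.547", "2011-09-06T17:15:42.852",
                 "2011-08-28T22:06:37.618"),
  stringsAsFactors = FALSE
)

# Assumption: case and surrounding whitespace do not distinguish queries.
queries$query_norm <- tolower(trimws(queries$query))

# The first 10 characters of the timestamp string are the calendar date.
queries$day <- as.Date(substr(queries$query_time, 1, 10))

# Count queries per term per day; this count is the numeric Y value.
daily_counts <- aggregate(
  list(count = rep(1L, nrow(queries))),
  by = list(query = queries$query_norm, day = queries$day),
  FUN = sum
)

# Series for a single term, shaped as (timestamp, count).
hp_tablet <- daily_counts[daily_counts$query == "hp tablet", c("day", "count")]
hp_tablet <- hp_tablet[order(hp_tablet$day), ]
names(hp_tablet) <- c("timestamp", "count")
hp_tablet$timestamp <- as.POSIXct(format(hp_tablet$timestamp), tz = "UTC")
print(hp_tablet)
```

With only seven sample rows the resulting series has two points, so it shows the shape of the data rather than a usable trend. A real series would need many more days, and days on which a term was never queried would likely have to be filled in with a count of zero.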