I am new to PySpark; I usually work with pandas, and I have been iterating row by row over a column in PySpark. My dataset looks like the following.
In pandas the DataFrame also has a given index, but in Spark it does not. In pandas:
+-------------------+--------------------+--------+-----+
|           DateTime|           user_name|keyboard|mouse|
+-------------------+--------------------+--------+-----+
|2019-10-21 08:35:01|prathameshsalap@g...|   333.0|658.0|
|2019-10-21 08:35:01|vaishusawant143@g...|   447.5|  0.0|
|2019-10-21 08:35:01|     you@example.com|     0.5|  1.0|
|2019-10-21 08:40:01|     you@example.com|     0.0|  0.0|
|2019-10-21 08:40:01|prathameshsalap@g...|   227.0|366.0|
|2019-10-21 08:40:02|vaishusawant143@g...|   472.0|  0.0|
|2019-10-21 08:45:01|     you@example.com|     0.0|  0.0|
|2019-10-21 08:45:01|prathameshsalap@g...|    35.0|458.0|
|2019-10-21 08:45:01|vaishusawant143@g...|  1659.5|  0.0|
|2019-10-21 08:50:01|     you@example.com|     0.0|  0.0|
+-------------------+--------------------+--------+-----+
How can the same thing be done in Spark?
Data is generated for each user every 5 minutes (for example, if a user starts at 8:30:01, the next log is generated at 8:35:01). For the second part of the question, I want to find the idle time for each user. Idle time is calculated as follows: if the user does not move the mouse or use the keyboard for the next 30 minutes (the 1500 in the code below), I add that to the user's idle time.
After converting the dictionary values into a DataFrame, my expected output is as follows:
## pandas
import datetime
import pandas as pd

usr_log = pd.read_csv("data.csv")
unique_users = usr_log.user_name.unique()
usr_log.sort_values(by='DateTime', inplace=True)
users_new_data = dict()

for user in unique_users:
    ## track each user's first timestamp and accumulated idle time
    users_new_data[user] = {'start_time': None,
                            'idle_time': datetime.timedelta(0)}
    count_idle = 0
    ## first part of the question: find the user's start time
    for index in usr_log.index:
        if user == usr_log['user_name'][index]:
            if users_new_data[user]['start_time'] is None:
                users_new_data[user]['start_time'] = usr_log['DateTime'][index]
            ## second part of the question: accumulate idle time
            if usr_log['keyboard'][index] == 0 and usr_log['mouse'][index] == 0:
                count_idle += 1
            else:
                count_idle = 0
            if count_idle >= 5:
                if count_idle == 5:
                    ## the fifth consecutive idle log adds the whole 1500-second block
                    users_new_data[user]['idle_time'] += datetime.timedelta(0, 1500)
                else:
                    ## every further idle log adds one 300-second interval
                    users_new_data[user]['idle_time'] += datetime.timedelta(0, 300)
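For reference, a minimal sketch (not part of the original question) of turning the filled-in dictionary into a DataFrame once the loop above has run:
summary = pd.DataFrame.from_dict(users_new_data, orient='index')  ## dict keys become the index
summary.index.name = 'user_name'
print(summary.reset_index())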
Answer 0 (score: 3)
If you want to find the first timestamp each user has, you can first simplify it in pandas:
usr_log[['user_name','DateTime']].groupby(['user_name']).min()
It will be very similar in Spark:
from pyspark.sql.functions import min  ## needed so min() is the Spark aggregate, not the builtin

urs_log = sparkSession.read.csv(...)
urs_log.groupBy("user_name").agg(min("DateTime"))
You only need to rename the DateTime column to whatever you want, and try not to use for loops in pandas. In Spark you have a distributed collection, so a for loop is not possible: you apply transformations to columns and never apply logic to a single row of data, as in the sketch below.
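As an illustration only (this snippet is not from either answer), the idle flag from the question can be written as a column transformation; the file name data.csv is taken from the question:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("data.csv", header=True, inferSchema=True)

## derive an idle flag for every row at once instead of counting in a Python loop
df = df.withColumn(
    "is_idle",
    F.when((F.col("keyboard") == 0) & (F.col("mouse") == 0), 1).otherwise(0),
)
df.show()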
Answer 1 (score: 1)
Here is a solution for the same:
from pyspark.sql import functions as F
from pyspark.sql.functions import col, when, min, sum  ## used below; min and sum shadow the builtins here

df = (spark.read.format("csv").option("sep", ",").option("header", "true").load("data.csv"))
df.show()
+-------------------+--------------------+--------+-----+
|           DateTime|           user_name|keyboard|mouse|
+-------------------+--------------------+--------+-----+
|2019-10-21 08:35:01|prathameshsalap@g...|   333.0|658.0|
|2019-10-21 08:35:01|vaishusawant143@g...|   447.5|  0.0|
|2019-10-21 08:35:01|     you@example.com|     0.5|  1.0|
|2019-10-21 08:40:01|     you@example.com|     0.0|  0.0|
|2019-10-21 08:40:01|prathameshsalap@g...|   227.0|366.0|
|2019-10-21 08:40:02|vaishusawant143@g...|   472.0|  0.0|
|2019-10-21 08:45:01|     you@example.com|     0.0|  0.0|
|2019-10-21 08:45:01|prathameshsalap@g...|    35.0|458.0|
|2019-10-21 08:45:01|vaishusawant143@g...|  1659.5|  0.0|
|2019-10-21 08:50:01|     you@example.com|     0.0|  0.0|
+-------------------+--------------------+--------+-----+
df1 = df.groupBy("user_name").agg(min("DateTime"))
df1.show()
+--------------------+-------------------+
|           user_name|      min(DateTime)|
+--------------------+-------------------+
|prathameshsalap@g...|2019-10-21 08:35:01|
|vaishusawant143@g...|2019-10-21 08:35:01|
|     you@example.com|2019-10-21 08:35:01|
+--------------------+-------------------+
For the other part:
## flag a log as idle (count = 1) when there is no keyboard or mouse activity
df1 = df.withColumn("count", when(((col("keyboard")==0.0) & (col("mouse")==0.0)), 1).otherwise(0))
## an active log contributes 300 seconds; an idle one counts as a 1500-second block
df2 = df1.withColumn("Idle_Sec", when((col("count")==0), 300).otherwise(1500))
df2.show()
+-------------------+--------------------+--------+-----+-----+--------+
|           DateTime|           user_name|keyboard|mouse|count|Idle_Sec|
+-------------------+--------------------+--------+-----+-----+--------+
|2019-10-21 08:35:01|prathameshsalap@g...|   333.0|658.0|    0|     300|
|2019-10-21 08:35:01|vaishusawant143@g...|   447.5|  0.0|    0|     300|
|2019-10-21 08:35:01|     you@example.com|     0.5|  1.0|    0|     300|
|2019-10-21 08:40:01|     you@example.com|     0.0|  0.0|    1|    1500|
|2019-10-21 08:40:01|prathameshsalap@g...|   227.0|366.0|    0|     300|
|2019-10-21 08:40:02|vaishusawant143@g...|   472.0|  0.0|    0|     300|
|2019-10-21 08:45:01|     you@example.com|     0.0|  0.0|    1|    1500|
|2019-10-21 08:45:01|prathameshsalap@g...|    35.0|458.0|    0|     300|
|2019-10-21 08:45:01|vaishusawant143@g...|  1659.5|  0.0|    0|     300|
|2019-10-21 08:50:01|     you@example.com|     0.0|  0.0|    1|    1500|
+-------------------+--------------------+--------+-----+-----+--------+
df3 = df2.groupBy("user_name").agg(min("DateTime").alias("start_time"), sum("Idle_Sec").alias("Sum_Idle_Sec"))
df3.show()
+--------------------+-------------------+------------+
|           user_name|         start_time|Sum_Idle_Sec|
+--------------------+-------------------+------------+
|prathameshsalap@g...|2019-10-21 08:35:01|         900|
|vaishusawant143@g...|2019-10-21 08:35:01|         900|
|     you@example.com|2019-10-21 08:35:01|        4800|
+--------------------+-------------------+------------+
df3.withColumn("Idle_time",(F.unix_timestamp("start_time") + col("Sum_Idle_Sec")).cast('timestamp')).show()
+--------------------+-------------------+------------+-------------------+
|           user_name|         start_time|Sum_Idle_Sec|          Idle_time|
+--------------------+-------------------+------------+-------------------+
|prathameshsalap@g...|2019-10-21 08:35:01|         900|2019-10-21 08:50:01|
|vaishusawant143@g...|2019-10-21 08:35:01|         900|2019-10-21 08:50:01|
|     you@example.com|2019-10-21 08:35:01|        4800|2019-10-21 09:55:01|
+--------------------+-------------------+------------+-------------------+
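As a quick sanity check (a sketch, not part of the original answer), the Idle_time column is simply start_time plus Sum_Idle_Sec seconds:
import datetime

start = datetime.datetime(2019, 10, 21, 8, 35, 1)
print(start + datetime.timedelta(seconds=4800))  ## 2019-10-21 09:55:01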
Answer 2 (score: 0)
You should do it as in the following example. The "thing to do" can be any function you define.
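The example code for this answer did not survive extraction; here is a minimal sketch of the idea, assuming the "thing to do" is an arbitrary per-row function wrapped in a UDF (the names thing_to_do and activity are hypothetical):
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

def thing_to_do(keyboard, mouse):
    ## hypothetical stand-in for whatever per-row logic you need
    return keyboard + mouse

thing_to_do_udf = F.udf(thing_to_do, DoubleType())
df = df.withColumn("activity", thing_to_do_udf(F.col("keyboard").cast("double"), F.col("mouse").cast("double")))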