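I have a dataset with a single column containing every click made on a website. I want to find the patterns that recur throughout the data, which has more than 1 million rows and 17,000 distinct patterns. I also want to know the average time spent on each click within each pattern. I have written SAS code that assigns each pattern to a group and computes the time difference between successive clicks, but I am not getting the output I want. In addition, if a pattern contains the keyword "one" several times in a row, I want those occurrences merged and treated as a single keyword "one".
For example, my code produces the following output: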
Clicks Group Time(Seconds)
A 1 6
B 1 2
C 1 4
one 1 0
D 2 12
E 2 5
F 2 0
A 3 9
B 3 6
C 3 7
one 3 6
one 3 0
H 4 8
I 4 9
J 4 0
Expected output:
Clicks Average Time Count
ABCone A-7.5,B-4,C-0,one-2 2
DEF D-12,E-5,F-0 1
HIJ H-8,I-9,J-0 1
Answer 0 (score: 2)
The following reproduces your expected output.
library(dplyr)

df %>%
    group_by(Clicks) %>%                # Average the time per click
    mutate(`Average Time` = paste(sprintf("%s-%2.1f", Clicks, mean(Time.Seconds.)))) %>%
    group_by(Group) %>%                 # Paste entries per group
    mutate(
        Clicks = paste(Clicks, collapse = ""),
        `Average Time` = paste(`Average Time`, collapse = ",")) %>%
    slice(1) %>%
    ungroup() %>%                       # Ungroup to prepare for counting
    select(-Group, -Time.Seconds.) %>%
    count(Clicks, `Average Time`)
## A tibble: 3 x 3
# Clicks `Average Time` n
# <chr> <chr> <int>
#1 ABC A-7.5,B-4.0,C-0.0 2
#2 DEF D-12.0,E-5.0,F-0.0 1
#3 HIJ H-8.0,I-9.0,J-0.0 1
This is a fairly straightforward matter of grouping, regrouping, and pasting with paste.
Sample data:
df <- read.table(text =
"Clicks Group Time(Seconds)
A 1 6
B 1 2
C 1 0
D 2 12
E 2 5
F 2 0
A 3 9
B 3 6
C 3 0
H 4 8
I 4 9
J 4 0 ", header = T)
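The pipeline above leans on two base R building blocks: sprintf to format one "click-mean" label, and paste with collapse to join the entries of a group. A small illustrative sketch, with hard-coded values that are assumptions for demonstration rather than computed results:

```r
# Illustration only: hard-coded clicks and means
clicks <- c("A", "B", "C")
means  <- c(7.5, 4, 0)

# One "click-mean" label per click, formatted with one decimal place
labels <- sprintf("%s-%2.1f", clicks, means)
print(labels)
# "A-7.5" "B-4.0" "C-0.0"

# Join the clicks of a group into a pattern, and the labels into a summary
pattern <- paste(clicks, collapse = "")
summary_text <- paste(labels, collapse = ",")
print(pattern)       # "ABC"
print(summary_text)  # "A-7.5,B-4.0,C-0.0"
```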
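For the updated data (note that your expected output contains an error in the mean for C: its times are 4 and 7, so the mean is 5.5, not 0):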
df %>%
    group_by(Clicks) %>%                # Do the averaging
    mutate(`Average Time` = paste(sprintf("%s-%2.1f", Clicks, mean(Time.Seconds.)))) %>%
    group_by(Clicks, Group) %>%         # Deal with duplicates per Clicks+Group
    slice(1) %>%
    group_by(Group) %>%                 # Paste entries
    mutate(
        Clicks = paste(Clicks, collapse = ""),
        `Average Time` = paste(`Average Time`, collapse = ",")) %>%
    slice(1) %>%
    ungroup() %>%                       # Ungroup to prepare for counting
    select(-Group, -Time.Seconds.) %>%
    count(Clicks, `Average Time`)
## A tibble: 3 x 3
# Clicks `Average Time` n
# <chr> <chr> <int>
#1 ABCone A-7.5,B-4.0,C-5.5,one-2.0 2
#2 DEF D-12.0,E-5.0,F-0.0 1
#3 HIJ H-8.0,I-9.0,J-0.0 1
And the updated data:
df <- read.table(text =
"Clicks Group Time(Seconds)
A 1 6
B 1 2
C 1 4
one 1 0
D 2 12
E 2 5
F 2 0
A 3 9
B 3 6
C 3 7
one 3 6
one 3 0
H 4 8
I 4 9
J 4 0 ", header = T)
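Keeping the first row per Clicks and Group drops every repeat of a click within a group, not only consecutive ones. If only consecutive repeats should be merged, as the question describes for "one", run-length encoding is one option. The following is a minimal base R sketch, assuming the time column is named Time and the rows are already in click order within each group; collapse_runs is a hypothetical helper name, not part of any package:

```r
df <- read.table(text = "Clicks Group Time
A 1 6
B 1 2
C 1 4
one 1 0
D 2 12
E 2 5
F 2 0
A 3 9
B 3 6
C 3 7
one 3 6
one 3 0
H 4 8
I 4 9
J 4 0", header = TRUE, stringsAsFactors = FALSE)

# Hypothetical helper: merge runs of identical consecutive clicks, then
# concatenate what remains into a single pattern string
collapse_runs <- function(clicks) {
  paste(rle(clicks)$values, collapse = "")
}

# One pattern per group; group 3 (A, B, C, one, one) becomes "ABCone"
patterns <- tapply(df$Clicks, df$Group, collapse_runs)

# How often each pattern occurs across groups
pattern_counts <- table(patterns)
print(pattern_counts)
```

Because rle only merges adjacent equal values, a sequence such as one, A, one would keep both occurrences of "one".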
Answer 1 (score: 1)
You would get more help if you posted the data in a form that we can copy, paste, and work with.
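Edit: Someone has edited the original post to make the data easier to parse. I was able to get close with dplyr, but the "Average Time" column is not quite what you wanted.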
Answer 2 (score: 1)
In SAS, this can be done with a Proc MEANS step and a CLASS statement.
Sample code:
data have;
  input Clicks $ Group Time;
  datalines;
A 1 6
B 1 2
C 1 0
D 2 12
E 2 5
F 2 0
A 3 9
B 3 6
C 3 0
H 4 8
I 4 9
J 4 0
run;
* presume no clicks value contains pipe (|) character;
data have2 / view=have2;
  length pattern $30;
  pattern = '|'; * prepare for bounded token search via INDEX();
  do _n_ = 1 by 1 until (last.group);
    set have;
    by group;
    * use this line if all items in group are known to be distinct ;
    * pattern = cats(pattern,clicks);
    * track observed clicks by searching the growing pattern of the group;
    bounded_token = cats( '|', clicks, '|' );
    if index (pattern, trim(bounded_token) ) = 0 then
      pattern = cats (pattern, clicks, '|');
  end;
  if length (pattern) = lengthc(pattern) then do;
    put 'WARNING: pattern needs more length';
    stop;
  end;
  * remove token bounders;
  pattern = compress(pattern,'|');
  do _n_ = 1 to _n_;
    set have;
    output;
  end;
run;

proc means noprint data=have2;
  class pattern clicks;
  var time;
  ways 2;
  output out=have_means mean=mean ;
run;

data want (keep=pattern time_summary _freq_);
  do until (last.pattern);
    set have_means;
    by pattern;
    length time_summary $100;
    time_summary = catx(',',time_summary,catx('-',clicks,mean));
  end;
run;
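For readers comparing this with the R answer, the same two-step idea (derive a pattern per group, then average by pattern and click) can be sketched in base R. This is a rough analogue under stated assumptions, not a line-by-line translation of the SAS code: the names Pattern, have_means, and want are chosen here for illustration, and the summary lists clicks in alphabetical order, which happens to match click order in this sample.

```r
have <- read.table(text = "Clicks Group Time
A 1 6
B 1 2
C 1 0
D 2 12
E 2 5
F 2 0
A 3 9
B 3 6
C 3 0
H 4 8
I 4 9
J 4 0", header = TRUE, stringsAsFactors = FALSE)

# Pattern per group: distinct clicks in order of first appearance
have$Pattern <- ave(have$Clicks, have$Group,
                    FUN = function(x) paste(unique(x), collapse = ""))

# Mean time for every Pattern x Clicks combination
have_means <- aggregate(Time ~ Pattern + Clicks, data = have, FUN = mean)
have_means <- have_means[order(have_means$Pattern, have_means$Clicks), ]

# One summary string per pattern, built from "click-mean" labels
labels <- paste(have_means$Clicks, have_means$Time, sep = "-")
want <- tapply(labels, have_means$Pattern, paste, collapse = ",")
print(want)
```

Averaging by pattern and click keeps the means of a click separate when it appears in more than one distinct pattern, whereas averaging by click alone pools them.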