Text processing / regular expressions in R?

Time: 2013-09-04 12:26:58

Tags: regex r

I have a data frame with the following columns.

user_id: g17165fd2e0bba9a449857645bb6g3a9a7ef8e6c 

time: 1361553741 

url: a string containing a URL.

Sometimes the URL has the form https://SOMETHING.COM/NAME/forum/thread?thread_id=51.
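
Since only some URLs carry a thread_id, the extraction step probably needs to tell matching URLs apart from the rest. Here is a minimal sketch, assuming the ID is numeric and appears as a query parameter. The function name extract_thread_id and the second and third example URLs are made up for illustration.

```r
# Example URLs: the first form is from the question, the other two are hypothetical
urls <- c(
  "https://SOMETHING.COM/NAME/forum/thread?thread_id=51",
  "https://SOMETHING.COM/NAME/forum/thread?page=2&thread_id=7&sort=asc",
  "https://SOMETHING.COM/NAME/about"
)

# Return the numeric thread_id as a string, or NA when the URL has none
extract_thread_id <- function(url) {
  has_id <- grepl("[?&]thread_id=[0-9]+", url)
  id <- sub(".*[?&]thread_id=([0-9]+).*", "\\1", url)
  ifelse(has_id, id, NA_character_)
}

ids <- extract_thread_id(urls)
ids
```

Rows that come back as NA can then be dropped before counting. A plain gsub("^.*thread_id=", "", url) leaves non-matching URLs unchanged, so they would otherwise be counted as thread IDs of their own.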

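I want to create a data frame that tells me, for each user and between times x and y, how many times he or she visited each thread_id. So the number of observations equals the number of users, and the number of columns equals the number of threads + 1 (total views).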
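
To make the target shape concrete, here is a small base-R sketch on toy data. The data frame visits, its values, and the bounds x and y are all assumptions for illustration, and thread_id is taken to be extracted from the URL already.

```r
# Hypothetical toy data in the shape described above
visits <- data.frame(
  user_id   = c("a", "a", "b", "b", "b", "c"),
  time      = c(10, 20, 15, 25, 99, 30),
  thread_id = c("51", "51", "7", "51", "7", "7"),
  stringsAsFactors = FALSE
)

# Assumed bounds of the time window (inclusive on both ends)
x <- 10
y <- 30
in_window <- visits[visits$time >= x & visits$time <= y, ]

# Cross-tabulate users against threads: one row per user, one column per thread
counts <- table(user_id = in_window$user_id, thread_id = in_window$thread_id)
wide <- as.data.frame.matrix(counts)

# Append the extra column holding each user's total views
wide$totalview <- rowSums(wide)
wide
```

The result has one row per user and one column per thread plus totalview. Users with no visits inside the window do not appear at all, which may or may not be what is wanted.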

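The dataset is very large, so this has to be done in parallel.

What is the best way to do this in R?

Thanks a lot!

PS: The code @David wrote generates a data frame like the one I described, and his answer solves my question perfectly.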

set.seed(2)
#make junk data
dat <- data.frame(user=1:5,
                  time=1:20,
                  url=paste0("https://domain.com/forum/thread?thread_id=",sample(5,20,TRUE)))

1 answer:

Answer 0 (score: 1)

Pretty sure this will work for you:

> library(plyr)
> library(doMC)
> library(reshape2)
> 
> set.seed(2)
> #make junk data
> dat <- data.frame(user=1:5,
+                   time=1:20,
+                   url=paste0("https://domain.com/forum/thread?thread_id=",sample(5,20,T)))
> head(dat)
  user time                                         url
1    1    1 https://domain.com/forum/thread?thread_id=1
2    2    2 https://domain.com/forum/thread?thread_id=4
3    3    3 https://domain.com/forum/thread?thread_id=3
4    4    4 https://domain.com/forum/thread?thread_id=1
5    5    5 https://domain.com/forum/thread?thread_id=5
6    1    6 https://domain.com/forum/thread?thread_id=5
> #subset within time range
> dat <- dat[dat$time >=1 & dat$time <= 20,]
> 
> #make threadID variable
> dat$threadid <- gsub("^.*thread_id=",'',dat$url)
> 
> 
> #register parallel cores
> registerDoMC(4)
> #count number of thread occurrences for each user (in parallel)
> dat.new <- ddply(dat,.(user,threadid),summarize,threadcount=length(threadid),.parallel=TRUE)
> #reshape data to be in the format you want
> dat.new <- dcast(dat.new,user~threadid,value.var="threadcount",fill=0)
> #add total views
> dat.new$totalview <- rowSums(dat.new[,-1])
> dat.new
  user 1 2 3 4 5 totalview
1    1 1 0 1 0 2         4
2    2 1 1 0 1 1         4
3    3 0 1 1 1 1         4
4    4 2 0 2 0 0         4
5    5 1 0 2 0 1         4
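
As far as I know, doMC relies on forking and so is not usable on Windows. As a possible alternative, here is a sketch of the same counting step using only the parallel package that ships with R. It is not the method used above. It relies on counts being additive: rows are split into arbitrary chunks, each chunk is tabulated separately, and the partial tables are summed. The toy data, n_chunks and n_cores are assumptions for illustration.

```r
library(parallel)

# Hypothetical toy data; thread_id is assumed to be extracted already
visits <- data.frame(
  user      = c(1, 1, 2, 2, 3, 3, 3, 4),
  thread_id = c("1", "5", "4", "4", "3", "1", "1", "2"),
  stringsAsFactors = FALSE
)

# Fix the factor levels up front so every chunk yields a table of the same shape
visits$user      <- factor(visits$user)
visits$thread_id <- factor(visits$thread_id)

# Deal the rows out into chunks; a user may be spread over several chunks
n_chunks <- 3
chunk_id <- rep(seq_len(n_chunks), length.out = nrow(visits))
chunks   <- split(visits, chunk_id)

# Tabulate one chunk; unused levels still appear with a count of zero
count_chunk <- function(chunk) {
  table(user = chunk$user, thread_id = chunk$thread_id)
}

# mclapply forks, so fall back to a single core on Windows
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L
partial <- mclapply(chunks, count_chunk, mc.cores = n_cores)

# Sum the partial tables, then reshape to one row per user plus a total column
counts <- Reduce(`+`, partial)
wide <- as.data.frame.matrix(counts)
wide$totalview <- rowSums(wide)
wide
```

Whether this actually beats a single call to table() depends on the data size and the cost of forking, so it is worth timing both on a sample first.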