I read a file into ratingsFile using
ratingsFile <- readLines("~/ratings.list", encoding = "UTF-8")
The first few lines of the file look like this:
0000000125 1478759 9.2 The Shawshank Redemption (1994)
0000000125 1014575 9.2 The Godfather (1972)
0000000124 683611 9.0 The Godfather: Part II (1974)
0000000124 1451861 8.9 The Dark Knight (2008)
0000000124 1150611 8.9 Pulp Fiction (1994)
0000000133 750978 8.9 Schindler's List (1993)
Using regular expressions, I extract the tokens and the rating from each line:
match <- gregexpr("[0-9A-Za-z;'$%&?@./]+",ratingsFile)
match <- regmatches(ratingsFile,match)
next_match <- gregexpr("[0-9][.][0-9]+",ratingsFile)
next_match <- regmatches(ratingsFile,next_match)
Sample output for one element of match looks like:
"0000000125" "1014575" "9.2" "The" "Godfather" "1972"
To clean the data and get it into the form I need:
movies_name <- character(0)
rating <- character(0)
for (i in 1:length(match)) {
  match[[i]] <- match[[i]][-1:-3]  # drop the first three tokens (not needed)
  len <- length(match[[i]])
  match[[i]] <- match[[i]][-len]   # drop the last token (the year), also not needed
  movies_name <- append(movies_name, paste(match[[i]], collapse = " "))  # append the movie name
  rating <- append(rating, next_match[[i]])  # append the rating
}
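This last piece of code takes far too long to run. I left it running for several hours and it still had not finished; the file has 636497 lines.
How can I reduce the execution time in this case?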
Answer 0 (score: 2)
Try this:
ratingsFile <- readLines(n = 6)
0000000125 1478759 9.2 The Shawshank Redemption (1994)
0000000125 1014575 9.2 The Godfather (1972)
0000000124 683611 9.0 The Godfather: Part II (1974)
0000000124 1451861 8.9 The Dark Knight (2008)
0000000124 1150611 8.9 Pulp Fiction (1994)
0000000133 750978 8.9 Schindler's List (1993)
setNames(as.data.frame(t(sapply(regmatches(ratingsFile, regexec("\\d{10}\\s+\\d+\\s+([0-9.]+)\\s+(.*?)\\s\\(\\d{4}\\)", ratingsFile)), "[", -1))), c("rating", "movie_name"))
# rating movie_name
# 1 9.2 The Shawshank Redemption
# 2 9.2 The Godfather
# 3 9.0 The Godfather: Part II
# 4 8.9 The Dark Knight
# 5 8.9 Pulp Fiction
# 6 8.9 Schindler's List
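For a file with several hundred thousand lines, a fully vectorized variant may be worth trying. The following is a minimal sketch, not the answer's own method: it assumes every line follows the layout of the sample (two numeric columns, a rating, a title, and a four-digit year in parentheses). The names pattern and movies are illustrative only.

```r
# Sample lines in the same layout as the question's file (assumed format).
ratingsFile <- c(
  "0000000125 1478759 9.2 The Shawshank Redemption (1994)",
  "0000000125 1014575 9.2 The Godfather (1972)",
  "0000000124 683611 9.0 The Godfather: Part II (1974)"
)

# Group 1 captures the rating, group 2 the title without the trailing year.
pattern <- "^\\s*\\d+\\s+\\d+\\s+([0-9.]+)\\s+(.*?)\\s+\\(\\d{4}\\)\\s*$"

# sub() works on the whole character vector at once, so no explicit loop
# and no repeated append() is needed.
# Caveat: a line that does not match the pattern is returned unchanged by
# sub(), so as.numeric() would yield NA (with a warning) for that line.
movies <- data.frame(
  rating     = as.numeric(sub(pattern, "\\1", ratingsFile, perl = TRUE)),
  movie_name = sub(pattern, "\\2", ratingsFile, perl = TRUE),
  stringsAsFactors = FALSE
)
```

The original loop is likely slow because append() copies the growing vector on every iteration; building each column in a single vectorized call avoids that.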
Answer 1 (score: 2)
If I understand what you are trying to do (get only the movie titles), here is another way to get what you want:
unlist(lapply(strsplit(ratingsFile, "\\s{2,}"), # split each line whenever there are at least 2 spaces
function(x){ # for each resulting vector
x <- gsub(" \\(\\d{4}\\)$", "", tail(x, 1)) # keep only the needed part (movie title)
x
}))
# [1] "The Shawshank Redemption" "The Godfather" "The Godfather: Part II" "The Dark Knight" "Pulp Fiction"
# [6] "Schindler's List"
NB: you can put the resulting vector into a data.frame and/or keep the other fields from each line.
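As a rough illustration of that note, the sketch below keeps the other fields alongside the title. It assumes the real file separates columns with two or more spaces, which the split pattern "\\s{2,}" implies (the sample shown in the question has its spacing collapsed). The column names votes, rating, and movie_name are hypothetical labels chosen for this example.

```r
# Assumed layout: leading spaces, then columns separated by 2+ spaces.
ratingsFile <- c(
  "      0000000125  1478759   9.2  The Shawshank Redemption (1994)",
  "      0000000124  683611   9.0  The Godfather: Part II (1974)"
)

# Split each line on runs of two or more whitespace characters.
parts <- strsplit(ratingsFile, "\\s{2,}")

# Keep the last four fields of every line (this skips the empty string
# produced by the leading spaces) and stack them into a character matrix.
fields <- do.call(rbind, lapply(parts, tail, n = 4))

movies <- data.frame(
  votes      = as.integer(fields[, 2]),
  rating     = as.numeric(fields[, 3]),
  movie_name = sub(" \\(\\d{4}\\)$", "", fields[, 4]),  # strip the year
  stringsAsFactors = FALSE
)
```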
Answer 2 (score: 1)
If you want to find and use particular pieces of the data, I think you can use this regular expression:
/^ *(\d*) *(\d*) *(\d+\.\d+)(.*)\((\d+)\)$/gm
with a substitution
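One way this expression might be applied in R is sketched below. This is an assumption about usage, not part of the answer: the /gm flags are dropped because readLines() already yields one string per line, and the fourth group keeps the spaces around the title, so trimws() is used to remove them.

```r
ratingsFile <- c(
  "0000000125 1478759 9.2 The Shawshank Redemption (1994)",
  "0000000124 683611 9.0 The Godfather: Part II (1974)"
)

# The answer's expression, with backslashes doubled for an R string.
pattern <- "^ *(\\d*) *(\\d*) *(\\d+\\.\\d+)(.*)\\((\\d+)\\)$"

# Each sub() call replaces the whole line with one captured group.
rating <- as.numeric(sub(pattern, "\\3", ratingsFile, perl = TRUE))
title  <- trimws(sub(pattern, "\\4", ratingsFile, perl = TRUE))
year   <- as.integer(sub(pattern, "\\5", ratingsFile, perl = TRUE))
```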