Regex taking too much time to compile in R

Time: 2015-07-28 09:32:35

Tags: regex r time-complexity text-mining

I read a file into ratingsFile using
ratingsFile <- readLines("~/ratings.list",encoding = "UTF-8")

The first few lines of the file look like

  0000000125  1478759   9.2  The Shawshank Redemption (1994)
  0000000125  1014575   9.2  The Godfather (1972)
  0000000124  683611   9.0  The Godfather: Part II (1974)
  0000000124  1451861   8.9  The Dark Knight (2008)
  0000000124  1150611   8.9  Pulp Fiction (1994)
  0000000133  750978   8.9  Schindler's List (1993)

Using regular expressions I extract

  match <- gregexpr("[0-9A-Za-z;'$%&?@./]+",ratingsFile)
  match <- regmatches(ratingsFile,match)


  next_match <- gregexpr("[0-9][.][0-9]+",ratingsFile)
  next_match <- regmatches(ratingsFile,next_match)

A sample element of match looks like

  "0000000125" "1014575"    "9.2"        "The"        "Godfather"  "1972"  

To clean the data and reshape it into the form I need

  movies_name <- character(0)
  rating <- character(0)
  for(i in 1:length(match)){

      match[[i]]<-match[[i]][-1:-3] # drop the first three fields (not needed)
      len <- length(match[[i]])
      match[[i]]<-match[[i]][-len] # drop the last field (the year), also not needed
      movies_name<-append(movies_name,paste(match[[i]],collapse =" "))
      # append the movie name
      rating <- append(rating,next_match[[i]]) 
      # append the rating
}

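This last piece of code takes far too long to execute. I left it running for several hours and it still had not finished, because the file is 636497 lines long.

How can I reduce the execution time in this case?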

3 Answers:

Answer 0: (score: 2)

Try this:

ratingsFile <- readLines(n = 6)
0000000125  1478759   9.2  The Shawshank Redemption (1994)
0000000125  1014575   9.2  The Godfather (1972)
0000000124  683611   9.0  The Godfather: Part II (1974)
0000000124  1451861   8.9  The Dark Knight (2008)
0000000124  1150611   8.9  Pulp Fiction (1994)
0000000133  750978   8.9  Schindler's List (1993)
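# capture the rating and the movie title from every line
m <- regmatches(ratingsFile, regexec("\\d{10}\\s+\\d+\\s+([0-9.]+)\\s+(.*?)\\s\\(\\d{4}\\)", ratingsFile))
# drop the full match, keep the two capture groups, one row per line
setNames(as.data.frame(t(sapply(m, "[", -1))), c("rating", "movie_name"))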
#   rating               movie_name
# 1    9.2 The Shawshank Redemption
# 2    9.2            The Godfather
# 3    9.0   The Godfather: Part II
# 4    8.9          The Dark Knight
# 5    8.9             Pulp Fiction
# 6    8.9         Schindler's List

Answer 1: (score: 2)

If I understand what you are trying to do (get only the movie titles), here is another way to get what you want:

unlist(lapply(strsplit(ratingsFile, "\\s{2,}"), # split each line whenever there are at least 2 spaces
                                 function(x){ # for each resulting vector
                                    x <- gsub(" \\(\\d{4}\\)$", "", tail(x, 1)) # keep only the needed part (movie title)
                                    x
                                 }))

# [1] "The Shawshank Redemption" "The Godfather"            "The Godfather: Part II"   "The Dark Knight"          "Pulp Fiction"            
# [6] "Schindler's List"

NB: you can put the resulting vector in a data.frame and/or keep the other information from each line.
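
As a rough sketch of that note, the fields produced by the split can be collected into a data.frame in one pass. The sample vector and the column names (votes, rating, title, year) are illustrative assumptions, and the code assumes every line splits into exactly four fields, which may not hold for every line of the real file.

```r
# Sample lines in the same layout as the question's file (assumed representative)
ratingsFile <- c(
  "  0000000125  1478759   9.2  The Shawshank Redemption (1994)",
  "  0000000124  683611   9.0  The Godfather: Part II (1974)"
)

# Trim leading whitespace, then split on runs of two or more spaces
parts <- strsplit(sub("^\\s+", "", ratingsFile), "\\s{2,}")

# Assumption: each line yields exactly four fields, so rbind gives a 4-column matrix
fields <- do.call(rbind, parts)

# Column names are hypothetical; the first field is ignored here
movies <- data.frame(
  votes  = as.integer(fields[, 2]),
  rating = as.numeric(fields[, 3]),
  title  = sub(" \\(\\d{4}\\)$", "", fields[, 4]),
  year   = as.integer(sub("^.* \\((\\d{4})\\)$", "\\1", fields[, 4])),
  stringsAsFactors = FALSE
)
movies
```

Because every step works on whole vectors, nothing is appended inside a loop, which is where the code in the question spends its time.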

Answer 2: (score: 1)

If you want to find and use particular fields in your data, I think you can use this regex:

/^ *(\d*) *(\d*) *(\d+\.\d+)(.*)\((\d+)\)$/gm

with the substitutions

  • $1 => first column
  • $2 => second column
  • $3 => third column (probably the rating)
  • $4 => movie name
  • $5 => movie year
