Removing lines that contain special characters

Time: 2016-03-31 21:30:00

Tags: regex r grepl

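I am filtering a large dataset that I read in as a list of lines. I need to filter out lines with special markers, and I am stuck on a few of them. This is what I have so far: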

library(R.utils)
library(stringr)

gunzip("movies.list.gz") #open file
movies <- readLines("movies.list") #read lines in
movies <- gsub("[\t]", '', movies) #remove tabs (\t)
#movies <- gsub(, '', movies)
a <- movies[!grepl("\\{", movies)] # removed any line that contained special character {
b <- a[!grepl("\\(V)", a)] #remove porn?
c <- b[!grepl("\\(TV)", b)] #remove tv
d <- c[!grepl("\\(VG)", c)] #remove video games
e <- d[!grepl("\\(\\?\\?\\?\\?\\)", d)] #remove anything with an unknown date, e.g. (????)
f <- e[!grepl("\\#)", e)] 
g <- e[!grepl("\\!)", f)]


i <- data.frame(g)
i <- i[-c(1:15),]
i <- data.frame(i)
i$Date <- lapply(strsplit(as.character(i$i), "\\(....\\)"), "[", 2)
i$Title <- lapply(strsplit(as.character(i$i), "\\(....\\)"), "[", 1)

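I still need to do some cleanup and drop the original column (i), but as you can see from the output, the lines containing the special characters ! or # were not removed.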

> head(i)
                                i      Date                Title
1            "!Next?" (1994)1994-1995 1994-1995            "!Next?" 
2         "#1 Single" (2006)2006-???? 2006-????         "#1 Single" 
3 "#1MinuteNightmare" (2014)2014-???? 2014-???? "#1MinuteNightmare" 
4           "#30Nods" (2014)2014-2015 2014-2015           "#30Nods" 
5       "#7DaysLater" (2013)2013-???? 2013-????       "#7DaysLater" 
6            "#ATown" (2014)2014-???? 2014-????            "#ATown" 

What I really want is to remove every line that contains these special characters. Everything I have tried throws an error. Any suggestions?

2 Answers:

Answer 0: (score: 0)

You can strip out any character that is not alphanumeric, a "-", or a parenthesis like this:

gsub("[^A-Za-z0-9()-]", "", row)
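Note that gsub() edits the characters inside each string and keeps every line, so it cleans titles rather than dropping rows. A minimal sketch on a few made-up lines (the `rows` vector is hypothetical sample data, and the pattern is a variant of the one above that also keeps digits):

```r
# Hypothetical sample lines in the style of the question's data
rows <- c("\"!Next?\" (1994)", "\"#ATown\" (2014)", "Alien (1979)")

# Strip every character that is not a letter, digit, parenthesis or hyphen.
# The hyphen comes last in the class so it is read literally, not as a range.
cleaned <- gsub("[^A-Za-z0-9()-]", "", rows)

print(cleaned)

# gsub() returns one string per input string: no lines were dropped
length(cleaned) == length(rows)
```

Spaces and quotes are removed as well, because they are not in the allowed set.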

Answer 1: (score: 0)

To remove the lines themselves, you can try something like the following:

data[!grepl(pattern = "[#!]", x = data)]
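As a rough illustration of how this applies to the question, the sketch below filters a small made-up vector in a single step (`movies_sample` is hypothetical; the real input would be the vector returned by readLines()). Doing it in one step also sidesteps a mismatch in the question's last line, where `e` is subset with a logical vector computed from `f`.

```r
# Hypothetical sample lines; not taken from the actual movies.list file
movies_sample <- c("\"!Next?\" (1994)", "\"#1 Single\" (2006)",
                   "Alien (1979)", "Brazil (1985)")

# TRUE for every line that contains '#' or '!' anywhere in the string
has_special <- grepl("[#!]", movies_sample)

# Keep only the lines without those characters; the logical vector is
# computed from the same vector that is being subset
kept <- movies_sample[!has_special]
print(kept)
```

Inside a bracket expression neither `#` nor `!` needs escaping, which is why `"[#!]"` works where `"\\#)"` only matched a literal `#)`.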

If you want to remove all lines that contain any special character, you can use the character class suggested by @luke1018 together with grepl:

data[!grepl(pattern = "[^A-Za-z0-9()-]", x = data)]
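One caveat with a whitelist like this: spaces and double quotes are not in the allowed set, so lines such as `Alien (1979)` would be dropped too. The sketch below compares the strict class (written with the hyphen last so it is read literally) with a looser one that also allows spaces and quotes. The looser pattern and the `lines_sample` data are assumptions for illustration, not part of the original answer.

```r
# Hypothetical sample lines
lines_sample <- c("\"#ATown\" (2014)", "\"Lost\" (2004)", "Alien (1979)", "Se7en!")

# Strict whitelist: any space or quote is enough to drop the line
strict <- lines_sample[!grepl("[^A-Za-z0-9()-]", lines_sample)]

# Looser whitelist (assumption): also allow double quotes and spaces
loose <- lines_sample[!grepl("[^A-Za-z0-9()\" -]", lines_sample)]

print(strict)
print(loose)
```

On this sample the strict filter keeps nothing, while the looser one keeps the two lines that have no `#` or `!`.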