Question

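I am preprocessing a data frame of more than 100,000 blog URLs, many of which include content from the blog title. The grep function lets me drop many of these URLs because they relate to archives, feeds, images, attachments, or a variety of other reasons. One of those reasons is that they contain "atom".

For example,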

string <- "http://www.example.com/2014/05/update-on-atomic-energy-legislation/feed/atom/archives/"
row <- "one" 
df <- data.frame(row, string) 
df$string <- as.character(df$string)
df[-grep("atom", df$string), ]

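My problem is that the pattern "atom" may also appear in the blog title, which is meaningful content, and I do not want to drop those URLs.

How can I restrict grep to the last 20 characters (or some number that greatly reduces the risk of matching the pattern in content rather than in the trailing elements)? This question uses $ at the end, but not in R; besides, I do not know how to extend $ to cover 20 characters: "Regular Expressions _# at end of string".

Assume the pattern does not always have a forward slash at either or both ends, as in /atom/.

The function substr can isolate the end portion of a string, but I do not know how to grep only within that portion. The pseudocode below uses the %in% function to try to illustrate what I want to do.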

substr(df$string, nchar(df$string)-20, nchar(df$string)) # extract the tail of the string: start at nchar - 20, stop at nchar

But what is the next step?

string[-grep(pattern = "atom" %in% (substr(string, nchar(string)-20, nchar(string))), x = string)]

Thank you for any guidance.

Answer 1

lastpart=substr(df$string, nchar(df$string)-20, nchar(df$string))
if(length(grep("atom",lastpart))>0){
  # atom was in there
} else {
  # atom was not in there
}

It can also be done without lastpart:

if(length(grep("atom",substr(df$string, nchar(df$string)-20, nchar(df$string))))>0){
  # atom was in there
} else {
  # atom was not in there
}

But this becomes harder to read (though it may perform slightly better).
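The if/else above tests the whole column at once, so it reports whether any URL matches rather than which ones. A minimal sketch of a per-row version follows, assuming a small made-up data frame (the second and third URLs are invented for illustration) and using grepl, which returns one TRUE/FALSE per element:

```r
# Toy data; only the first URL comes from the question, the others are made up.
df <- data.frame(
  row = c("one", "two", "three"),
  string = c(
    "http://www.example.com/2014/05/update-on-atomic-energy-legislation/feed/atom/archives/",
    "http://www.example.com/2014/05/update-on-atomic-energy-legislation/",
    "http://www.example.com/2014/06/some-other-post/"
  ),
  stringsAsFactors = FALSE
)

n <- nchar(df$string)
# Last 20 characters of each URL; pmax() guards against URLs shorter than 20.
lastpart <- substr(df$string, pmax(n - 19, 1), n)

# One logical value per row: does the tail contain the literal text "atom"?
has_atom <- grepl("atom", lastpart, fixed = TRUE)

# Keep only the rows whose tail does not contain "atom".
kept <- df[!has_atom, ]
```

Indexing with !has_atom is also safer than df[-grep(...), ]: when nothing matches, grep returns an empty vector and the negative index then selects zero rows instead of all of them.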

Answer 2

You could try a URL component depth approach (i.e. return only the rows of df in which the word "atom" appears after 5 slashes):

find_first_match <- function(string, pattern) {
  # Split the URL into its "/"-separated components
  components <- unlist(strsplit(x = string, split = "/", fixed = TRUE), use.names = FALSE)
  matches <- grepl(pattern = pattern, x = components)
  if (any(matches)) {
    # which.max() on a logical vector gives the index of the first TRUE
    first.match <- which.max(matches)
  } else {
    first.match <- NA_integer_
  }
  return(first.match)
}

It can be used as follows:

# Add index for first component match of "atom" in url
df$first.match <- sapply(df$string, find_first_match, pattern = "atom", USE.NAMES = FALSE)

# Return rows which have the word "atom" only after the first 5 components
# (which() drops rows where first.match is NA)
df[which(df$first.match >= 6), ]

#   row                                                                                 string first.match
# 1 one http://www.example.com/2014/05/update-on-atomic-energy-legislation/feed/atom/archives/           6

This lets you control which URLs are returned based on the depth at which "atom" appears.

Answer 3

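I chose the substr answer (Answer 1 above) because it is easier for me to understand, and because with the component depth answer it is impossible to predict how many forward slashes to include in the "component depth".

Read from the innermost function to the outermost, the substr answer says in plain English: with the substr() function, define a substring holding the last 20 characters of the string;

then look for the pattern "atom" in that substring with the grep() function;

then check whether "atom" was found one or more times in the substring, i.e. whether length is greater than zero, in which case the row is to be omitted;

finally, if the pattern does not match, i.e. "atom" is not found in the last 20 characters, leave the row alone. All of this is wrapped in an if ... else construct.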
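The question also asked how to stretch $ over the last 20 characters. One way to sketch that, as an alternative to substr rather than the method of either answer, is to allow a bounded number of characters between the pattern and the end of the string. The example URLs and the variable names below are assumptions for illustration:

```r
# Only the first URL comes from the question; the second is made up.
urls <- c(
  "http://www.example.com/2014/05/update-on-atomic-energy-legislation/feed/atom/archives/",
  "http://www.example.com/2014/05/update-on-atomic-energy-legislation/"
)

window <- 20                      # how many trailing characters to search
gap <- window - nchar("atom")     # characters allowed after "atom" (16 here)

# "atom" followed by at most `gap` characters, then the end of the string.
pattern <- sprintf("atom.{0,%d}$", gap)

in_tail <- grepl(pattern, urls)   # one TRUE/FALSE per URL
urls_kept <- urls[!in_tail]
```

In the first URL the "atom" inside "atomic" is too far from the end to match, but the later "/atom/" component is within range, so that URL is flagged. The second URL only has "atom" in the title and is kept.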

Use a regular expression only within the end portion of a string
