Question

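I have a very unstructured text file that I read in with readLines. I want to replace certain strings with other strings held in a variable (called "new" below).

Below, I want the manipulated text to contain each of the terms "one", "two", "three" and "four" exactly once, in place of the "change" strings. However, as you can see, sub replaces the first match of the pattern in every element, whereas I need the code to ignore the fact that a new quoted string (a new vector element) begins.

See the example code and data below.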

 #text to be changed
 text <- c("TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT change",
        "TEXT TEXT TEXT change TEXT TEXT TEXT TEXT TEXT change", 
        "TEXT TEXT TEXT change TEXT TEXT TEXT TEXT")

 #Variable containing input for text
 new <- c("one", "two", "three", "four")
 #For loop that I want to include 
 for (i in 1:length(new)) {

   text  <- sub(pattern = "change", replace = new[i], x = text)

 }
 text
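For reference, the behaviour described above can be reproduced as follows. This is only a restatement of the loop in the question; the names result and expected_wrong are introduced here for illustration. Because sub() is vectorised over x and replaces only the first match in each element, "one" lands in all three elements and "three" and "four" are never used.

```r
text <- c("TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT change",
          "TEXT TEXT TEXT change TEXT TEXT TEXT TEXT TEXT change",
          "TEXT TEXT TEXT change TEXT TEXT TEXT TEXT")
new <- c("one", "two", "three", "four")

# Same loop as in the question, written to a separate variable
result <- text
for (i in seq_along(new)) {
  # sub() touches the first remaining "change" in EVERY element
  result <- sub(pattern = "change", replacement = new[i], x = result)
}

# What the loop actually produces (not what is wanted)
expected_wrong <- c("TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT one",
                    "TEXT TEXT TEXT one TEXT TEXT TEXT TEXT TEXT two",
                    "TEXT TEXT TEXT one TEXT TEXT TEXT TEXT")
identical(result, expected_wrong)
```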

Answer 1

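How about this? The logic is to hammer away at one string until it no longer contains change. On each "hit" (each place where change is found), move one step along the new vector.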

text <- c("TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT change",
          "TEXT TEXT TEXT change TEXT TEXT TEXT TEXT TEXT change", 
          "TEXT TEXT TEXT change TEXT TEXT TEXT TEXT")

#Variable containing input for text
new <- c("one", "two", "three", "four")
new.i <- 1

for (i in seq_along(text)) {
  # Replace the first remaining "change" in this element,
  # then advance to the next entry of new
  while (grepl(pattern = "change", text[i])) {
    text[i] <- sub(pattern = "change", replacement = new[new.i], x = text[i])
    new.i <- new.i + 1
  }
}
text

[1] "TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT one" 
[2] "TEXT TEXT TEXT two TEXT TEXT TEXT TEXT TEXT three"
[3] "TEXT TEXT TEXT four TEXT TEXT TEXT TEXT"
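The loop above assumes that new has at least as many entries as there are occurrences of change. As a hedged sketch, the same idea can be wrapped in a function that stops once the replacements run out. The function name replace_sequentially and its pattern argument are made up for this illustration, and fixed = TRUE is an added assumption so that the pattern is treated as a literal string.

```r
# Hypothetical helper: same while-loop logic, with a guard on new
replace_sequentially <- function(text, new, pattern = "change") {
  new_i <- 1
  for (i in seq_along(text)) {
    # Stop early when there are no replacements left
    while (new_i <= length(new) &&
           grepl(pattern, text[i], fixed = TRUE)) {
      text[i] <- sub(pattern, new[new_i], text[i], fixed = TRUE)
      new_i <- new_i + 1
    }
  }
  text
}

text <- c("TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT change",
          "TEXT TEXT TEXT change TEXT TEXT TEXT TEXT TEXT change",
          "TEXT TEXT TEXT change TEXT TEXT TEXT TEXT")
new <- c("one", "two", "three", "four")

full  <- replace_sequentially(text, new)
short <- replace_sequentially(text, new[1:2])
```

With only two replacements available, the occurrences of change that come after the second one are left untouched.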

Answer 2

Here is another solution, using gregexpr() and regmatches():

#text to be changed
text <- c("TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT change",
          "TEXT TEXT TEXT change TEXT TEXT TEXT TEXT TEXT change",
          "TEXT TEXT TEXT change TEXT TEXT TEXT TEXT")

#Variable containing input for text
new <- c("one", "two", "three", "four")

# Alter the structure of text
altered_text <- paste(text, collapse = "\n")

# So we can use gregexpr and regmatches to get what you want
matches <- gregexpr("change", altered_text)
regmatches(altered_text, matches) <- list(new)

# And here's the result
cat(altered_text)
#> TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT one
#> TEXT TEXT TEXT two TEXT TEXT TEXT TEXT TEXT three
#> TEXT TEXT TEXT four TEXT TEXT TEXT TEXT

# Or, putting the text back to its old structure
# (one element for each line)
unlist(strsplit(altered_text, "\n"))
#> [1] "TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT one" 
#> [2] "TEXT TEXT TEXT two TEXT TEXT TEXT TEXT TEXT three"
#> [3] "TEXT TEXT TEXT four TEXT TEXT TEXT TEXT"

Created on 2018-10-16 by the reprex package (v0.2.1)

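We can do this because gregexpr() finds all matches of "change" in the text; from help("gregexpr"):

regexpr returns an integer vector of the same length as text giving the starting position of the first match. ...

gregexpr returns a list of the same length as text each element of which is of the same form as the return value for regexpr, except that the starting positions of every (disjoint) match are given.

(Emphasis added.)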
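The difference between the two functions is easy to check on a string with two matches. This is a small illustration only; the sample string s and the variable names are made up here.

```r
s <- "a change b change"

first_only  <- regexpr("change", s)   # position of the first match only
all_matches <- gregexpr("change", s)  # list with one element per input string

as.integer(first_only)        # 3
as.integer(all_matches[[1]])  # 3 12
```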

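Then regmatches() can be used either to extract the matches found by gregexpr() or to replace them; from help("regmatches"):

Usage

regmatches(x, m, invert = FALSE)
regmatches(x, m, invert = FALSE) <- value

...

value
  an object with suitable replacement values for the matched or non-matched substrings (see Details).

...

Details

The replacement function can be used for replacing the matched or non-matched substrings. For vector match data, if invert is FALSE, value should be a character vector with length the number of matched elements in m. Otherwise, it should be a list of character vectors with the same length as m, each as long as the number of replacements needed.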
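Since gregexpr() returns list match data, the replacement value can also be supplied as a list with one character vector per element of text, which avoids collapsing the text into a single string. The following is a sketch of that variant, not the answer's own method; it assumes new has exactly as many entries as there are matches, and the names n_matches and groups are introduced here for illustration.

```r
text <- c("TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT change",
          "TEXT TEXT TEXT change TEXT TEXT TEXT TEXT TEXT change",
          "TEXT TEXT TEXT change TEXT TEXT TEXT TEXT")
new <- c("one", "two", "three", "four")

matches <- gregexpr("change", text, fixed = TRUE)

# Number of matches per element (gregexpr gives -1 when there is none)
n_matches <- vapply(matches, function(m) sum(m > 0), integer(1))

# Assign each replacement to the element of text it belongs to
groups <- factor(rep(seq_along(text), n_matches), levels = seq_along(text))

# One character vector of replacements per element of text
regmatches(text, matches) <- unname(split(new, groups))
text
```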

Answer 3

Another approach, using strsplit:

tl <- lapply(text, function(s) strsplit(s, split = " ")[[1]])
df <- stack(setNames(tl, seq_along(tl)))

ix <- df$values == "change"
df[ix, "values"] <- new
tapply(df$values, df$ind, paste, collapse = " ")

which gives:

                                                  1 
 "TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT one" 
                                                  2 
"TEXT TEXT TEXT two TEXT TEXT TEXT TEXT TEXT three" 
                                                  3 
          "TEXT TEXT TEXT four TEXT TEXT TEXT TEXT"

Additionally, you can wrap the tapply call in unname:

 unname(tapply(df$values, df$ind, paste, collapse = " "))

which gives:

[1] "TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT one" 
[2] "TEXT TEXT TEXT two TEXT TEXT TEXT TEXT TEXT three"
[3] "TEXT TEXT TEXT four TEXT TEXT TEXT TEXT"

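If you want each element of new to be used only once, for example when there are fewer replacements than occurrences of change, you can update the code as follows. Note that df has to be rebuilt first, because the code above has already replaced every change in it:

newnew <- new[1:3]

# Rebuild df from the original text
tl <- lapply(text, function(s) strsplit(s, split = " ")[[1]])
df <- stack(setNames(tl, seq_along(tl)))

ix <- df$values == "change"
df[ix, "values"][1:length(newnew)] <- newnew
unname(tapply(df$values, df$ind, paste, collapse = " "))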

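You can alter this further to handle the case where there are more replacements than positions to be replaced (the occurrences of change in the example):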

newnew2 <- c(new, "five")

tl <- lapply(text, function(s) strsplit(s, split = " ")[[1]])
df <- stack(setNames(tl, seq_along(tl)))

ix <- df$values == "change"
# Use only as many replacements as there are positions, and vice versa
n <- min(sum(ix), length(newnew2))
df[ix, "values"][seq_len(n)] <- newnew2[seq_len(n)]
unname(tapply(df$values, df$ind, paste, collapse = " "))
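The token-based idea can also be sketched without the intermediate data frame. The function below is an illustration under stated assumptions, not part of the answer: the name replace_tokens and its target argument are hypothetical, tokens are assumed to be separated by single spaces, and only tokens that are exactly equal to the target are replaced.

```r
# Hypothetical helper: split into tokens, replace in order, paste back
replace_tokens <- function(text, new, target = "change") {
  tokens  <- strsplit(text, " ", fixed = TRUE)
  line_id <- factor(rep(seq_along(tokens), lengths(tokens)),
                    levels = seq_along(tokens))
  flat <- unlist(tokens)

  # Positions of the target, in reading order across all lines
  hits <- which(flat == target)
  # Use no more replacements than positions, and no more positions than replacements
  n <- min(length(hits), length(new))
  flat[hits[seq_len(n)]] <- new[seq_len(n)]

  # Reassemble one string per original element
  unname(vapply(split(flat, line_id), paste, character(1), collapse = " "))
}

text <- c("TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT change",
          "TEXT TEXT TEXT change TEXT TEXT TEXT TEXT TEXT change",
          "TEXT TEXT TEXT change TEXT TEXT TEXT TEXT")
new <- c("one", "two", "three", "four")

exact    <- replace_tokens(text, new)
too_few  <- replace_tokens(text, new[1:3])
too_many <- replace_tokens(text, c(new, "five"))
```

With three replacements the last change is left in place, and with five the extra "five" is simply ignored.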

Replacing words in an unstructured text file using a for loop
