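I have a column with 5000 rows. My goal is to check each row for words that also appear in the neighbouring row. For example:
Row 1: My name is Bobby
Row 2: My name is Boby
Row 3: This is your house
In the example above, rows 1 and 2 share 3 words, while rows 2 and 3 share only 1. I want rows that share 3 or more words to be made identical. For example:
My name is Bobby
My name is Bobby
This is your house
I am new to R. Can you help me?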
Answer 0: (score: 0)
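A solution using tidyverse. I created an example data frame called dat with five rows. Note that the column is of type character, not factor. Pay attention to the result of this example. Rows 3 and 4 are quite different, but because they share 3 words, and because row 3 is similar to rows 2 and 1, row 4 ends up being replaced by row 1. Maybe that is acceptable. I just want to point out that the rule you described can lead to this outcome.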
library(tidyverse)
dat2 <- dat %>%
  # Split the sentence
  mutate(V2 = str_split(V1, pattern = " ")) %>%
  # Create a new column for the next word
  mutate(V3 = lead(V2)) %>%
  # Count the number of intersection
  mutate(V4 = lag(map2_int(V2, V3, ~length(intersect(.x, .y))),
                  default = 0L)) %>%
  # If >= 3 words are the same, set to be NA, otherwise the same as V1
  mutate(V5 = if_else(V4 >= 3, NA_character_, V1)) %>%
  # Fill the NA based on the previous row
  fill(V5) %>%
  # Select column V1 and V5
  select(V1, V5)
dat2
#                            V1               V5
# 1            My name is Bobby My name is Bobby
# 2             My name is Boby My name is Bobby
# 3              My name is Boy My name is Bobby
# 4 This is your house name Boy My name is Bobby
# 5                R is awesome       R is awesome
Data
dat <- read.table(text = "'My name is Bobby'
'My name is Boby'
'My name is Boy'
'This is your house name Boy'
'R is awesome'",
stringsAsFactors = FALSE)
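To see why row 4 ends up replaced in this example, it can help to look at the shared-word counts directly. The following is a minimal base R sketch, not part of the pipeline above; `sentences` repeats the sample data, and `shared_with_previous` is a name chosen only for this illustration.

```r
# Sketch: count the words each row shares with the row directly before it.
# `sentences` and `shared_with_previous` are illustrative names (assumptions).
sentences <- c("My name is Bobby",
               "My name is Boby",
               "My name is Boy",
               "This is your house name Boy",
               "R is awesome")

# Split each sentence on single spaces, as the pipeline does with str_split().
words <- strsplit(sentences, " ", fixed = TRUE)

# The first row has no predecessor, so its count is set to 0.
shared_with_previous <- c(0L, vapply(
  seq_along(words)[-1],
  function(i) length(intersect(words[[i]], words[[i - 1]])),
  integer(1)
))

shared_with_previous
# [1] 0 3 3 3 1
```

Rows 2, 3 and 4 each share three words with the original row before them, which is why the fill step carries row 1 all the way down to row 4.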
Answer 1: (score: 0)
No need for the "verse" and its 57 compiled package dependencies:
library(stringi) # helpful string function that stringr builds a crutch around
data.frame(
V1 = c("My name is Bobby", "My name is Boby", "This is your house"),
stringsAsFactors = FALSE
) -> dat
# seq_len() avoids the 1:0 trap when dat has fewer than two rows
for (idx in seq_len(max(nrow(dat) - 1, 0))) {
  stri_split_boundaries(                        # split the strings
    stri_trans_tolower(dat$V1[idx:(idx+1)]),    # turn elements lower case for easier comparison
    type = "word",                              # split into words
    skip_word_none = TRUE                       # ignore whitespace
  ) -> words
  if (sum(words[[1]] %in% words[[2]]) >= 3) {   # compare the word sets
    dat[idx+1, "V1"] <- dat[idx, "V1"]
  }
}
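As a rough sketch of how this kind of loop behaves on the five-row sample from the other answer, the version below wraps the same idea in a function using only base R. The name `collapse_similar` and the `min_shared` argument are made up for this illustration, and `strsplit()` on single spaces is a simplification of `stri_split_boundaries()`, so treat it as an approximation rather than an exact equivalent. Because each row is compared with the already-overwritten row before it, row 4 is measured against "My name is Bobby" rather than "My name is Boy".

```r
# Hypothetical helper: collapse_similar() and min_shared are illustrative names.
collapse_similar <- function(x, min_shared = 3L) {
  # max(..., 0L) keeps seq_len() valid for vectors with 0 or 1 elements.
  for (idx in seq_len(max(length(x) - 1L, 0L))) {
    # Lower-case the current pair and split on single spaces (simplified tokenizer).
    words <- strsplit(tolower(x[idx:(idx + 1L)]), " ", fixed = TRUE)
    # Count the words of the first sentence that also occur in the second.
    if (sum(words[[1]] %in% words[[2]]) >= min_shared) {
      x[idx + 1L] <- x[idx]
    }
  }
  x
}

sample_rows <- c("My name is Bobby",
                 "My name is Boby",
                 "My name is Boy",
                 "This is your house name Boy",
                 "R is awesome")

result <- collapse_similar(sample_rows)
result
# [1] "My name is Bobby"            "My name is Bobby"
# [3] "My name is Bobby"            "This is your house name Boy"
# [5] "R is awesome"
```

On this sample, rows 2 and 3 collapse into row 1, while row 4 shares only two words ("is" and "name") with "My name is Bobby" and is left unchanged.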