Subsetting a data frame by keywords

Time: 2018-11-20 15:15:14

Tags: r dataframe text subset keyword

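I have a data frame of Twitter data (ID number, folder_count, clean_text). I would like to split the data frame into two subsets: one in which the keywords are present and one in which they are absent.

For example, I have the keywords stored as a value: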

library(magrittr)  # provides the %>% pipe used at the end of this statement

KeyWords <- c("abandon*", "abuse*", "agitat*" ,"attack*", "bad", "brutal*",
                       "care", "caring", "cheat*", "compassion*", "cruel*", "damag*",
                       "damn*", "destroy*", "devil*", "devot*", "disgust*", "envy*",
                       "evil*", "faith*","fault*", "fight*", "forbid*", "good", "goodness",
                       "greed*", "gross*", "hate", "heaven*", "hell", "hero*", "honest*",
                       "honor*", "hurt*","ideal*", "immoral*", "kill*",  "liar*","loyal*",
                       "murder*", "offend*", "pain", "peace*","protest", "punish*","rebel*",
                       "respect", "revenge*", "ruin*", "safe*", "save", "secur*", "shame*",
                       "sin", "sinister", "sins", "slut*", "spite*", "steal*", "victim*",
                       "vile", "virtue*", "war", "warring", "wars", "whore*", "wicked*",
                       "wrong*", "benefit*", "harm*", "suffer*","value*") %>% paste0(collapse="|")

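I have already made a subset (Data2) of the original data frame (Data1), where Data2 contains only the observations from Data1 in which one or more of the keywords appear in the clean_text column. Like this: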

Data2 <- Data1[with(Data1, grepl(paste0("\\b(?:",paste(KeyWords, collapse="|"),")\\b"), clean_text)),]

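Now I would like a subset Data3 containing only the observations from Data1 in which the keywords do not appear in clean_text. Is it possible to do the inverse of the keyword subsetting above? Or can I subtract Data2 from Data1 to get the new subset Data3?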

1 answer:

Answer 0 (score: 1)

The "inverse" operator in R is !, which flips TRUE to FALSE and vice versa. So, for your example, what you are looking for is:

Data3 <- Data1[!with(Data1, grepl(paste0("\\b(?:",paste(KeyWords, collapse="|"),")\\b"), clean_text)),]
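
To make the negation concrete, here is a minimal sketch on a toy data frame. The four rows and the shortened keyword list are made up for illustration, and perl = TRUE is a choice made for this sketch (the call in the question omits it). The last line also shows one possible way to "subtract" Data2 from Data1, which assumes the ID column is unique per row.

```r
# Toy stand-in for Data1; the rows are invented for illustration only.
Data1 <- data.frame(
  ID = 1:4,
  clean_text = c("that was a bad day", "lunch with friends",
                 "they went to war", "badminton practice"),
  stringsAsFactors = FALSE
)

# A shortened, hypothetical keyword list, wrapped in word boundaries
# the same way as in the question.
pattern <- paste0("\\b(?:", paste(c("bad", "war", "hate"), collapse = "|"), ")\\b")

# Compute the logical mask once and reuse it for both subsets.
# perl = TRUE is an assumption of this sketch, not part of the original call.
has_keyword <- grepl(pattern, Data1$clean_text, perl = TRUE)

Data2 <- Data1[has_keyword, ]   # rows containing at least one keyword
Data3 <- Data1[!has_keyword, ]  # rows containing none (the negated mask)

# Alternative: drop every row whose ID already appears in Data2.
# This assumes ID uniquely identifies a row.
Data3_alt <- Data1[!(Data1$ID %in% Data2$ID), ]
```

In this toy example Data2 keeps IDs 1 and 3 and Data3 keeps IDs 2 and 4; "badminton" is not matched because of the \\b word boundaries. Storing the mask in a variable avoids running grepl twice and guarantees that the two subsets are exact complements.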