Find all unique strings in R

Asked: 2017-07-26 19:49:29

Tags: r string dataframe unique text-mining

I'm new to R. I have a data frame df that looks like this (a single character variable; my actual df spans 100k+ rows, but for simplicity let's look at just 5 rows):

V1
oximetry, hydrogen peroxide adverse effects, epoprostenol adverse effects
angioedema chemically induced, angioedema chemically induced, oximetry
abo blood group system, imipramine poisoning, adverse effects
isoenzymes, myocardial infarction drug therapy, thrombosis drug therapy
thrombosis drug therapy

I'd like to output each unique string, so that the result looks like this:

V1
oximetry
hydrogen peroxide adverse effects
epoprostenol adverse effects
angioedema chemically induced
abo blood group system
imipramine poisoning
adverse effects
isoenzymes
myocardial infarction drug therapy
thrombosis drug therapy

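Should I use the tm package? I tried building a document-term matrix (dtm), but my code is inefficient because it converts the dtm to a matrix, which takes a lot of memory with 100k+ rows.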

Please advise. Thanks!

2 answers:

Answer 0 (score: 3)

Try this:

library(stringr)
library(tidyverse)

df <- data.frame(variable = c(
'oximetry, hydrogen peroxide adverse effects, epoprostenol adverse effects',
'angioedema chemically induced, angioedema chemically induced, oximetry',
'abo blood group system, imipramine poisoning, adverse effects',
'isoenzymes, myocardial infarction drug therapy, thrombosis drug therapy',
'thrombosis drug therapy'), stringsAsFactors=FALSE)

# Split each row into a list of terms, expand to one term per row, then drop duplicates
df %>%
  mutate(variable = str_split(variable, ', ')) %>%
  unnest(variable) %>%
  distinct()

Answer 1 (score: 1)

Using only base R, you can use strsplit() to split the large string at each "comma + space" or "\n", then use unique() to return only the distinct strings:

text_vec <- c("oximetry, hydrogen peroxide adverse effects, epoprostenol adverse effects
angioedema chemically induced, angioedema chemically induced, oximetry
abo blood group system, imipramine poisoning, adverse effects
isoenzymes, myocardial infarction drug therapy, thrombosis drug therapy
thrombosis drug therapy")

strsplit(text_vec, ", |\\n")[[1]]
# [1] "oximetry"                           "hydrogen peroxide adverse effects" 
# [3] "epoprostenol adverse effects"       "angioedema chemically induced"     
# [5] "angioedema chemically induced"      "oximetry"                          
# [7] "abo blood group system"             "imipramine poisoning"              
# [9] "adverse effects"                    "isoenzymes"                        
# [11] "myocardial infarction drug therapy" "thrombosis drug therapy"           
# [13] "thrombosis drug therapy"   

unique(strsplit(text_vec, ", |\\n")[[1]])
# [1] "oximetry"                           "hydrogen peroxide adverse effects" 
# [3] "epoprostenol adverse effects"       "angioedema chemically induced"     
# [5] "abo blood group system"             "imipramine poisoning"              
# [7] "adverse effects"                    "isoenzymes"                        
# [9] "myocardial infarction drug therapy" "thrombosis drug therapy"
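
The answer above works on one long string, while the question starts from a data frame column. As a minimal sketch of the same base R idea applied to a column, assuming the column is named V1 as in the question (the names terms, unique_terms and result are only illustrative), something like the following may work:

```r
# Example data frame mirroring the question's layout (column name V1 assumed)
df <- data.frame(V1 = c(
  "oximetry, hydrogen peroxide adverse effects, epoprostenol adverse effects",
  "angioedema chemically induced, angioedema chemically induced, oximetry",
  "abo blood group system, imipramine poisoning, adverse effects",
  "isoenzymes, myocardial infarction drug therapy, thrombosis drug therapy",
  "thrombosis drug therapy"), stringsAsFactors = FALSE)

# Split every row on literal commas and flatten the list into one character vector
terms <- unlist(strsplit(df$V1, ",", fixed = TRUE))

# Trim surrounding whitespace so " oximetry" and "oximetry" compare equal,
# then keep the first occurrence of each term
unique_terms <- unique(trimws(terms))

# Wrap the result in a one-column data frame, matching the desired output shape
result <- data.frame(V1 = unique_terms, stringsAsFactors = FALSE)
print(result)
```

Because strsplit() is vectorised over the rows and no document-term matrix is built, memory use should grow with the number of terms rather than with a rows-by-terms matrix. Whether that is fast enough for 100k+ rows depends on the data and has not been measured here.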