How do I turn a character vector into variable names and apply str_count?

Time: 2019-09-01 01:14:19

Tags: r tidyverse stringr

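I'm trying to turn a character vector of terms into variables, using a function that runs str_count on a data frame of text, and I'm not sure how to do it.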

Given a vector like this:

variablenames <- c("strong","weak","happy","sad")

and a text data frame such as:

library(tidyverse)
textdf <- as.data.frame("Happy was a dwarf who was perpetually sad.") %>% rename(text = 1)

I think I want something like this:

countstring_fn <- function(variablenames,textdf){
  for(term in variablenames){
    paste0(term,"count") <- str_count(term,textdf)
  }
}

But I'm pretty sure that won't work. The expected output is the text data frame with one added count column per term (strongcount, weakcount, happycount, sadcount).

Has anyone done something like this and got it to work?

4 Answers:

Answer 0 (score: 2)

Here is another way.

library(tidyverse)
variablenames <- c("strong", "weak", "happy", "sad")
textdf <- tibble(
  text = c(
    '"Happy was a dwarf who was perpetually sad."',
    '"If you\'re strong, you\'re not weak."'
  )
)
textdf[, str_c(variablenames, 'count')] <- do.call(
  rbind, 
  lapply(
    textdf$text, 
    function(df) { 
      str_count(toupper(df), toupper(variablenames)) 
    }
  )
)
invisible(
  apply(
    textdf, 
    1, 
    function(vec) {
      cat(str_c(str_c(vec, collapse = ','), '\n'))
    }
  )
)

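The main difference here is that the strings in the textdf data frame are wrapped in double quotes (if you are importing the data from a .csv, you can get the same effect by calling str_c('"', textdf$text, '"')). We then convert all of the text and the patterns to upper case so that every match is found regardless of case. Finally, we call str_count() to get an integer vector of counts for each row, and assign the counts to specific columns by supplying the desired column names.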
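As a minimal sketch of the step described above, the block below runs str_count() on a single string against all four patterns at once. The names `text`, `terms`, and `counts` are illustrative only and do not appear in the answer.

```r
library(stringr)

# One input string, several patterns: str_count() is vectorised over
# `pattern`, so it returns one count per term.
text  <- "Happy was a dwarf who was perpetually sad."
terms <- c("strong", "weak", "happy", "sad")

# Upper-casing both sides makes the match case-insensitive.
counts <- str_count(toupper(text), toupper(terms))

# Label each count with the column name used in the answer.
names(counts) <- str_c(terms, "count")
print(counts)
# strongcount   weakcount  happycount    sadcount
#           0           0           1           1
```

Binding one such vector per row with rbind() gives the matrix that the answer assigns to the new columns.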

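The anonymous function passed to apply() then prints each row of the data frame to the console (the answer prefers this to an explicit for loop, although apply() still iterates over the rows internally):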

"Happy was a dwarf who was perpetually sad.",0,0,1,1
"If you're strong, you're not weak.",1,1,0,0

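We first use str_c() with its collapse argument. In other words, we join the strings from all five columns of a row into a single string, using , as the separator. Then, for cat(), we use str_c() again to append a newline (\n) to the end of each "row string". Finally, we call cat() to display the strings in the console so that special characters such as " are shown without the escape character (\). The whole call is wrapped in invisible() to suppress the NULL that would otherwise be printed after the output in an interactive session.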

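Answer 1 (score: 1)

We can convert text to lower case, count the occurrences of each of the variablenames in each text, and return the counts as a comma-separated string. We add word boundaries (\\b) around each of the variablenames so that "sad" is not matched inside a longer word such as "sadness". Then we can use separate to split the counts into different columns.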

library(tidyverse)

textdf %>%
   mutate(count = map_chr(tolower(text), function(x) 
    toString(map_int(paste0("\\b",variablenames,"\\b"), ~str_count(x, .x))))) %>%
  separate(count, into = paste0(variablenames, "_count"), sep = ",", convert = TRUE)

#                                        text strong_count weak_count happy_count sad_count
#1 Happy was a dwarf who was perpetually sad.            0          0           1         1
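
To illustrate why the word boundaries matter, here is a small sketch comparing a plain pattern with a bounded one. The sample string `x` is made up for this illustration and is not part of the answer.

```r
library(stringr)

# Hypothetical sample text containing "sad" both alone and inside "sadness".
x <- "sadness made him sad"

# Without boundaries the pattern also matches inside "sadness".
plain_count <- str_count(x, "sad")
print(plain_count)
# 2

# With \\b on both sides only the standalone word is counted.
bounded_count <- str_count(x, "\\bsad\\b")
print(bounded_count)
# 1
```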

Answer 2 (score: 1)

# added second row to show output of function

textdf <- structure(list(text = c("Happy was a dwarf who was perpetually sad.",
"Sad was a dwarf who was perpetually sad.")), row.names = c(NA,
-2L), class = "data.frame")

# counting the occurrences of words in 'variablenames'

pmap_df(
  textdf, function(text) {
    map(variablenames, ~ str_count(tolower(text), pattern = .)) %>%
    t %>% as.data.frame
  }
) %>%
  setNames(variablenames) %>%
  bind_cols(textdf, .)

# Leaves you with a data frame with counts for each word as columns.

                                        text strong weak happy sad
1 Happy was a dwarf who was perpetually sad.      0    0     1   1
2   Sad was a dwarf who was perpetually sad.      0    0     0   2
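
For comparison, the same counts can be produced with a plain loop that stays close to the attempt in the question. This is only a sketch under a few assumptions: it keeps the question's function name and "count" suffix, lower-cases the text as this answer does, and treats each term as a literal string via fixed(). Two things differ from the original attempt: paste0(term, "count") is not a valid assignment target in R, so the column is created with [[<- instead, and str_count() takes the string first and the pattern second.

```r
library(stringr)

variablenames <- c("strong", "weak", "happy", "sad")
textdf <- data.frame(
  text = c("Happy was a dwarf who was perpetually sad.",
           "Sad was a dwarf who was perpetually sad."),
  stringsAsFactors = FALSE
)

countstring_fn <- function(variablenames, textdf) {
  for (term in variablenames) {
    # Create (or overwrite) a column named e.g. "sadcount".
    # fixed() is an assumption: terms are matched literally, not as regex.
    textdf[[paste0(term, "count")]] <-
      str_count(tolower(textdf$text), fixed(term))
  }
  textdf  # return the modified copy; the caller's data frame is unchanged
}

result <- countstring_fn(variablenames, textdf)
print(result[, -1])
#   strongcount weakcount happycount sadcount
# 1           0         0          1        1
# 2           0         0          0        2
```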


Answer 3 (score: 1)

Another way:

library(tidyverse)

t(sapply(dat$strgs, str_count, pattern = coll(patts, T, 'en'))) %>%
  data.frame %>%
  set_names(., patts) %>%
  bind_cols(dat, .)

#   strgs                                strength ignorance present future collapse
# 1 War Is Peace, Freedom Is Slavery...  1        1         0       0      0
# 2 Who controls the past controls t...  0        0         1       1      0
# 3 The collapse of the USSR was the...  0        0         0       0      1

Data:

patts <- c("strength", "ignorance", "present", "future", "collapse")

dat <- data.frame(
  strgs = c(
    "War Is Peace, Freedom Is Slavery, and Ignorance Is Strength.",
    "Who controls the past controls the future: who controls the present controls the past.",
    "The collapse of the USSR was the greatest geopolitical catastrophe of the century."
  )
)
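
The coll(patts, T, 'en') call in this answer passes ignore_case = TRUE and locale = 'en' by position. A small sketch of what that changes, using the first string from the data above (the variable names `s`, `case_sensitive`, and `case_insensitive` are illustrative only):

```r
library(stringr)

s <- "War Is Peace, Freedom Is Slavery, and Ignorance Is Strength."

# A plain pattern is a case-sensitive regex, so "strength" misses "Strength".
case_sensitive <- str_count(s, "strength")
print(case_sensitive)
# 0

# coll() with ignore_case = TRUE compares using locale-aware collation rules.
case_insensitive <- str_count(s, coll("strength", ignore_case = TRUE, locale = "en"))
print(case_insensitive)
# 1
```

This achieves case-insensitive matching without converting the text with tolower() or toupper() first.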