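I am trying to turn a character vector of terms into variables by running a str_count function over a data frame of text, and I'm not sure how to do it.
Given a vector like this: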
variablenames <- c("strong","weak","happy","sad")
and a text data frame, for example:
library(tidyverse)
textdf <- as.data.frame("Happy was a dwarf who was perpetually sad.") %>% rename(text = 1)
I think I want something like this:
countstring_fn <- function(variablenames,textdf){
  for(term in variablenames){
    paste0(term,"count") <- str_count(term,textdf)
  }
}
But I'm pretty sure that won't work. The expected output is a data frame with one count column per term.
Has anyone done something like this and gotten it to work?
Answer 0 (score: 2)
Here is another way.
library(tidyverse)
variablenames <- c("strong", "weak", "happy", "sad")
textdf <- tibble(
  text = c(
    '"Happy was a dwarf who was perpetually sad."',
    '"If you\'re strong, you\'re not weak."'
  )
)
textdf[, str_c(variablenames, 'count')] <- do.call(
  rbind,
  lapply(
    textdf$text,
    function(df) {
      str_count(toupper(df), toupper(variablenames))
    }
  )
)
invisible(
  apply(
    textdf,
    1,
    function(vec) {
      cat(str_c(str_c(vec, collapse = ','), '\n'))
    }
  )
)
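The main difference here is that the strings in the textdf data frame are wrapped in double quotes (if you are importing the data from a .csv, you can simply call str_c('"', textdf$text, '"') to get the same effect). We then convert all text and patterns to uppercase so that matches are found regardless of case. Finally, we call str_count() to get an integer vector of counts, which can be assigned to specific columns by defining the desired column names.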
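As a small illustration of the quoting step mentioned above, here is a minimal sketch. The vector name raw_text is hypothetical and stands in for text that arrived without surrounding quotes, for instance from a .csv import.

```r
library(stringr)

# Hypothetical unquoted text, as it might arrive from a .csv import.
raw_text <- c("Happy was a dwarf who was perpetually sad.")

# Wrap each string in literal double-quote characters.
quoted <- str_c('"', raw_text, '"')

# cat() shows the quotes as they are, without backslash escapes.
cat(quoted, "\n")
```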
The printing step then uses apply() to print each row of the data frame to the console, in place of an explicit for loop:
"Happy was a dwarf who was perpetually sad.",0,0,1,1
"If you're strong, you're not weak.",1,1,0,0
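We first use str_c() to collapse each row. In other words, we join the strings in all five columns of a row into a single string, with , as the separator. Then, for cat(), we need to use str_c() again to append a newline (\n) to the end of each "row string". Finally, we call cat() to display strings that contain special characters such as " in the console without the escape character (\). The apply() call that contains cat() is wrapped in invisible() to suppress the NULL that would otherwise be printed after the output in interactive use.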
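The counting steps described in this answer (uppercase both sides, count every term in every text, attach the counts as named columns) can also be sketched on their own, without the printing step. This is a minimal sketch and not the answer's own code: it assumes unquoted input text, builds the count matrix with vapply() in place of do.call(rbind, lapply(...)), and uses a plain data.frame. The names texts, counts, and result are chosen for illustration.

```r
library(stringr)

variablenames <- c("strong", "weak", "happy", "sad")
texts <- c(
  "Happy was a dwarf who was perpetually sad.",
  "If you're strong, you're not weak."
)

# For each text, count every term after uppercasing both sides.
# vapply() returns a terms-by-texts matrix, so transpose it to get
# one row per text and one column per term.
counts <- t(vapply(
  texts,
  function(txt) str_count(toupper(txt), toupper(variablenames)),
  integer(length(variablenames)),
  USE.NAMES = FALSE
))

# Name the columns after the terms, e.g. "strongcount".
colnames(counts) <- str_c(variablenames, "count")

# Attach the counts to the text column.
result <- data.frame(text = texts, counts)
print(result)
```

Building the whole matrix first and naming its columns avoids trying to assign to paste0(term, "count") inside a loop, which R does not allow.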
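Answer 1 (score: 1)
We can convert text to lowercase, count the occurrences of each of the variablenames in each text, and return the counts as a comma-separated string. We add word boundaries (\\b) around each of the variablenames so that a term such as "sad" is not matched inside a longer word. We can then use separate to split the counts into separate columns.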
library(tidyverse)
textdf %>%
  mutate(count = map_chr(tolower(text), function(x)
    toString(map_int(paste0("\\b",variablenames,"\\b"), ~str_count(x, .x))))) %>%
  separate(count, into = paste0(variablenames, "_count"), sep = ",", convert = TRUE)
# text strong_count weak_count happy_count sad_count
#1 Happy was a dwarf who was perpetually sad. 0 0 1 1
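To see what the word boundaries change, here is a minimal sketch. The sentence is hypothetical and was chosen so that "sad" also occurs inside a longer word; the names plain and bounded are for illustration only.

```r
library(stringr)

# Hypothetical sentence in which "sad" also appears inside "sadly".
x <- tolower("Sadly, the dwarf was sad.")

# Without boundaries, the "sad" inside "sadly" is counted as well.
plain <- str_count(x, "sad")

# With \\b on both sides, only the standalone word "sad" is counted.
bounded <- str_count(x, "\\bsad\\b")

cat("plain:", plain, " bounded:", bounded, "\n")
```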
Answer 2 (score: 1)
# added second row to show output of function
textdf <- structure(list(text = c("Happy was a dwarf who was perpetually sad.",
  "Sad was a dwarf who was perpetually sad.")), row.names = c(NA,
  -2L), class = "data.frame")
# counting the occurrences of words in 'variablenames'
pmap_df(
  textdf, function(text) {
    map(variablenames, ~ str_count(tolower(text), pattern = .)) %>%
      t %>% as.data.frame
  }
) %>%
  setNames(variablenames) %>%
  bind_cols(textdf, .)
# Leaves you with a data frame with counts for each word as columns.
text strong weak happy sad
1 Happy was a dwarf who was perpetually sad. 0 0 1 1
2 Sad was a dwarf who was perpetually sad. 0 0 0 2
Answer 3 (score: 1)
Another way:
library(tidyverse)
t(sapply(dat$strgs, str_count, pattern = coll(patts, ignore_case = TRUE, locale = 'en'))) %>%
  data.frame %>%
  set_names(., patts) %>%
  bind_cols(dat, .)
# strgs strength ignorance present future collapse
# 1 War Is Peace, Freedom Is Slavery... 1 1 0 0 0
# 2 Who controls the past controls t... 0 0 1 1 0
# 3 The collapse of the USSR was the... 0 0 0 0 1
Data:
patts <- c("strength", "ignorance", "present", "future", "collapse")
dat <- data.frame(
  strgs = c(
    "War Is Peace, Freedom Is Slavery, and Ignorance Is Strength.",
    "Who controls the past controls the future: who controls the present controls the past.",
    "The collapse of the USSR was the greatest geopolitical catastrophe of the century."
  )
)