I am using the R package udpipe to extract keywords from my data frame. Let's start with some data included in the package:
library(udpipe)
data(brussels_reviews)
Looking at its structure, we see that it contains 1500 comments (rows) and 4 columns.
str(brussels_reviews)
'data.frame': 1500 obs. of 4 variables:
$ id : int 32198807 12919832 23786310 20048068 17571798 28394425 46322841 27719650 14512388 37675819 ...
$ listing_id: int 1291276 1274584 1991750 2576349 1866754 5247223 7925019 4442255 2863621 3117760 ...
$ feedback : chr "Gwen fue una magnifica anfitriona. El motivo de mi viaje a Bruselas era la busqueda de un apartamento y Gwen me"| __truncated__ "Aurelie fue muy atenta y comunicativa. Nos dio mapas, concejos turisticos y de transporte para disfrutar Brusel"| __truncated__ "La estancia fue muy agradable. Gabriel es muy atento y esta dispuesto a ayudar en todo lo que necesites. La cas"| __truncated__ "Excelente espacio, excelente anfitriona, un lugar accessible economicamente y cerca de los lugares turisticos s"| __truncated__ ...
$ language : chr "es" "es" "es" "es" ...
Following this tutorial, I can extract keywords for the whole data frame at once. Great.
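However, what I need is to extract the keywords of each row, not of the whole data frame.
I admit that this does not make much sense in this example, because only one column contains text (feedback). In my real data, however, I have many columns with text.
So I want to extract keywords for each row of the data frame. Applied to this example, I would expect 1500 sets of keywords, one set per row.
How can I do that?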
Following the two steps below, we get keywords for the whole data frame. What I want instead is the keywords for each row of the data frame.
library(udpipe)
library(textrank)
## First step: Take the Spanish udpipe model and annotate the text. Note: this takes about 3 minutes
data(brussels_reviews)
comments <- subset(brussels_reviews, language %in% "es")
ud_model <- udpipe_download_model(language = "spanish")
ud_model <- udpipe_load_model(ud_model$file_model)
x <- udpipe_annotate(ud_model, x = comments$feedback)
x <- as.data.frame(x)
## Collocation (words following one another)
stats <- keywords_collocation(x = x,
                              term = "token", group = c("doc_id", "paragraph_id", "sentence_id"),
                              ngram_max = 4)
## Co-occurrences: How frequent do words occur in the same sentence, in this case only nouns or adjectives
stats <- cooccurrence(x = subset(x, upos %in% c("NOUN", "ADJ")),
                      term = "lemma", group = c("doc_id", "paragraph_id", "sentence_id"))
## Co-occurrences: How frequent do words follow one another
stats <- cooccurrence(x = x$lemma,
                      relevant = x$upos %in% c("NOUN", "ADJ"))
## Co-occurrences: How frequent do words follow one another even if we would skip 2 words in between
stats <- cooccurrence(x = x$lemma,
                      relevant = x$upos %in% c("NOUN", "ADJ"), skipgram = 2)
Answer 0 (score: 0)
A simple for loop:
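# somefunction() is a placeholder for whatever keyword extractor you use
result <- vector("list", nrow(brussels_reviews))
for (i in seq_len(nrow(brussels_reviews))) {
  # column 3 is feedback
  result[[i]] <- somefunction(brussels_reviews[i, 3])
}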
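The code above is a generic way to iterate over all rows of brussels_reviews, apply a function to the third column, and store the results. It can also be extended with a nested loop over the columns (see below).
If you tell us exactly which function you are using, we will be happy to help further.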
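library(stringr)

k <- 1
result <- list()
for (i in seq_len(nrow(df))) {
  for (j in seq_len(ncol(df))) {
    # str_extract_all() returns a list of length 1, so keep its first element
    result[[k]] <- str_extract_all(df[i, j], "[A-Z]")[[1]]
    k <- k + 1
  }
}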
> head(result)
[[1]]
[1] "P" "W" "Y" "V" "L" "X" "Y" "E" "E" "V" "T" "X" "O" "O" "Y" "A" "W" "P"
[[2]]
[1] "Q" "J" "O" "J" "P" "S"
[[3]]
[1] "M" "E" "S" "I" "A" "Y" "J" "U" "M" "V" "W" "A" "P" "U" "I" "A" "X" "K"
[[4]]
[1] "T" "R" "H" "I" "S" "I"
[[5]]
[1] "N" "T" "L" "H" "U" "G" "B" "Z" "H" "U" "Y" "O" "W" "L" "F" "P" "O" "O"
[[6]]
[1] "S" "S" "L" "M" "T" "R"
The data frame df used above was generated as follows:
# Function by A5C1D2H2I1M1N2O1R2T1
# https://stackoverflow.com/a/42734863/9406040
rstrings <- function(n = 5000) {
  a <- do.call(paste0, replicate(5, sample(LETTERS, n, TRUE), FALSE))
  paste0(a, sprintf("%04d", sample(9999, n, TRUE)), sample(LETTERS, n, TRUE))
}
df <- data.frame(a = paste(rstrings(100), rstrings(100),
                           rstrings(100)),
                 b = rstrings(100))
> head(df)
                                 a          b
1 PWYVL8045X YEEVT9271X OOYAW3194P QJOJP3673S
2 MESIA1348Y JUMVW0263A PUIAX6901K TRHIS9952I
3 NTLHU1254G BZHUY6075O WLFPO4360O SSLMT4848R
4 XIWRV0967X ERMLU3214U TNRSO3996F IJPTV3142Z
5 ESEKQ7976U RDDDK5322V ZZEJC7637W IBAJI6831N
6 PVDBQ3212K ZXDYV5256Z RVTPH3724W HTYYK5351R
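The nested loop above produces one result per cell. If one set per row is wanted instead, a possible sketch is to paste all text columns of a row into a single string and extract once per row. The toy data frame and the `extract_fun` helper below are hypothetical stand-ins for the real data and the real keyword extractor; only base R is used.

```r
# Toy data frame with two text columns (hypothetical stand-in for real data)
df <- data.frame(a = c("PWYVL8045X YEEVT9271X", "MESIA1348Y"),
                 b = c("QJOJP3673S", "TRHIS9952I"),
                 stringsAsFactors = FALSE)

# Hypothetical placeholder for the real keyword extractor:
# it simply returns every capital letter found in a string
extract_fun <- function(txt) regmatches(txt, gregexpr("[A-Z]", txt))[[1]]

# Paste all columns of each row into one string, then extract once per row
row_text <- do.call(paste, c(df, sep = " "))
result <- lapply(row_text, extract_fun)

# One list element per row of df
length(result) == nrow(df)
```

Swapping `extract_fun` for another function that takes a single string keeps the rest of the sketch unchanged.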
Answer 1 (score: 0)
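I ran into a problem similar to the one you describe. The code below may be useful.
If you are using the keywords_phrases function from the same package, you can do something similar with the txt_recode_ngram function.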
library(data.table)
library(dplyr)
library(magrittr)
library(udpipe)
data("brussels_reviews_anno")
x <- brussels_reviews_anno
x <- as.data.table(x)
x <- subset(x, xpos %in% c("NN", "VB", "IN", "JJ"))
x <- x[, cooccurrence(lemma, order = FALSE), by = list(doc_id)]
x <- x %>%
  group_by(doc_id) %>%
  mutate(keywords = paste(term1, term2)) %>%
  summarize(keywords = paste(keywords, collapse = ", "))
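The same per-document idea can probably be written without data.table or dplyr, and the result attached back to the original rows. This is only a sketch: it assumes that doc_id in brussels_reviews_anno corresponds to id in brussels_reviews, the tag filter (NN and JJ) is chosen purely for illustration, and the names `anno`, `per_doc` and `out` are made up here.

```r
library(udpipe)
data(brussels_reviews)
data(brussels_reviews_anno)

# Keep nouns and adjectives only (illustrative choice of tags)
anno <- subset(brussels_reviews_anno, xpos %in% c("NN", "JJ"))

# One cooccurrence table per review, collapsed into a single string;
# as.character() avoids empty groups if doc_id happens to be a factor
per_doc <- lapply(split(anno, as.character(anno$doc_id)), function(d) {
  co <- cooccurrence(d$lemma, order = FALSE)
  paste(paste(co$term1, co$term2), collapse = ", ")
})

# Attach the keywords to the original rows (assumes doc_id matches id);
# reviews without any annotated noun or adjective get NA
out <- brussels_reviews
out$keywords <- unlist(per_doc)[match(as.character(out$id), names(per_doc))]
```

Using match() rather than merge() keeps the original row order, so `out` has exactly one row per review with a keywords column next to the text.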