For loop when extracting keywords with udpipe in R

Time: 2018-10-28 08:27:02

Tags: r for-loop keyword udpipe

Let's start with a reproducible example: a data frame called key with 8 columns and 3 rows:

key <- structure(c("Make Professional Maps with QGIS and Inkscape", 
"Gain the skills to produce original, professional, and aesthetically pleasing maps using free software", 
"English", "Inkscape 101 for Beginners - Design Vector Graphics", 
"Learn how to create and design vector graphics for free!", "English", 
"Design & Create Vector Graphics With Inkscape 2016", "The Beginners Guide to designing and creating Vector Graphics with Inkscape. No Experience needed!", 
"English", "Design a Logo for Free in Inkscape", "Learn from an award winning, published logo design professional!", 
"English", "Inkscape - Beginner to Pro", "If you want to have a decent learning curve, you are new to the program or even in design, this course is for you.", 
"English", "Creating 2D Textures in Inkscape", "A guide to creating colorful and interesting textures in inkscape.", 
"English", "Vector Art in Inkscape - Icon Design | Make Vector Graphics", 
"Learn Icon Design by creating Vector Graphics using the .SVG and PNG format with the Free Software Inkscape!", 
"English", "Inkscape and Bootstrap 3 -> Responsive Web Design!", 
"Design responsive websites using Free tools Inkscape and Bootstrap 3! Mood Boards and Style Tiles to Mobile First!", 
"English"), .Dim = c(3L, 8L), .Dimnames = list(c("Title", "Short_Description", 
"Language"), c("1", "2", "4", "5", "6", "9", "13", "15")))

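I want to extract keywords from each column separately. To do this, I am using the udpipe package in R.

Since I want to run the function on every column, I wrote a for loop.

Before starting, we create the model with English as the language (see this link for more info):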

library(udpipe)
ud_model <- udpipe_download_model(language = "english")
ud_model <- udpipe_load_model(ud_model$file_model)

Ideally, my final output would be a data frame with 8 columns and as many rows as there are extracted keywords.

I tried two approaches:

Approach 1: using dplyr

library(dplyr)
keywords <- list()
for(i in ncol(keywords_en_t)){
  keywords[[i]] <- keywords_en_t %>%
    udpipe_annotate(ud_model,s)
    as.data.frame()
}

Approach 2:

key <- list()
stats <- list()
for(i in ncol(keywords_en_t)){
    key[[i]] <- as.data.frame(udpipe_annotate(ud_model, x = keywords_en_t[,i]))
    stats[[i]] <- subset(key[[i]], upos %in% "NOUN")
    stats <- txt_freq(x = stats$lemma)
}

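Output

In both cases I either get errors or the output is not what I expected.

As mentioned, my expected output is a data frame with 8 columns in which each row holds a keyword.

Any ideas?

1 Answer:

Answer 0: (score: 1)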

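Unfortunately, your code contains quite a few errors. Your loops do not run from 1 to the number of columns: for(i in ncol(...)) iterates over a single value, so the body runs only once, with i equal to 8. Use 1:ncol(...) or seq_along instead. Your key data is a matrix, not a data.frame. You need to give udpipe_annotate a character vector. If you just pass key[, 8], you also pass the names attached to it (the row names of the matrix) to udpipe_annotate. This might generate keywords you do not need. In approach 1 you call udpipe_annotate(ud_model,s), but s is not defined. In approach 2 you assign to stats[[i]] and then immediately overwrite the whole stats object on the next line.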
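The loop-range and named-vector points can be checked without udpipe. The following is a minimal sketch on a small stand-in matrix; `m`, `visited_wrong`, `visited_right`, `col_named` and `col_plain` are illustrative names, not part of the original question.

```r
# Small stand-in matrix (hypothetical data, laid out like `key`)
m <- matrix(c("a", "b", "c", "d", "e", "f"), nrow = 3,
            dimnames = list(c("Title", "Short_Description", "Language"),
                            c("1", "2")))

# ncol(m) is the single number 2, so this loop body runs once, with i = 2
visited_wrong <- integer(0)
for (i in ncol(m)) visited_wrong <- c(visited_wrong, i)

# seq_len(ncol(m)) is 1, 2, so every column is visited
visited_right <- integer(0)
for (i in seq_len(ncol(m))) visited_right <- c(visited_right, i)

# Subsetting a matrix column keeps the row names as element names
col_named <- m[, 2]

# unname() drops them, leaving a plain character vector
col_plain <- unname(m[, 2])
```

seq_len(ncol(m)) behaves like 1:ncol(m) here, but it also yields an empty sequence when there are zero columns, which is why it is often preferred.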

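To correct some of these issues, I first convert the data into a data.frame. Next, I run the loop to build a list of vectors containing the keywords. After that, I turn the keywords into a data.frame. That last part of the code takes into account that the vectors have different lengths.

You might want to check how the data was obtained, because having 3 columns ("Title", "Short_Description", "Language") and many rows would be more logical.

Code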
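If the data cannot be re-read in that layout, one possible way to get there is to transpose the matrix. This is only a sketch on hypothetical two-course data; `m` and `courses` are illustrative names.

```r
# Hypothetical matrix laid out like `key`: fields in rows, courses in columns
m <- matrix(c("Course A", "About A", "English",
              "Course B", "About B", "English"),
            nrow = 3,
            dimnames = list(c("Title", "Short_Description", "Language"),
                            c("1", "2")))

# t() swaps rows and columns; as.data.frame() then gives one row per course
courses <- as.data.frame(t(m), stringsAsFactors = FALSE)
```

With this shape, a whole text column such as courses$Short_Description is already a character vector that could be passed as x to udpipe_annotate in a single call.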

# Transform key into a data.frame. Now it is a matrix. 
key <- as.data.frame(key, stringsAsFactors = FALSE)

library(udpipe)
# prevent downloading ud model if it already exists in the working directory
ud_model <- udpipe_download_model(language = "english", overwrite = FALSE)
ud_model <- udpipe_load_model(ud_model$file_model)

# prepare list with correct length
keywords <- vector(mode = "list", length = ncol(key))

for(i in 1:ncol(key)){
  temp <- as.data.frame(udpipe_annotate(ud_model, x = key[, i]))
  keywords[[i]] <- temp$lemma[temp$upos == "NOUN"]
}

#transform list of vectors to data.frame. 
# Use sapply because vectors are of different lengths.
keywords <- as.data.frame(sapply(keywords, '[', seq(max(lengths(keywords)))), stringsAsFactors = FALSE)

keywords

        V1        V2         V3     V4       V5       V6     V7      V8
1    skill beginners  beginners   logo learning       2d Design     web
2      map    design      guide  award    curve  Texture format  design
3 software    Vector experience   logo  program    guide   <NA>  design
4     <NA>  graphics       <NA> design   design  texture   <NA> website
5     <NA>    vector       <NA>   <NA>   course inkscape   <NA>    tool
6     <NA>   graphic       <NA>   <NA>     <NA>     <NA>   <NA>    <NA>
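
The padding step that produces the NA entries can be tried in isolation. Here is a minimal sketch on made-up keyword vectors; `kw`, `n_max`, `padded` and `result` are illustrative names.

```r
# Hypothetical keyword vectors of unequal length
kw <- list(c("skill", "map", "software"), c("logo"), c("web", "design"))

# The longest vector determines the number of rows
n_max <- max(lengths(kw))

# Indexing past the end of a vector yields NA, which pads the shorter ones
padded <- sapply(kw, `[`, seq_len(n_max))

# Each list element becomes one column (V1, V2, V3)
result <- as.data.frame(padded, stringsAsFactors = FALSE)
```

Note that sapply only returns a matrix here because every padded vector has the same length greater than one; if the longest vector had a single element, the result would be a plain vector and the conversion would give one column instead.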