Question

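I want R to take the words in a column of a dataset and return a value from a website for each one. The code I have so far is below. For each word in the data frame column, it should go to the website and return the pronunciation (for example, the pronunciation at http://www.speech.cs.cmu.edu/cgi-bin/cmudict?in=word&stress=-s is "W ER1 D"). I looked at the site's HTML, and it is not clear to me what I need to enter to return this value: it sits between <tt> and </tt>, but there are many of those. I am also not sure how to bring the value into R. Thanks.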

library(xml2)

for (word in df$word) {
  result <- read_html(paste0("http://www.speech.cs.cmu.edu/cgi-bin/cmudict?in=", word, "&stress=-s"))
}

Answer 1

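Parsing HTML in R is a tricky task, but there are a few ways to do it. If the HTML converts cleanly to XML and the website/API always returns the same structure, you can use XML parsing tools. Otherwise, you can use regular expressions and call stringr::str_extract() on the HTML.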
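As a rough illustration of the regular-expression route, here is a minimal sketch that runs offline. It uses base R's regmatches() and gregexpr() as a stand-in for stringr::str_extract_all(), and the HTML string is a hypothetical fragment, not the site's actual markup.

```r
# Hypothetical HTML fragment standing in for the page returned by the site;
# the real markup may differ.
html <- "<html><body><div><tt>WORD</tt> <tt>W ER1 D .</tt></div></body></html>"

# Find every <tt>...</tt> span (non-greedy, so each tag pair is matched separately).
matches <- regmatches(html, gregexpr("<tt>.*?</tt>", html, perl = TRUE))[[1]]

# Strip the tags and surrounding whitespace, keeping only the text.
tt_text <- trimws(gsub("</?tt>", "", matches))

# Take the second <tt> value.
pronunciation <- tt_text[2]
print(pronunciation)
```

Regular expressions are brittle against markup changes (attributes on the tag, line breaks inside it), so this is probably best kept as a fallback.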

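In your case, it is fairly easy to get the value you are looking for with XML tools. There are indeed many <tt> tags, but the one you want is always the second instance, so you can pull that one out.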

#load packages. dplyr is just to use the pipe %>% function
library(httr)
library(XML)
library(dplyr)

#test words
wordlist = c('happy', 'sad')

for (word in wordlist){
  #build the url and GET the result
  url <- paste0("http://www.speech.cs.cmu.edu/cgi-bin/cmudict?in=",word,"&stress=-s")
  h <- handle(url)
  res <- GET(handle = h)

  #parse the HTML
  resXML <- htmlParse(content(res, as = "text"))

  #retrieve second <tt>
  print(getNodeSet(resXML, '//tt[2]') %>% sapply(., xmlValue))
  #don't abuse your API
  Sys.sleep(0.1)
}

>[1] "HH AE1 P IY0 ."
>[1] "S AE1 D ."
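Since the question loads xml2, the same second-<tt> lookup can be sketched with that package. This is an offline sketch on a hypothetical HTML fragment, so the real page may be structured differently. Note that it indexes the second <tt> in document order, whereas the XPath '//tt[2]' selects any <tt> that is the second <tt> child of its parent. The two coincide for this fragment but need not on other pages.

```r
library(xml2)

# Hypothetical fragment standing in for the site's response; real markup may differ.
html <- "<html><body><div><tt>WORD</tt> <tt>W ER1 D .</tt></div></body></html>"

# read_html() accepts a literal HTML string as well as a URL or file path.
doc <- read_html(html)

# Collect the text of every <tt> node in document order.
tt_text <- xml_text(xml_find_all(doc, "//tt"))

# Index the second one and trim stray whitespace.
pronunciation <- trimws(tt_text[2])
print(pronunciation)
```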

Good luck!

Edit: this code returns a data frame:

#load packages. dplyr is just to use the pipe %>% function
library(httr)
library(XML)
library(dplyr)

#test words
wordlist = c('happy', 'sad')

#initialize the dataframe with pronunciation field
pronunciation_list <- data.frame(pronunciation=character(),stringsAsFactors = F)

#loop over the words
for (word in wordlist){
  #build the url and GET the result
  url <- paste0("http://www.speech.cs.cmu.edu/cgi-bin/cmudict?in=",word,"&stress=-s")
  h <- handle(url)
  res <- GET(handle = h)

  #parse the HTML
  resXML <- htmlParse(content(res, as = "text"))

  #retrieve second <tt>
  to_add <- data.frame(pronunciation=(getNodeSet(resXML, '//tt[2]') %>% sapply(., xmlValue)))

  #bind the data
  pronunciation_list<- rbind(pronunciation_list, to_add)

  #don't abuse your API
  Sys.sleep(0.1)
}
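
As a possible variation on the loop above, the data frame can be built in one step with vapply() instead of growing it with rbind(), which also keeps each word next to its pronunciation. In this sketch, lookup_pronunciation() is a hypothetical stand-in for the request-and-parse step; it returns canned values (taken from the output shown earlier) so the example runs offline.

```r
# Hypothetical stand-in for the GET + htmlParse + XPath step above.
# It returns canned values so the sketch runs without network access.
lookup_pronunciation <- function(word) {
  canned <- c(happy = "HH AE1 P IY0 .", sad = "S AE1 D .")
  unname(canned[word])
}

wordlist <- c("happy", "sad")

# vapply() enforces that each lookup returns exactly one character value.
pron <- vapply(wordlist, lookup_pronunciation, character(1), USE.NAMES = FALSE)

# Build the data frame once, keeping the word alongside its pronunciation.
pronunciation_list <- data.frame(
  word = wordlist,
  pronunciation = pron,
  stringsAsFactors = FALSE
)
print(pronunciation_list)
```

In real use, the body of lookup_pronunciation() would do the HTTP request and parsing, with Sys.sleep() kept between requests.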
