我正在使用世界粮食计划署国家网站(http://www1.wfp.org/countries)瞄准网络抓取它,以便建立一个包含定期发布的新闻的数据集,而不会逐页点击每页。 此外,我会添加一些列,包括关键字计数。 撇开包含国家和网址的脚本部分,我会把重点放在抓取本身上。 然而,我正在使用一堆软件包。
vfork()
我为另一个网站准备了数据集,似乎效果很好。 这里的帮手为这件事提出了一个非常优雅的解决方案,我把它与我以前在国家部分的工作相结合,一切都运作良好。然而,该解决方案似乎不符合我目前的需要。 然而,我有这个:
library(rvest)
library(stringr)
library(tidyr)
library(data.table)
library(plyr)
library(xml2)
library(selectr)
library(tibble)
library(purrr)
library(datapasta)
library(jsonlite)
library(countrycode)
library(httr)
library(stringi)
library(tidyverse)
library(dplyr)
library(XML)
特别是,当我运行第14点时,如果是脚本,我会收到以下错误消息:
## 11. Creating a function in order to scrape data from a website (in this case, WFP's)
wfp_get_news <- function(iso3) { GET(
url = "http://www1.wfp.org/countries/common/allnews/en/",
query = list(iso3=iso3)
) -> res
warn_for_status(res)
if (status_code(res) > 399) return(NULL)
out <- content(res, as="text", encoding="UTF-8")
out <- jsonlite::fromJSON(out)
out$iso3 <- iso3
tbl_df(out)
}
## 12. Setting all the Country urls in order for them to be automatically scraped
pb <- progress_estimated(length(countrycode_data$iso3c[])) # THIS TAKES LONG TO BE PROCESSED
map_df(countrycode_data$iso3c[], ~{
pb$tick()$print()
Sys.sleep(5)
wfp_get_news(.x)
}) -> xdf
## 13. Setting keywords (of course, this process is arbitrary: one can decide any keywor s/he prefers)
keywords <- c("drought", "food security")
keyword_regex <- sprintf("(%s)", paste0(keywords, collapse="|"))
## 14. Setting the keywords search
bind_cols(
xdf,
stri_match_all_regex(tolower(xdf$bodytext), keyword_regex) %>%
map(~.x[,2]) %>%
map_df(~{
res <- table(.x, useNA="always")
nm <- names(res)
nm <- ifelse(is.na(nm), "NONE", stri_replace_all_regex(nm, "[ -]", "_"))
as.list(set_names(as.numeric(res), nm))
})
) %>%
select(-NONE) -> xdf_with_keyword_counts
预期结果应该或多或少,而不是:
Error in overscope_eval_next(overscope, expr) :
object "NONE" not found
Furthermore: Warning message:
Unknown or uninitialised column: 'bodytext'.
我希望我说得很清楚。 任何线索?
答案 0 :(得分:1)
我认为你击中了其中一个&#34;陷阱&#34;在网页抓取中:他们删除了网站上的这个功能/路径。
尝试前往http://www1.wfp.org/countries/common/allnews/en/iso=SLV(萨尔瓦多的新闻页面来自您之前使用过cpl的网址计划)。它不存在。
但是,如果您转到http://www1.wfp.org/countries/el-salvador,该页面上的http://www.wfp.org/news/el-salvador-177链接就是萨尔瓦多的新闻项目。
我认为它是相同的内容,只是以不同的方式呈现,所以它只是以不同的方式攻击它:
library(rvest)
library(httr)
library(stringi)
library(tidyverse)
这是一个帮手,所以我们可以获得他们的国家/地区ID和名称映射:
get_countries <- function() {
pg <- read_html("http://www.wfp.org/news/news-releases?tid=All&tid_2=All")
# find the country popup
country_sel <- html_nodes(pg, "select[name='tid'] option")
# extract ids and name for each country, ignoring "All"
data_frame(
cid = html_attr(country_sel, "value"),
cname = html_text(country_sel)
) %>%
filter(stri_detect_regex(cid, "[[:digit:]]"))
}
这是在网页上获取新闻内容的帮手
get_news <- function(cid, tid) {
GET("http://www.wfp.org/news/news-releases",
query=list(tid=cid, tid_2=tid)) -> res
warn_for_status(res)
if (status_code(res) > 200) return(NULL)
res <- content(res, as="parsed")
# check for no stories by testing for the presence of the
# div that has the "no stories are found" text
if (length(html_node(res, "div.view-empty")) != 0) return(NULL)
# find the news item boxes on this page
items <- html_nodes(res, "div.list-page-item")
# extract the contents
data_frame(
cid = cid,
tid = tid,
# significant inconsistency in how they assign CSS classes to date boxes
date = html_text(html_nodes(items, xpath=".//div[contains(@class, 'box-date')]"), trim=TRUE),
title = html_text(html_nodes(items, "h3"), trim=TRUE),
# how & where they put summary text in the div is also inconsistent so we
# need to (unfortunately) include the date and title to ensure we capture it
# we cld get just the text, but it's more complex code.
summary = html_text(items, trim=TRUE),
link = html_attr(html_nodes(items, "h3 a"), "href")
)
}
现在,我们遍历这些国家并获取所有故事:
country_df <- get_countries()
pb <- progress_estimated(length(country_df$cid))
map_df(country_df$cid, ~{
pb$tick()$print()
get_news(.x, "All")
}) -> news_df
# add in country names
mutate(news_df, cid = as.character(cid)) %>%
left_join(country_df) -> news_df
glimpse(news_df)
## Observations: 857
## Variables: 7
## $ cid <chr> "120", "120", "120", "120", "120", "120", "120", "120", "120", "120"...
## $ tid <chr> "All", "All", "All", "All", "All", "All", "All", "All", "All", "All"...
## $ date <chr> "26 October 2017", "16 October 2017", "2 October 2017", "10 July 201...
## $ title <chr> "US Contribution To Boost WFP Food Assistance And Local Economy In A...
## $ summary <chr> "26 October 2017\t\t\r\n\t\t\r\n\tUS Contribution To Boost WFP Food ...
## $ link <chr> "/news/news-release/us-contribution-boost-wfp-food-assistance-and-lo...
## $ cname <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", "Afghani...
您仍然需要尝试通过调整您拥有的其他代码对此进行分类,并且您可以使用数据框中的link
来挖掘更多文本以进行所述分类。
注意:这只会获取每个国家/地区的最新新闻页面,但这几乎是您想要做的事情(请查看net-new&amp; classify)。
现在,我们可以尝试通过循环播放国家/地区来自动对故事进行分类。弹出主题列表,因为这些主题似乎是你关心的(其中一些)。你需要相信他们标记的东西很好。
注意:这将花费长时间,特别是&#34;善良&#34;延迟因此我为什么只支持它,并且没有在轻度测试中运行它以确保它有效:
# get topic ids
get_topics <- function() {
pg <- read_html("http://www.wfp.org/news/news-releases?tid=All&tid_2=All")
# find the topic popup
country_sel <- html_nodes(pg, "select[name='tid_2'] option")
# extract ids and name for each topic, ignoring "All" and sub-topics
# i.e. ignore ones that begin with "-"
data_frame(
tid = html_attr(country_sel, "value"),
tname = html_text(country_sel)
) %>%
filter(stri_detect_regex(tid, "[[:digit:]]")) %>%
filter(tid != "All") # exclude "All" since we're trying to auto-tag
}
topics_df <- get_topics()
pb <- progress_estimated(length(country_df$cid))
map_df(country_df$cid, ~{
pb$tick()$print()
cid <- .x
Sys.sleep(5) ## NOTE THIS SHOULD REALLY GO IN get_news() but I didn't want to mess with that function for this extra part of the example
map_df(topics_df$tid, ~get_news(cid, .x))
}) -> news_with_tagged_topics_df
mutate(news_with_tagged_topics_df, tid = as.character(tid), cid = as.character(cid)) %>%
left_join(topics_df) %>%
left_join(country_df) %>%
glimpse()
我为3个国家的随机样本运行它:
## Observations: 11
## Variables: 8
## $ cid <chr> "4790", "4790", "4790", "4790", "4790", "4790", "4790", "152", "152"...
## $ tid <chr> "4488", "3929", "3929", "995", "999", "1005", "1005", "997", "995", ...
## $ date <chr> "16 December 2014", "2 September 2016", "1 October 2014", "1 October...
## $ title <chr> "Russia & WFP Seal Partnership To End Hunger; Kamaz Trucks Rolled Ou...
## $ summary <chr> "16 December 2014\t\t\r\n\t\t\r\n\tRussia & WFP Seal Partnership To ...
## $ link <chr> "/news/news-release/russia-wfp-seal-partnership-end-hunger-kamaz-tru...
## $ tname <chr> "Executive Director", "Centre of Excellence against Hunger", "Centre...
## $ cname <chr> "Brazil", "Brazil", "Brazil", "Brazil", "Brazil", "Brazil", "Brazil"...
它确实获得了多种标签:
unique(news_with_tagged_topics_df$tname)
## [1] "Executive Director" "Centre of Excellence against Hunger"
## [3] "Nutrition" "Procurement"
## [5] "School Meals" "Logistics"