How to store output in a list inside a for loop in R

Asked: 2017-05-17 06:43:46

Tags: r

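I am scraping a website. Each page I fetch contains 10 observations. I am writing a function that takes the number of pages to scrape, stores each page's result in a list, and then converts that list to a data frame.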

library(jsonlite)
forum_data_fetch <- function(no_of_pages) {

  pages <- seq(no_of_pages)
  #print(pages)
  forum_data <- list()

  for(i in 1:length(pages)){
    tmp <- fromJSON(paste("http://mmb.moneycontrol.com/index.php?q=topic/ajax_call&section=get_messages&offset=&lmid=&isp=0&gmt=cat_lm&catid=1&pgno=",i,sep=""))
    forum_data[[i]] <- tmp

  }

  dat <- as.data.frame(forum_data)
  dat <- dat[,c("msg_id","border_msg_count","user_id","border_level_text","follower_count", "topic", "tp_sector","tp_msg_count","heading", "flag", "price", "message")]

  return(dat)

}

test <- forum_data_fetch(3)

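Ideally, the function above should return 30 observations, but it returns only 10. I think I am making a mistake when I convert the list to a data.frame.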

2 Answers:

Answer 0 (score: 1)

Here is a version that works:

library(jsonlite)    # fromJSON()
library(data.table)  # rbindlist()
library(dplyr)       # %>%

forum_data_fetch <- function(no_of_pages) {
  pages <- seq(no_of_pages)
  forum_data <- list()

  # Fetch each page and store its parsed JSON as one list element
  for(i in seq_along(pages)){
    tmp <- fromJSON(paste("http://mmb.moneycontrol.com/index.php?q=topic/ajax_call&section=get_messages&offset=&lmid=&isp=0&gmt=cat_lm&catid=1&pgno=",i,sep=""))
    forum_data[[i]] <- tmp

  }
  cat("the length of forum_data is", length(forum_data), "\n")
  # Convert each page to a data frame, then stack the pages row-wise
  dat <- lapply(forum_data, as.data.frame) %>% rbindlist
  dat <- dat[,c("msg_id","border_msg_count","user_id","border_level_text","follower_count", "topic", "tp_sector","tp_msg_count","heading", "flag", "price", "message")]

  return(dat)

}

test <- forum_data_fetch(3)
dim(test)

The console output looks something like this:

> test <- forum_data_fetch(3)
the length of forum_data is 3 
> dim(test)
[1] 30 12
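
The same collect-then-stack pattern can be tried without network access. The following is only a minimal sketch: `fetch_page` and `fetch_pages` are hypothetical names, and `fetch_page` is a stand-in that fabricates 10 placeholder rows per page instead of calling `fromJSON` on the real URL. It uses `lapply` to build the list and base R `do.call(rbind, ...)` to stack it; `rbindlist` from data.table could be used for the stacking step instead.

```r
# Hypothetical stand-in for the real fromJSON() call: 10 placeholder rows per page.
fetch_page <- function(i) {
  data.frame(msg_id = sprintf("%d-%02d", i, 1:10),
             page   = i,
             stringsAsFactors = FALSE)
}

fetch_pages <- function(no_of_pages) {
  # lapply() returns one list element per page, so no manual indexing is needed.
  forum_data <- lapply(seq_len(no_of_pages), fetch_page)
  # Stack the per-page data frames row-wise into a single data frame.
  do.call(rbind, forum_data)
}

demo <- fetch_pages(3)
dim(demo)
```

With 3 pages of 10 rows each, `demo` should have 30 rows, one per placeholder observation.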

Answer 1 (score: 1)

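as.data.frame(forum_data) does not append new rows to the existing columns; it adds new columns (i.e. variables) with the same names, placed side by side. Use do.call(rbind, forum_data) instead.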

dat1 <- as.data.frame(forum_data)
str(dat1)
# 'data.frame':  10 obs. of  219 variables:
# $ TOTAL_MSG_CNT             : int  50000 NA NA NA NA NA NA NA NA NA
# $ msg_id                    : chr  "47754017" "47754014" "47751119" "47746189" ...
# $ user_id                   : chr  "rajeshatharv" "bullbuffet" "csr93" "sanjiv3312" ...
# .... 

dat2 <- do.call(rbind, forum_data)
str(dat2)
# 'data.frame': 30 obs. of  73 variables:
# $ TOTAL_MSG_CNT           : int  50000 NA NA NA NA NA NA NA NA NA ...
# $ msg_id                  : chr  "47754017" "47754014" "47751119" "47746189" ...
# $ user_id                 : chr  "rajeshatharv" "bullbuffet" "csr93" "sanjiv3312" ...
# ....

Then simply select the columns you want to use.
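
The difference between the two calls can be reproduced on toy data. This is a small sketch, not the forum data: the three "pages" are made-up data frames of 2 rows each, and only the column names `msg_id` and `price` are borrowed from the question.

```r
# Three toy "pages", each a data frame with 2 rows and identical columns.
pages <- lapply(1:3, function(i) {
  data.frame(msg_id = paste0("p", i, "_", 1:2),
             price  = c(i, i + 0.5),
             stringsAsFactors = FALSE)
})

# as.data.frame() on a list of data frames binds them column-wise:
# the row count stays at 2 and the columns are repeated once per page.
wide <- as.data.frame(pages)
dim(wide)

# do.call(rbind, ...) stacks the pages row-wise:
# the rows accumulate and the column count stays at 2.
long <- do.call(rbind, pages)
dim(long)

# Select the columns of interest after stacking.
selected <- long[, c("msg_id", "price")]
```

This mirrors the str() output above: the column-wise result keeps the per-page row count and triples the number of variables, while the row-wise result keeps the variables and triples the rows.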