Keeping the table's structure when one cell holds several elements while scraping in R

Asked: 2016-06-02 08:36:20

Tags: r dataframe webpage

There is a table on a web page in which a single cell can contain several elements. I can scrape all of the table's contents, but I cannot bind the elements back together according to the structure they have on the page. Is there a way to keep these elements grouped exactly as they appear, or should I take a different approach to get each element?
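
For illustration, a naive scrape along the lines below pulls out all of the text and links, but everything comes back as flat vectors, so the grouping by row is lost. (This is a minimal sketch of my own, assuming rvest and the td[headers=...] selectors that appear in the answers further down; it is not the code from the original post.)

    # a minimal sketch (not the original snippet): grab the cell text and links with rvest
    library(rvest)

    pg <- read_html("http://www.irgrid.ac.cn/handle/1471x/294320/browse?type=dateissued",
                    encoding = "UTF-8")

    titles  <- html_text(html_nodes(pg, "td[headers='t1'] a"), trim = TRUE)
    authors <- html_text(html_nodes(pg, "td[headers='t2'] a"), trim = TRUE)
    dates   <- html_text(html_nodes(pg, "td[headers='t3']"), trim = TRUE)

    # the vectors have different lengths because a paper can have several authors,
    # so they cannot simply be cbind()-ed back into one row per paper
    length(titles); length(authors); length(dates)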

I have tried many times but still cannot organize them in their original form. For example, a paper may have several authors, and all of those authors and their links should stay in one "row"; at the moment each author ends up in its own row and the paper's title is repeated for every one of them. This messes up the result.

2 answers:

Answer 0 (score: 0)

Here is one way to produce a long data frame from that table:

    library(rvest)
    library(purrr)
    library(tibble)

    pg <- read_html("http://www.irgrid.ac.cn/handle/1471x/294320/browse?type=dateissued")

    # extract the columns

    col1 <- html_nodes(pg, "td[headers='t1']")
    col2 <- html_nodes(pg, "td[headers='t2']")
    col3 <- html_nodes(pg, "td[headers='t3']")

    # this is the way to get the full text column

    col4 <- html_nodes(pg, "td[headers='t3'] + td")

    # now, iterate over the rows; map_df() will bind all our data.frames together

    map_df(1:length(col1), function(i) {

      # extract the links

      a1 <- xml_nodes(col1[i], "a")
      a2 <- xml_nodes(col2[i], "a")
      a4 <- xml_nodes(col4[i], "a")

      # put the pieces into a long data.frame for this row

      data_frame(      title = html_text(a1, trim=TRUE),
                  title_link = html_attr(a1, "href"),
                      author = html_text(a2, trim=TRUE),
                 author_link = html_attr(a2, "href"),
                  issue_date = html_text(col3[i], trim=TRUE),
                   full_text = html_attr(a4, "href"))

    })
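
If one row per paper is preferred, with all of a paper's authors and their links kept together, the long data frame can be collapsed afterwards. A sketch, assuming dplyr is available; "long_df" is just a placeholder name for the result of the map_df() call above:

    # collapse the long data frame so each paper keeps all of its authors in one row
    library(dplyr)

    long_df %>%
      group_by(title, title_link, issue_date) %>%
      summarise(author      = paste(author, collapse = "; "),
                author_link = paste(author_link, collapse = "; "),
                full_text   = paste(unique(full_text), collapse = "; ")) %>%
      ungroup()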

Answer 1 (score: 0)

The biggest problem when using the "rvest" package is garbled text. Even though the "encoding" argument is used in the program, the result still contains garbled characters, although the page's encoding is UTF-8. For example:

    library(rvest)
    pg <- read_html("http://www.irgrid.ac.cn/handle/1471x/294320/browse?type=dateissued", encoding = "UTF-8")
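
One thing worth checking when the output looks garbled (a sketch of my own, not from the original post, reusing the pg object read above): pull out a few strings and look at which encoding R has marked them with; re-marking them as UTF-8 is sometimes enough to fix the display, particularly on Windows.

    # check how R has marked the extracted strings
    titles <- html_text(html_nodes(pg, "td[headers='t1'] a"), trim = TRUE)
    Encoding(titles)             # what encoding R thinks the strings are in
    Encoding(titles) <- "UTF-8"  # re-mark them as UTF-8 and inspect again
    head(titles)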

In my tests the best behaviour came from "XML": when I use the getNodeSet function the result is correct, with no garbled text at all. However, I only get the nodes as a whole and cannot combine each table row with its structure.

    library(XML)
    pg <- "http://www.irgrid.ac.cn/handle/1471x/294320/browse?type=dateissued"
    # gather the node of the whole table
    pg_tables <- getNodeSet(htmlParse(pg), "//table[@summary='This table browse all dspace content']")
    # gather the title cells and their links
    papernode <- getNodeSet(pg_tables[[1]], "//td[@headers='t1']")
    paper_hrefs <- xpathSApply(papernode[[1]], '//a/@href')
    paper_name  <- xpathSApply(papernode[[1]], '//a', xmlValue)
    # gather the authors in the table
    authnode <- getNodeSet(pg_tables[[1]], "//td[@headers='t2']")
    # gather the dates in the table
    datenode <- getNodeSet(pg_tables[[1]], "//td[@headers='t3']")

With this program I can get these "nodes" separately, but grabbing the titles together with their links still seems difficult, because the class of the result returned by "getNodeSet" is different from that of "html_nodes". How can we read the data produced by "getNodeSet" into a data frame automatically and extract the titles and their links from these nodes in a precise way?
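
One way to do that (a sketch of my own, not from the original post) is to iterate over the cells returned by getNodeSet and use relative XPath expressions (".//a" rather than "//a", which searches the whole document), so each cell only yields its own links; the rows can then be bound together with do.call(rbind, ...):

    # a sketch: build one data frame row per table row from the nodes gathered above
    # the ".//" prefix keeps each XPath search inside the current cell
    rows <- lapply(seq_along(papernode), function(i) {
      data.frame(
        title       = xpathSApply(papernode[[i]], ".//a", xmlValue),
        title_link  = xpathSApply(papernode[[i]], ".//a/@href"),
        author      = paste(xpathSApply(authnode[[i]], ".//a", xmlValue), collapse = "; "),
        author_link = paste(xpathSApply(authnode[[i]], ".//a/@href"), collapse = "; "),
        issue_date  = xmlValue(datenode[[i]]),
        stringsAsFactors = FALSE
      )
    })
    result <- do.call(rbind, rows)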