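On a web page there is a table in which a single cell can contain multiple elements. I can scrape the table's content with the code below, but I cannot keep the elements bound together the way they are structured on the page. Is there a way to combine these elements correctly, or should I use a different approach to get each element?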
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <pthread.h>
#include <semaphore.h>

sem_t s; /* counts the free worker slots */

typedef struct Data {
    int index; /* task number */
    int j;     /* slot number, cycles 0 1 2 3 0 1 2 3 ... */
} Data;

void* someFunction(void* arg){
    /* main() took a slot before creating this thread, so at most
       num_threads threads are in here at once. */
    Data* d = arg;
    printf("Successfully completed task %d with thread %d\n", d->index, d->j);
    sleep(2);
    free(d);
    sem_post(&s); /* give the slot back; a thread must never join itself */
    return NULL;
}

int main(void){
    int num_task = 15; // i need to call someFunction() 9000 times
    int num_threads = 4;
    sem_init(&s, 0, num_threads);
    for (int i = 0; i < num_task; i++){
        /* Complete num_task tasks with at most four running at the same time:
           once four are executing someFunction(), wait until one has finished. */
        Data* a = malloc(sizeof(Data));
        if (a == NULL) return 1;
        a->index = i;
        a->j = i % num_threads;
        sem_wait(&s); /* block until a slot is free */
        pthread_t tid;
        if (pthread_create(&tid, NULL, someFunction, a) != 0){
            free(a);
            sem_post(&s);
            continue;
        }
        pthread_detach(tid); /* the thread cleans up after itself */
    }
    /* Take every slot back so main() does not exit while tasks are still running. */
    for (int k = 0; k < num_threads; k++){
        sem_wait(&s);
    }
    sem_destroy(&s);
    return 0;
}
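I have tried many times but still cannot organize the elements as they appear on the page. A paper may have several authors, and all authors and their links should be kept in one "row". Right now each author ends up in a separate row and the paper title is repeated for every one of them, which messes up the result.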
Answer 0 (score: 0)
Here is one way to generate a long data frame from that table:
library(rvest)
library(purrr)
library(tibble)
pg <- read_html("http://www.irgrid.ac.cn/handle/1471x/294320/browse?type=dateissued")
# extract the columns
col1 <- html_nodes(pg, "td[headers='t1']")
col2 <- html_nodes(pg, "td[headers='t2']")
col3 <- html_nodes(pg, "td[headers='t3']")
# this is the way to get the full text column
col4 <- html_nodes(pg, "td[headers='t3'] + td")
# now, iterate over the rows; map_df() will bind all our data frames together
map_df(seq_along(col1), function(i) {
  # extract the links
  a1 <- html_nodes(col1[i], "a")
  a2 <- html_nodes(col2[i], "a")
  a4 <- html_nodes(col4[i], "a")
  # put the row into a long data frame (one line per author);
  # tibble() replaces the deprecated data_frame()
  tibble(title = html_text(a1, trim=TRUE),
         title_link = html_attr(a1, "href"),
         author = html_text(a2, trim=TRUE),
         author_link = html_attr(a2, "href"),
         issue_date = html_text(col3[i], trim=TRUE),
         full_text = html_attr(a4, "href"))
})
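The long format repeats the title once per author. If one row per paper is what is wanted, as the question describes, the authors and their links can be collapsed into single strings instead. The following is only a sketch under stated assumptions: the inline HTML is a made-up stand-in for the real page (the `/paper/...` and `/author/...` links are hypothetical), and "; " is an arbitrary choice of separator.

```r
library(rvest)
library(purrr)
library(tibble)

# Hypothetical stand-in for the real page: the first paper has two authors.
html <- '
<table>
  <tr>
    <td headers="t1"><a href="/paper/1">Paper One</a></td>
    <td headers="t2"><a href="/author/a">Author A</a>; <a href="/author/b">Author B</a></td>
    <td headers="t3">2015</td>
  </tr>
  <tr>
    <td headers="t1"><a href="/paper/2">Paper Two</a></td>
    <td headers="t2"><a href="/author/c">Author C</a></td>
    <td headers="t3">2016</td>
  </tr>
</table>'

pg <- read_html(html)
col1 <- html_nodes(pg, "td[headers='t1']")
col2 <- html_nodes(pg, "td[headers='t2']")
col3 <- html_nodes(pg, "td[headers='t3']")

# One tibble row per table row; multiple authors are pasted together
papers <- map_df(seq_along(col1), function(i) {
  a1 <- html_nodes(col1[i], "a")
  a2 <- html_nodes(col2[i], "a")
  tibble(
    title        = html_text(a1, trim = TRUE)[1],
    title_link   = html_attr(a1, "href")[1],
    authors      = paste(html_text(a2, trim = TRUE), collapse = "; "),
    author_links = paste(html_attr(a2, "href"), collapse = "; "),
    issue_date   = html_text(col3[i], trim = TRUE)
  )
})

print(papers)
```

Because authors and links are collapsed in the same order, splitting both columns on "; " later recovers the matching pairs.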
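Answer 1 (score: 0)
The biggest problem I ran into with the "rvest" package is garbled characters. Even with the "encoding" argument set, the result still contains garbled text, although the page is encoded in UTF-8. For example: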
library(rvest)
pg <- read_html("http://www.irgrid.ac.cn/handle/1471x/294320/browse?type=dateissued", encoding = "UTF-8")
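In my tests the "XML" package gave the best results: with the getNodeSet function the output is correct, with no garbled text at all. However, I only get whole nodes and cannot tie each table row back to its structure.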
library(XML)
pg <- "http://www.irgrid.ac.cn/handle/1471x/294320/browse?type=dateissued"
pg_tables <- getNodeSet(htmlParse(pg), "//table[@summary='This table browse all dspace content']")
# gather the title cells of the whole table
# (".//" keeps the search inside the given node; a leading "//" searches the whole document)
papernode <- getNodeSet(pg_tables[[1]], ".//td[@headers='t1']")
paper_hrefs <- xpathSApply(papernode[[1]], './/a/@href')
paper_name <- xpathSApply(papernode[[1]], './/a', xmlValue)
# gather the author cells of the table
authnode <- getNodeSet(pg_tables[[1]], ".//td[@headers='t2']")
# gather the date cells of the table
datenode <- getNodeSet(pg_tables[[1]], ".//td[@headers='t3']")
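With this program I can get these "nodes" separately. However, extracting the titles and their links seems to be getting harder. Since the class of the result returned by "getNodeSet" differs from that of "html_nodes", how can the output of "getNodeSet" be turned into a data frame, so that the titles and their links are extracted from these nodes automatically and precisely?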
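One possible approach with the "XML" package is to select the table rows first and then query each row with relative XPath expressions, so that a title, its link, and its authors always come from the same row. This is a minimal sketch, not a tested solution for the real site: the inline HTML is a made-up stand-in for the page, and it assumes the `td` cells are direct children of their `tr`.

```r
library(XML)

# Hypothetical stand-in for the real page: the first paper has two authors.
html <- '
<table>
  <tr>
    <td headers="t1"><a href="/paper/1">Paper One</a></td>
    <td headers="t2"><a href="/author/a">Author A</a>; <a href="/author/b">Author B</a></td>
    <td headers="t3">2015</td>
  </tr>
  <tr>
    <td headers="t1"><a href="/paper/2">Paper Two</a></td>
    <td headers="t2"><a href="/author/c">Author C</a></td>
    <td headers="t3">2016</td>
  </tr>
</table>'

doc <- htmlParse(html, asText = TRUE)

# Keep only the rows that actually contain a title cell
rows <- getNodeSet(doc, "//tr[td[@headers='t1']]")

# Run a relative XPath inside one row and collapse the matches into one string
collapse_in_row <- function(row, path, fun, ...) {
  paste(trimws(unlist(xpathSApply(row, path, fun, ...))), collapse = "; ")
}

result <- do.call(rbind, lapply(rows, function(row) {
  data.frame(
    title        = collapse_in_row(row, "./td[@headers='t1']//a", xmlValue),
    title_link   = collapse_in_row(row, "./td[@headers='t1']//a", xmlGetAttr, "href"),
    authors      = collapse_in_row(row, "./td[@headers='t2']//a", xmlValue),
    author_links = collapse_in_row(row, "./td[@headers='t2']//a", xmlGetAttr, "href"),
    issue_date   = collapse_in_row(row, "./td[@headers='t3']", xmlValue),
    stringsAsFactors = FALSE
  )
}))

print(result)
```

The leading `./` is what ties each query to the current row; an expression starting with `//` would search the whole document again and return every link on the page.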