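I'm practicing web scraping with R and the rvest package. I now have a character vector of 125 elements (p_text) and want to convert it to a data frame with 25 rows and 5 columns named q1, opt1, opt2, opt3, opt4.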
So elements 1, 6, 11, ... should go in column q1; elements 2, 7, 12, ... in opt1; elements 3, 8, 13, ... in opt2; and so on.
library(dplyr)
library(rvest)
url <- 'http://upscfever.com/upsc-fever/en/test/en-test-sci1.html'
webpage <- read_html(url)
p_text <- webpage %>%
  html_nodes("label") %>%
  html_text()
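How can I do this?
Answer 0 (score: 0)
Fill a matrix row by row (byrow = TRUE) so that each consecutive group of five elements becomes one row, then convert the matrix to a data frame: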
dat <- as.data.frame(matrix(p_text, ncol = 5, byrow = TRUE), stringsAsFactors = FALSE)
names(dat) <- c("q1", "opt1", "opt2", "opt3", "opt4")
str(dat)
## 'data.frame': 25 obs. of 5 variables:
## $ q1 : chr "Q1: Energy giving foods are " "Q2:Animal fats are categorized as" "Q3: Which is true" "Q4: Trans fats are" ...
## $ opt1: chr "Carbohydrates and fats" "saturated fatty acids" "saturated fatty acids are good for health" "unsaturated fats" ...
## $ opt2: chr "Carbohydrates and Proteins" "unsaturated fatty acids" "unsaturated fatty acids are harmful for health" "saturated fats" ...
## $ opt3: chr "Proteins and fats" "polyunsaturated fatty acids" "unsaturated fatty acids are good for health" "good for health" ...
## $ opt4: chr "carbohydrates, fats and proteins" "trans fats" "Animal fats are good for health" "animal fats" ...
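To see the row-wise fill without hitting the website, here is a minimal sketch on a made-up vector. The names toy and toy_df and their contents are hypothetical stand-ins for p_text and dat, and the length check is an extra guard that is not part of the answer above:

```r
# Hypothetical stand-in for p_text: 2 questions, each followed by 4 options
toy <- c("Q1: first?", "a", "b", "c", "d",
         "Q2: second?", "e", "f", "g", "h")

# Guard: matrix() recycles values (with a warning) if the length
# is not a multiple of the number of columns, so check it up front
stopifnot(length(toy) %% 5 == 0)

# byrow = TRUE fills across each row first, so elements 1-5 form row 1,
# elements 6-10 form row 2, and so on
toy_df <- as.data.frame(matrix(toy, ncol = 5, byrow = TRUE),
                        stringsAsFactors = FALSE)
names(toy_df) <- c("q1", "opt1", "opt2", "opt3", "opt4")
toy_df
```

Without byrow = TRUE, matrix() fills column by column, which would put the first two elements of the vector in the q1 column instead of one question per row.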
If you want to clean up the q1 column, you could do something like this:
dat$q1 <- sub("^Q\\d{1,2}:[ ]?", "", dat$q1)
This removes the leading question number, the colon, and the optional space that follows it.
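As a quick check of that pattern, here is a small sketch on two values copied from the str() output above. The variable names q and cleaned are illustrative, and the trimws() step is an optional addition that is not part of the original answer:

```r
# Sample values copied from the str() output above
q <- c("Q1: Energy giving foods are ", "Q2:Animal fats are categorized as")

# ^Q        literal "Q" at the start of the string
# \\d{1,2}  one or two digits (enough for Q1 through Q25)
# :[ ]?     a colon followed by an optional single space
cleaned <- sub("^Q\\d{1,2}:[ ]?", "", q)

# Optional extra step: drop leftover leading or trailing whitespace
cleaned <- trimws(cleaned)
cleaned
```

The optional space in the pattern matters because the scraped text is inconsistent: "Q1: " has a space after the colon while "Q2:" does not.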