Question

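I'm practicing web scraping with R and the 'rvest' package. I now have a character vector of 125 elements (p_text) and want to convert it to a data frame with 25 rows and 5 columns, named q1, opt1, opt2, opt3, opt4.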

So elements 1, 6, 11 go in column q1; 2, 7, 12 in column opt1; 3, 8, 13 in column opt2; and so on.

library(dplyr)    
library(rvest)

url <- 'http://upscfever.com/upsc-fever/en/test/en-test-sci1.html'

webpage <- read_html(url)

p_text <- webpage %>%
        html_nodes("label") %>%
        html_text()

How can I do this?

Answer 1

Convert to a matrix first so the elements are arranged correctly, then convert that to a data frame:

dat <- as.data.frame(matrix(p_text, ncol = 5, byrow = TRUE), stringsAsFactors = FALSE)
names(dat) <- c("q1", "opt1", "opt2", "opt3", "opt4")

str(dat)
## 'data.frame':   25 obs. of  5 variables:
##  $ q1  : chr  "Q1: Energy giving foods are " "Q2:Animal fats are categorized as" "Q3: Which is true" "Q4: Trans fats are" ...
##  $ opt1: chr  "Carbohydrates and fats" "saturated fatty acids" "saturated fatty acids are good for health" "unsaturated fats" ...
##  $ opt2: chr  "Carbohydrates and Proteins" "unsaturated fatty acids" "unsaturated fatty acids are harmful for health" "saturated fats" ...
##  $ opt3: chr  "Proteins and fats" "polyunsaturated fatty acids" "unsaturated fatty acids are good for health" "good for health" ...
##  $ opt4: chr  "carbohydrates, fats and proteins" "trans fats" "Animal fats are good for health" "animal fats" ...
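
To see why byrow = TRUE matters, here is a minimal sketch on a made-up 10-element vector (toy_text and its contents are hypothetical, not the scraped data). matrix() fills column by column by default, so byrow = TRUE is what keeps each question on the same row as its four options. The length check is an added precaution, assuming every question on the page has exactly four options.

```r
# Hypothetical stand-in for p_text: 2 questions, each followed by 4 options
toy_text <- c("Q1: first?",  "a1", "b1", "c1", "d1",
              "Q2: second?", "a2", "b2", "c2", "d2")

# Guard: the reshape only makes sense if the length is a multiple of 5
stopifnot(length(toy_text) %% 5 == 0)

# byrow = TRUE fills row by row, so elements 1-5 become row 1 and 6-10 row 2
toy <- as.data.frame(matrix(toy_text, ncol = 5, byrow = TRUE),
                     stringsAsFactors = FALSE)
names(toy) <- c("q1", "opt1", "opt2", "opt3", "opt4")

toy$q1    # "Q1: first?"  "Q2: second?"
toy$opt1  # "a1" "a2"
```

Without byrow = TRUE the same call would put "Q1: first?" and "a1" in the first column, mixing questions and options.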

If you want to clean up the q1 column, you might do something like this:

dat$q1 <- sub("^Q\\d{1,2}:[ ]?", "", dat$q1)

This removes the leading question number, the colon, and an optional space after it.
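
As a quick check of what that pattern matches, here is a small sketch. The first two strings are taken from the str() output above; "Q10: Example" is a made-up value added to show the two-digit case. The trimws() step is an optional extra, not part of the original answer.

```r
# Sample question strings covering both spacing styles ("Q1: " and "Q2:")
q <- c("Q1: Energy giving foods are ",
       "Q2:Animal fats are categorized as",
       "Q10: Example")  # hypothetical two-digit case

# ^Q        a literal Q at the start of the string
# \\d{1,2}  one or two digits
# :[ ]?     a colon followed by an optional single space
cleaned <- sub("^Q\\d{1,2}:[ ]?", "", q)
cleaned
# "Energy giving foods are " "Animal fats are categorized as" "Example"

# Optional: trimws() (base R) also drops the trailing space on the first item
trimmed <- trimws(cleaned)
```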
