I am trying to retrieve information such as depth and width from a piece of code I scraped, but I am having trouble doing so.
obtain_url <- html(# Some url)
test <- obtain_url %>% html_node("#specifications") %>% html_text()
edit(test)
Dimensions:\n \n Width (in.):\n 30\n \n \n \n Depth (in.):\n 24.25\n \n \n \n Width:\n 30 inches\n \n \n \n Weight (lbs.):\n 320\n \n \n \n Height (in.):\n 50.5\n
dn<-sub(".*Width (in.):\n(.*)\n .*","\\1",test) # My attempt at retrieving width info
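My attempt simply spits out the same text. All the information I am interested in always appears in the same pattern: Info:\n #36 blank spaces# Information\n. Sometimes it is a number, sometimes it is just regular text. If someone could help me retrieve, for example, the numeric values for width and depth, I can apply the same approach to everything else.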
Answer 0 (score: 2)
I would try strsplit.
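clean <- function(x) {
  # Split the scraped string into lines
  s <- strsplit(x, '\\n')
  # Remove runs of two or more whitespace characters (the indentation)
  s2 <- gsub('\\s{2,}', '', s[[1]])
  # Label lines are the ones that contain a colon
  indx <- grep(':', s2)
  # Pair each label with the line that follows it
  paste(s2[indx], s2[indx + 1])
}
clean(x)  # x is the scraped string, e.g. test from the question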
[1] "Dimensions: " "Width (in.): 30" "Depth (in.): 24.25"
[4] "Width: 30 inches" "Weight (lbs.): 320" "Height (in.): 50.5"
If you don't need the text, try this:
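clean2 <- function(x) {
  s <- strsplit(x, '\\n')
  s2 <- gsub('\\s{2,}', '', s[[1]])
  indx <- grep(':', s2)
  # The value sits on the line after each label
  res <- s2[indx + 1]
  # Keep only digits and decimal points, then convert to numeric
  num <- as.numeric(gsub('[^0-9\\.]', '', res, perl = TRUE))
  num
}
clean2(x)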
[1] NA 30.00 24.25 30.00 320.00 50.50
Or, better in my opinion:
clean3 <- function(x) {
  s <- strsplit(x, '\\n')
  s2 <- gsub('\\s{2,}', '', s[[1]])
  indx <- grep(':', s2)
  res <- s2[indx + 1]
  num <- as.numeric(gsub('[^0-9\\.]', '', res, perl = TRUE))
  # Return labels and numeric values side by side
  df <- data.frame(Measure = s2[indx], Value = num)
  df
}
# clean3(x)
#          Measure  Value
# 1    Dimensions:     NA
# 2   Width (in.):  30.00
# 3   Depth (in.):  24.25
# 4         Width:  30.00
# 5 Weight (lbs.): 320.00
# 6  Height (in.):  50.50
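Since the goal was to get specific values such as width and depth, the same split-and-pair idea can be turned into a lookup by label. The following is only a sketch: parse_specs is a hypothetical helper, not part of the functions above, and it assumes every label line ends with a colon and that its value is on the next non-blank line. Header lines such as "Dimensions:" that are followed by another label are skipped.

```r
# Hypothetical helper (an assumption, not from the answer above):
# turn "Label:" / value pairs into a named numeric vector.
parse_specs <- function(x) {
  lines <- trimws(strsplit(x, "\n", fixed = TRUE)[[1]])
  lines <- lines[nzchar(lines)]      # drop blank lines
  is_label <- grepl(":$", lines)     # assumed: labels end with a colon
  idx <- which(is_label)
  # keep only labels that are directly followed by a value line
  idx <- idx[idx < length(lines) & !is_label[pmin(idx + 1, length(lines))]]
  # strip everything except digits and the decimal point, then convert
  values <- as.numeric(gsub("[^0-9.]", "", lines[idx + 1]))
  names(values) <- sub(":$", "", lines[idx])
  values
}

x <- "Dimensions:\n \n Width (in.):\n 30\n \n \n \n Depth (in.):\n 24.25\n \n \n \n Width:\n 30 inches\n"
specs <- parse_specs(x)
specs[["Width (in.)"]]
specs[["Depth (in.)"]]
```

Looking values up by name avoids depending on the order in which the fields appear on the page. Fields whose value is plain text rather than a number would come back as NA here.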
Answer 1 (score: 1)
text <- "Dimensions:\n \n Width (in.):\n 30\n \n \n \n Depth (in.):\n 24.25\n \n \n \n Width:\n 30 inches\n \n \n \n Weight (lbs.):\n 320\n \n \n \n Height (in.):\n 50.5\n "
no_spaces <- gsub("\\n|\\s","",text)
# \\d* (not \\d?) so that all decimal digits are captured, e.g. 24.25 rather than 24.2
width <- as.numeric(sub(".+Width\\(in\\.\\):(\\d+\\.?\\d*).*", "\\1", no_spaces)) # 30
depth <- as.numeric(sub(".+Depth\\(in\\.\\):(\\d+\\.?\\d*).*", "\\1", no_spaces)) # 24.25
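Regular expressions are a pain because you have to escape the parentheses, the periods in the abbreviations, the optional decimal point, and so on. But it seems to work. HTH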
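The escaping issue is also why the attempt in the question returned the input unchanged: in ".*Width (in.):\n(.*)\n .*" the unescaped "(in.)" is a capture group rather than literal parentheses, so the pattern never matches, and sub returns its input as-is when there is no match. One way to avoid escaping by hand is to quote the label with \Q...\E, which is available when perl = TRUE. This is a sketch only: extract_number is a hypothetical helper, and it assumes the label is followed by a colon, optional whitespace, and then the number.

```r
# Hypothetical helper (an assumption, not from the answer above).
extract_number <- function(text, label) {
  # \\Q...\\E (PCRE) makes "(", ")" and "." in the label literal
  pattern <- paste0("\\Q", label, "\\E:\\s*([0-9]+(?:\\.[0-9]+)?)")
  m <- regmatches(text, regexec(pattern, text, perl = TRUE))[[1]]
  if (length(m) < 2) return(NA_real_)   # label not found
  as.numeric(m[2])                      # first capture group
}

text <- "Dimensions:\n \n Width (in.):\n 30\n \n \n \n Depth (in.):\n 24.25\n \n \n \n Width:\n 30 inches\n \n \n \n Weight (lbs.):\n 320\n \n \n \n Height (in.):\n 50.5\n "
width <- extract_number(text, "Width (in.)")
depth <- extract_number(text, "Depth (in.)")
```

Because \s* also matches newlines, this works on the raw string without stripping whitespace first. A label that itself contains "\E" would end the quoting early, so this sketch assumes plain labels like the ones shown.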