R sub: removing line terminators

Time: 2015-06-26 17:45:46

Tags: r web-scraping substring

I am trying to retrieve information such as depth and width from a piece of text I scraped, but I am having trouble doing so.

library(rvest)

obtain_url <- html(some_url)  # some_url: the page URL (not given here); newer rvest versions use read_html()
test <- obtain_url %>% html_node("#specifications") %>% html_text()
edit(test)


Dimensions:\n                            \n                                    Width (in.):\n                                    30\n                                \n                                \n                                \n                                    Depth (in.):\n                                    24.25\n                                \n                                \n                                \n                                    Width:\n                                    30 inches\n                                \n                                \n                                \n                                    Weight (lbs.):\n                                    320\n                                \n                                \n                                \n                                    Height (in.):\n                                    50.5\n 

dn<-sub(".*Width (in.):\n(.*)\n .*","\\1",test) # My attempt at retrieving width info

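My attempt simply returns the same text unchanged. All the information I am interested in always appears in the same pattern: Info:\n #36 blank spaces# Information\n. Sometimes the value is a number, sometimes it is just regular text. If someone could help me retrieve, for example, the numeric values for width and depth, I could apply the same approach to everything else.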
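A likely reason the attempt returns its input unchanged: in the pattern, `(in.)` is read as a capture group rather than as literal parentheses, so the pattern never matches and `sub` has nothing to replace. The sketch below illustrates this on an abbreviated stand-in for the scraped text (the `test` string here is shortened for illustration, not the full scrape), and shows one possible escaped variant.

```r
# Abbreviated stand-in for the scraped text (same "label:\n spaces value\n" layout).
test <- "Dimensions:\n    \n    Width (in.):\n    30\n    \n    Depth (in.):\n    24.25\n "

# Original pattern: "(in.)" is a capture group, so no literal "(" is matched.
unescaped <- ".*Width (in.):\n(.*)\n .*"
grepl(unescaped, test)                        # FALSE: nothing to replace
identical(sub(unescaped, "\\1", test), test)  # TRUE: the input comes back unchanged

# Escaped label, followed by optional whitespace and the number itself.
escaped <- "Width \\(in\\.\\):\\s*[0-9.]+"
hit <- regmatches(test, regexpr(escaped, test))
width <- as.numeric(sub(".*:\\s*", "", hit))
```

Using `regmatches` with `regexpr` extracts only the matched piece, so the pattern does not have to describe the whole string the way a `sub`-based approach does.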

2 Answers:

Answer 0 (score: 2)

I would try strsplit:

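clean <- function(x) {
  s <- strsplit(x, '\\n')             # split the text into lines
  s2 <- gsub('\\s{2,}', '', s[[1]])   # drop the runs of padding spaces
  indx <- grep(':', s2)               # lines containing ':' are the labels
  paste(s2[indx], s2[indx+1])         # pair each label with the line after it
}

clean(x)  # x holds the scraped text (test in the question)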
[1] "Dimensions: "       "Width (in.): 30"    "Depth (in.): 24.25"
[4] "Width: 30 inches"   "Weight (lbs.): 320" "Height (in.): 50.5"

If you don't need the text, try this:

clean2 <- function(x) {
  s <- strsplit(x, '\\n')
  s2 <- gsub('\\s{2,}', '', s[[1]])
  indx <- grep(':', s2)
  res <- s2[indx+1]
  # keep only digits and the decimal point; an empty value becomes NA
  num <- as.numeric(gsub('[^0-9\\.]', '', res, perl=TRUE))
  num
}

clean2(x)
[1]     NA  30.00  24.25  30.00 320.00  50.50

Or, better in my opinion:

clean3 <- function(x) {
  s <- strsplit(x, '\\n')
  s2 <- gsub('\\s{2,}', '', s[[1]])
  indx <- grep(':', s2)
  res <- s2[indx+1]
  num <- as.numeric(gsub('[^0-9\\.]', '', res, perl=TRUE))
  # one row per label, with its numeric value (NA where there is none)
  df <- data.frame(Measure=s2[indx], Value=num)
  df
}

# clean3(x)
#          Measure  Value
# 1    Dimensions:     NA
# 2   Width (in.):  30.00
# 3   Depth (in.):  24.25
# 4         Width:  30.00
# 5 Weight (lbs.): 320.00
# 6  Height (in.):  50.50
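To pull out a single measure by name, one option is to return a named character vector instead. The sketch below is a hypothetical variant (`clean_named` is not part of the answer above) and uses an abbreviated sample string; it keeps the values as text so that entries such as "30 inches" survive, leaving the numeric conversion to the caller.

```r
# Hypothetical variant: return the values as a character vector named by label.
clean_named <- function(x) {
  s <- trimws(strsplit(x, "\n", fixed = TRUE)[[1]])  # lines without padding
  idx <- grep(":$", s)                               # label lines end in ':'
  setNames(s[idx + 1], sub(":$", "", s[idx]))        # value = line after label
}

# Abbreviated sample with the same layout as the scraped text.
x <- "Dimensions:\n    \n    Width (in.):\n    30\n    \n    Depth (in.):\n    24.25\n    \n    Width:\n    30 inches\n "
vals <- clean_named(x)
depth <- as.numeric(vals[["Depth (in.)"]])
width_text <- vals[["Width"]]
```

Looking values up by label avoids depending on the order in which the fields happen to appear on the page.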

Answer 1 (score: 1)

text <- "Dimensions:\n                            \n                                    Width (in.):\n                                    30\n                                \n                                \n                                \n                                    Depth (in.):\n                                    24.25\n                                \n                                \n                                \n                                    Width:\n                                    30 inches\n                                \n                                \n                                \n                                    Weight (lbs.):\n                                    320\n                                \n                                \n                                \n                                    Height (in.):\n                                    50.5\n "

no_spaces <- gsub("\\n|\\s","",text)

# \\d* (rather than \\d?) keeps every digit after the decimal point
width <- as.numeric(sub(".+Width\\(in\\.\\)\\:(\\d+\\.?\\d*).*", "\\1", no_spaces)) # 30
depth <- as.numeric(sub(".+Depth\\(in\\.\\)\\:(\\d+\\.?\\d*).*", "\\1", no_spaces)) # 24.25

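Regular expressions are a pain because you have to escape the parentheses and the periods in the abbreviations, allow for an optional decimal point, and so on. But it seems to work. HTH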
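One way to sidestep most of the escaping is to locate the label as a fixed string and apply a regular expression only to the number that follows it. The following is a minimal sketch under that assumption; `extract_value` and `sample_text` are hypothetical names, and the sample string is an abbreviated version of the text above.

```r
# Hypothetical helper: find the label as a fixed string (no escaping needed),
# then take the first number that follows it.
extract_value <- function(text, label) {
  start <- as.integer(regexpr(label, text, fixed = TRUE))
  if (start == -1L) return(NA_real_)             # label not present
  rest <- substring(text, start + nchar(label))  # everything after the label
  num <- regmatches(rest, regexpr("[0-9]+\\.?[0-9]*", rest))
  if (length(num) == 0) NA_real_ else as.numeric(num)
}

sample_text <- "Dimensions:\n    \n    Width (in.):\n    30\n    \n    Depth (in.):\n    24.25\n "
width <- extract_value(sample_text, "Width (in.):")
depth <- extract_value(sample_text, "Depth (in.):")
```

Note that this takes the first number anywhere after the label, so a field whose value is plain text would pick up a number from a later field; it suits labels known to carry numeric values.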