I have a div with two p tags. I need to get the text of the second p element.
<div class="fb-price-list">
<p class="fb-price">S/ 1,699 (Internet)</p>
<p class="fb-price">S/ 2,399 (Normal)</p>
</div>
Expected result:
S/ 2,399 (Normal)
I have this, but it does not work:
tvs_url <- read_html("https://www.falabella.com.pe/falabella-pe/category/cat210477/TV-Televisores?page=1")
product_price_actual <- tvs_url %>%
html_nodes('div.pod-group pod-group__large-pod div.pod-body div.fb-price-list p.fb-price:nth-child(2)') %>%
html_text()
HTML:
<div class="pod-item"><div class="fb-form__input--checkbox fb-pod__item__compare"><input id="fb-pod__item__input-16754140" class="fb-pod__item__compare__input" type="checkbox" name="fb-pod__item__input-16754140" value="16754140"><label for="fb-pod__item__input-16754140" class="fb-pod__item__compare__label">Comparar</label></div><div class="pod-head"><a class="pod-head__image" href="/falabella-pe/product/16754140/LED-UHD-4K-55-Smart-TV-UN55RU7100GXPE-SERIE-RU7100/16754140"><div class="content__image"><img src="//falabella.scene7.com/is/image/FalabellaPE/16754140?wid=544&hei=544&qlt=70&anchor=750,750&crop=0,0,0,0" alt="img" class="image"></div></a><a href="/falabella-pe/product/16754140/LED-UHD-4K-55-Smart-TV-UN55RU7100GXPE-SERIE-RU7100/16754140" class="pod-head__stickerslink"><div class="pod-head__stickers"><div class="fb-responsive-flag fb-responsive-stylised-caps fb-pod__flag fb-pod__flag--percentoff" data-discount-content="">29%</div></div></a></div><div class="pod-body"><a class="section__pod-top" href="/falabella-pe/product/16754140/LED-UHD-4K-55-Smart-TV-UN55RU7100GXPE-SERIE-RU7100/16754140"><div class="section__pod-top-brand">SAMSUNG</div><div class="section__pod-top-title"><div class="LinesEllipsis ">LED UHD 4K 55" Smart TV UN55RU7100GXPE SERIE RU7100<wbr></div></div></a><div class="section__pod-middle"><div class="section__pod-middle-content__stickers"><div class="fb-responsive-flag fb-responsive-stylised-caps fb-pod__flag fb-pod__flag--percentoff" data-discount-content="">29%</div></div><div class="section__information"><a class="section__information-link" href="/falabella-pe/product/16754140/LED-UHD-4K-55-Smart-TV-UN55RU7100GXPE-SERIE-RU7100/16754140"><div class="fb-price-list"><p class="fb-price">S/ 1,699 (Internet)</p><p class="fb-price">S/ 2,399 (Normal)</p></div></a></div><div class="section__pod-middle-content__button"><button class="btn-add-to-basket">AGREGAR A TU BOLSA</button></div></div><div class="section__pod-bottom"><div class="fb-pod__rating" 
style="visibility: hidden;"><a href="/falabella-pe/product/16754140/LED-UHD-4K-55-Smart-TV-UN55RU7100GXPE-SERIE-RU7100/16754140#comments"><div class="fb-rating-stars"><div class="fb-rating-stars__container"><div class="fb-rating-stars__holder"><span class=""><i class="icon-rating"></i></span></div><div class="fb-rating-stars__holder"><span class=""><i class="icon-rating"></i></span></div><div class="fb-rating-stars__holder"><span class=""><i class="icon-rating"></i></span></div><div class="fb-rating-stars__holder"><span class=""><i class="icon-rating"></i></span></div><div class="fb-rating-stars__holder"><span class=""><i class="icon-rating"></i></span></div><p class="fb-rating-stars__count">0 <span class="fb-rating-stars__count__max"> / 5</span></p></div></div></a></div><a class="section__pod-bottom-descriptionlink" href="/falabella-pe/product/16754140/LED-UHD-4K-55-Smart-TV-UN55RU7100GXPE-SERIE-RU7100/16754140"><ul class="section__pod-bottom-description"><li>Modelo: UN55RU7100GXPE</li><li>Tamaño de la pantalla: 55"</li><li>Resolución: 4K Ultra HD</li><li>Tecnología: Led</li><li>Conexión bluetooth: Sí</li></ul></a></div></div></div>
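Update 1:
Based on the selected answer, I use ifelse to check the number of characters at a given position.
The position to check is the 4th. When there is no precio_antes (previous price), that position is occupied by another element, so in that case we need to put NA: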
ifelse(nchar(sapply(splitted, "[", 4))>3, NA, sapply(splitted, "[", 6))
How I build the final df:
df <- data.frame(
brand = sapply(splitted, "[", 2), #We don't need the "comparar" text so we start from 2
product = sapply(splitted, "[", 3),
precio_antes = ifelse(nchar(sapply(splitted, "[", 4))>3, NA, sapply(splitted, "[", 6)),
precio_actual = ifelse(nchar(sapply(splitted, "[", 4))<=3, sapply(splitted, "[", 5), sapply(splitted, "[", 4))
)
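The position check above can be tried on a small example. This is only a sketch: the two `splitted` entries are hypothetical stand-ins for the scraped text (one product with a discount badge, one without), and `pick` and `has_discount` are helper names introduced here, not part of the original code.

```r
# Hypothetical scraped text, split into fields per product.
# Field 1 is the "Comparar" label, so the useful fields start at 2.
splitted <- list(
  c("Comparar", "SAMSUNG", "LED UHD 4K 55\" Smart TV", "29%",
    "S/ 1,699 (Internet)", "S/ 2,399 (Normal)"),
  c("Comparar", "LG", "LED 43\" Smart TV", "S/ 2,199 (Internet)")
)

# Take the i-th field of every product (NA when the field is absent)
pick <- function(i) vapply(splitted, `[`, character(1), i)

# A short 4th field (such as "29%") means a discount badge is present
has_discount <- nchar(pick(4)) <= 3

df <- data.frame(
  brand = pick(2),
  product = pick(3),
  precio_antes = ifelse(has_discount, pick(6), NA),
  precio_actual = ifelse(has_discount, pick(5), pick(4)),
  stringsAsFactors = FALSE
)
print(df)
```

Computing the check once and reusing it keeps the two price columns consistent with each other.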
Answer 0 (score: 1)
Here I use a CSS selector to pick the node with class fb-price-list and then select its second p child:
library(rvest)
"<div class=\"pod-item\"><div class=\"fb-form__input--checkbox fb-pod__item__compare\"><input id=\"fb-pod__item__input-16754140\" class=\"fb-pod__item__compare__input\" type=\"checkbox\" name=\"fb-pod__item__input-16754140\" value=\"16754140\"><label for=\"fb-pod__item__input-16754140\" class=\"fb-pod__item__compare__label\">Comparar</label></div><div class=\"pod-head\"><a class=\"pod-head__image\" href=\"/falabella-pe/product/16754140/LED-UHD-4K-55-Smart-TV-UN55RU7100GXPE-SERIE-RU7100/16754140\"><div class=\"content__image\"><img src=\"//falabella.scene7.com/is/image/FalabellaPE/16754140?wid=544&hei=544&qlt=70&anchor=750,750&crop=0,0,0,0\" alt=\"img\" class=\"image\"></div></a><a href=\"/falabella-pe/product/16754140/LED-UHD-4K-55-Smart-TV-UN55RU7100GXPE-SERIE-RU7100/16754140\" class=\"pod-head__stickerslink\"><div class=\"pod-head__stickers\"><div class=\"fb-responsive-flag fb-responsive-stylised-caps fb-pod__flag fb-pod__flag--percentoff\" data-discount-content=\"\">29%</div></div></a></div><div class=\"pod-body\"><a class=\"section__pod-top\" href=\"/falabella-pe/product/16754140/LED-UHD-4K-55-Smart-TV-UN55RU7100GXPE-SERIE-RU7100/16754140\"><div class=\"section__pod-top-brand\">SAMSUNG</div><div class=\"section__pod-top-title\"><div class=\"LinesEllipsis \">LED UHD 4K 55\" Smart TV UN55RU7100GXPE SERIE RU7100<wbr></div></div></a><div class=\"section__pod-middle\"><div class=\"section__pod-middle-content__stickers\"><div class=\"fb-responsive-flag fb-responsive-stylised-caps fb-pod__flag fb-pod__flag--percentoff\" data-discount-content=\"\">29%</div></div><div class=\"section__information\"><a class=\"section__information-link\" href=\"/falabella-pe/product/16754140/LED-UHD-4K-55-Smart-TV-UN55RU7100GXPE-SERIE-RU7100/16754140\"><div class=\"fb-price-list\"><p class=\"fb-price\">S/ 1,699 (Internet)</p><p class=\"fb-price\">S/ 2,399 (Normal)</p></div></a></div><div class=\"section__pod-middle-content__button\"><button class=\"btn-add-to-basket\">AGREGAR A TU 
BOLSA</button></div></div><div class=\"section__pod-bottom\"><div class=\"fb-pod__rating\" style=\"visibility: hidden;\"><a href=\"/falabella-pe/product/16754140/LED-UHD-4K-55-Smart-TV-UN55RU7100GXPE-SERIE-RU7100/16754140#comments\"><div class=\"fb-rating-stars\"><div class=\"fb-rating-stars__container\"><div class=\"fb-rating-stars__holder\"><span class=\"\"><i class=\"icon-rating\"></i></span></div><div class=\"fb-rating-stars__holder\"><span class=\"\"><i class=\"icon-rating\"></i></span></div><div class=\"fb-rating-stars__holder\"><span class=\"\"><i class=\"icon-rating\"></i></span></div><div class=\"fb-rating-stars__holder\"><span class=\"\"><i class=\"icon-rating\"></i></span></div><div class=\"fb-rating-stars__holder\"><span class=\"\"><i class=\"icon-rating\"></i></span></div><p class=\"fb-rating-stars__count\">0 <span class=\"fb-rating-stars__count__max\"> / 5</span></p></div></div></a></div><a class=\"section__pod-bottom-descriptionlink\" href=\"/falabella-pe/product/16754140/LED-UHD-4K-55-Smart-TV-UN55RU7100GXPE-SERIE-RU7100/16754140\"><ul class=\"section__pod-bottom-description\"><li>Modelo: UN55RU7100GXPE</li><li>Tamaño de la pantalla: 55\"</li><li>Resolución: 4K Ultra HD</li><li>Tecnología: Led</li><li>Conexión bluetooth: Sí</li></ul></a></div></div></div>" %>%
read_html() %>%
html_nodes(".fb-price-list p:nth-child(2)") %>%
html_text()
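The same selector can be checked against the short snippet from the question. This is a minimal sketch on static HTML, so it says nothing about whether the live page serves these nodes. One likely reason the selector in the question matches nothing is that `pod-group__large-pod` has no leading dot, so it is read as an element name rather than a class.

```r
library(rvest)

# The two-price snippet from the question, as a literal string
html <- '<div class="fb-price-list">
  <p class="fb-price">S/ 1,699 (Internet)</p>
  <p class="fb-price">S/ 2,399 (Normal)</p>
</div>'

# Select the p that is the second child of the price list
second_price <- read_html(html) %>%
  html_nodes(".fb-price-list p:nth-child(2)") %>%
  html_text()

print(second_price)
```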
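Answer 1 (score: 1)
tl;dr
The content is loaded dynamically, but it is available as a string in the page source: a JavaScript dictionary that can be extracted with a regex and then parsed with a JSON parser. This is the JSON as currently extracted.
If you open the dev tools with F12 and inspect the HTML page, you will see a script tag containing a JavaScript dictionary that can be extracted and processed with a JSON parser. This means you could target that script tag, extract the text from the node, and take a substring, but I prefer to use a regex on the string (note that I extract the body as a string; using regex on HTML is generally not recommended, but on a plain string it is fine).
Code output:
json$state$searchItemList$resultList$prices
gives you a list of length 32 made up of data frames. You can see that each data frame holds the information you want in originalPrice (the row where the label column == (Normal)).
Not every item has an original price. Below is a simple, though not necessarily the most efficient, way to write out the values: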
l <- json$state$searchItemList$resultList$prices
for (i in l){
if (length(i$originalPrice)>1){
print(i$originalPrice[2])
} else {
print("No original price")
}
}
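If a vector is more convenient than printed output, the same check can be written with vapply. This is a sketch only: the list `l` below is a hypothetical stand-in with the shape described above (one data frame per item, with label and originalPrice columns), and missing original prices become NA.

```r
# Hypothetical stand-in for json$state$searchItemList$resultList$prices
l <- list(
  data.frame(label = c("(Internet)", "(Normal)"),
             originalPrice = c("1,699", "2,399"),
             stringsAsFactors = FALSE),
  data.frame(label = "(Internet)",
             originalPrice = "2,199",
             stringsAsFactors = FALSE)
)

# Second originalPrice entry when present, otherwise NA
original_price <- vapply(l, function(p) {
  if (length(p$originalPrice) > 1) p$originalPrice[2] else NA_character_
}, character(1))

print(original_price)
```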
R
library(rvest)
library(jsonlite)
library(stringr)
url = 'https://www.falabella.com.pe/falabella-pe/category/cat210477/TV-Televisores?page=1'
r <- read_html(url) %>%
html_node('body') %>%
html_text() %>%
toString()
x <- str_match_all(r,'fbra_browseProductListConfig = (.*);')
json <- jsonlite::fromJSON(x[[1]][,2])
print(json$state$searchItemList$resultList$prices)
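Regex explanation: the pattern matches the literal text fbra_browseProductListConfig = and then captures everything up to the last semicolon on that line in group 1, which is the JSON string.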
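The extraction step can be exercised without hitting the site. In this sketch `body_text` is a hypothetical, heavily shortened stand-in for the page body; the field names inside it follow the description in this answer and are an assumption about the real payload.

```r
library(stringr)
library(jsonlite)

# Hypothetical page body: the config assignment sits between other statements
body_text <- paste0(
  'var other = 1;\n',
  'var fbra_browseProductListConfig = ',
  '{"state":{"searchItemList":{"resultList":[',
  '{"prices":[{"label":"(Internet)","originalPrice":"1,699"},',
  '{"label":"(Normal)","originalPrice":"2,399"}]},',
  '{"prices":[{"label":"(Internet)","originalPrice":"2,199"}]}',
  ']}}};\n',
  'var after = 2;'
)

# Group 1 captures everything between "= " and the last ";" on that line
m <- str_match_all(body_text, 'fbra_browseProductListConfig = (.*);')
json <- fromJSON(m[[1]][, 2])

# One data frame of prices per item
prices <- json$state$searchItemList$resultList$prices
print(prices)
```

Because `.` does not match a newline by default in stringr, the greedy capture stops at the end of the line that holds the assignment.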
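Answer 2 (score: 1)
The content appears to be dynamic, so the data comes from somewhere else. I looked for GET responses carrying the data as JSON, XML, and so on, but found nothing. For now I would go with RSelenium, whose findElement can extract the correct nodes. You can then use whichever method you prefer to extract the numbers from the resulting strings.
You can also use clickElement to move through the pages. For more information, see Issue scraping page with "Load more" button with rvest.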
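Answer 3 (score: 1)
Since you are also considering RSelenium, here is a solution with that package.
You can find elements, for example, via xpath. In your case the xpath would be: /html/body/div/main/div/div/div/section/div/div/div/div/div/a/div/p.
It is similar to @gersht's solution, but uses only RSelenium.
Reproducible example: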
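library(RSelenium)
# URL taken from the question
url <- "https://www.falabella.com.pe/falabella-pe/category/cat210477/TV-Televisores?page=1"
# Start a Selenium server and browser, then open the page
rD <- rsDriver()
remDr <- rD$client
remDr$navigate(url)
# Select the parent element of each price pair
priceElems = remDr$findElements(
using = "xpath",
value = "/html/body/div/main/div/div/div/section/div/div/div/div/div/a/div[@class = 'fb-price-list']"
)
# Visible text of each parent: the prices separated by a newline
rawPrices = sapply(
X = priceElems,
FUN = function(elem) elem$getElementText()
)
# Split each text into the Internet price and the Normal price
splitted = sapply(
X = rawPrices,
FUN = strsplit,
split = "\nS/"
)
# A missing second price becomes NA
prices = data.frame(
internetPrices = sapply(splitted, "[", 1),
normalPrices = sapply(splitted, "[", 2)
)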
Result/output:
> head(prices, 8)
internetPrices normalPrices
1 S/ 1,099 (Internet) 1,599 (Normal)
2 S/ 2,299 (Internet) 3,999 (Normal)
3 S/ 1,699 (Internet) 2,399 (Normal)
4 S/ 999 (Internet) 1,149 (Normal)
5 S/ 999 (Internet) 1,399 (Normal)
6 S/ 1,399 (Internet) 1,699 (Normal)
7 S/ 2,199 (Internet) <NA>
8 S/ 2,699 (Internet) 4,999 (Normal)
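The split-and-index step can also be followed by a conversion to numbers. This is a base R sketch on hypothetical raw strings shaped like the element text above; `to_number` is a helper introduced here for illustration.

```r
# Hypothetical element texts: two prices, then an item with only one price
raw_prices <- c("S/ 1,699 (Internet)\nS/ 2,399 (Normal)",
                "S/ 2,199 (Internet)")

# Split on the newline that precedes the second "S/"
parts <- strsplit(raw_prices, "\nS/", fixed = TRUE)

# Keep digits and the decimal point only, then convert; NA stays NA
to_number <- function(x) as.numeric(gsub("[^0-9.]", "", x))

prices_num <- data.frame(
  internet = to_number(vapply(parts, `[`, character(1), 1)),
  normal = to_number(vapply(parts, `[`, character(1), 2))
)
print(prices_num)
```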
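Setup:
If needed, see here for how to set up RSelenium: How to set up rselenium for R?.
Edit:
Following the comments, in order to capture empty elements as well, I now fetch the parent element and then process the price text.
The parent element is /html/body/div/main/div/div/div/section/div/div/div/div/div/a/div[@class = 'fb-price-list'], and it contains an empty string if one of the prices is not available.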
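The parent-first idea can be illustrated on static HTML. Note that this sketch swaps in rvest rather than RSelenium, and the snippet is a hypothetical fragment with one complete price list and one that lacks the Normal price. Selecting the parents first and then one child per parent keeps the result aligned with the products.

```r
library(rvest)

# Hypothetical fragment: the second price list has no Normal price
html <- paste0(
  '<div>',
  '<div class="fb-price-list"><p class="fb-price">S/ 1,699 (Internet)</p>',
  '<p class="fb-price">S/ 2,399 (Normal)</p></div>',
  '<div class="fb-price-list"><p class="fb-price">S/ 2,199 (Internet)</p></div>',
  '</div>'
)

# One parent node per product
parents <- read_html(html) %>%
  html_nodes(xpath = "//div[@class = 'fb-price-list']")

# html_node returns exactly one result per parent, NA when the child is absent
normal <- parents %>%
  html_node(xpath = "./p[2]") %>%
  html_text()

print(normal)
```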