I am trying to scrape an e-commerce webpage. I have identified the class name of the containers (the cells holding each product) as 'c3e8SH'.
Then I scraped all of the containers on that page with the code below. Afterwards, I used

len(containers)

to check the number of containers on the page. However, it returns 0. Can someone point out what I am doing wrong? Thank you very much!
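The scraping code itself is not shown above; a minimal sketch of what it presumably looked like, assuming the usual urllib + BeautifulSoup setup and taking the URL from the answer below (everything else here is an assumption), is:

from urllib.request import urlopen
from bs4 import BeautifulSoup as soup

# hypothetical reconstruction of the question's scraping code
my_url = 'https://www.lazada.sg/catalog/?spm=a2o42.home.search.1.488d46b5mJGzEu&q=switch%20games&_keyori=ss&from=search_history&sugg=switch%20games_0_1'

page_html = urlopen(my_url).read()
page_soup = soup(page_html, "html.parser")

# look up the product cells by the class name identified in the browser
containers = page_soup.findAll("div", {"class": "c3e8SH"})
print(len(containers))  # prints 0 here, as described in the question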
Answer 0 (score: 2)
Cookies are needed. If you request the link without cookies, you only get the validation page.

The items are loaded dynamically, which is why the content you see in a web browser is different from what you get through code.

For convenience, I prefer to use the requests module:
import requests

my_url = 'https://www.lazada.sg/catalog/?spm=a2o42.home.search.1.488d46b5mJGzEu&q=switch%20games&_keyori=ss&from=search_history&sugg=switch%20games_0_1'

# cookies copied from a browser session; without them the request
# only returns the validation page instead of the search results
cookies = {
    "Hm_lvt_7cd4710f721b473263eed1f0840391b4": "1548133175,1548135160,1548135844",
    "Hm_lpvt_7cd4710f721b473263eed1f0840391b4": "1548135844",
    "x5sec": "7b22617365727665722d6c617a6164613b32223a223862623264333633343063393330376262313364633537653564393939303732434c50706d754946454e2b4b356f7231764b4c643841453d227d",
}

ret = requests.get(my_url, cookies=cookies)
print("New Super Mario Bros" in ret.text)  # True
# ret.text now contains the shop items as JSON-style data
A shop item in that data looks like this, for example:
item_json = {
    "@context": "https://schema.org",
    "@type": "ItemList",
    "itemListElement": [
        {
            "offers": {
                "priceCurrency": "SGD",
                "@type": "Offer",
                "price": "72.90",
                "availability": "https://schema.org/InStock"
            },
            "image": "https://sg-test-11.slatic.net/p/ae0494e8a5eb7412830ac9822984f67a.jpg",
            "@type": "Product",
            "name": "Nintendo Switch New Super Mario Bros U Deluxe",  # item name
            "url": "https://www.lazada.sg/products/nintendo-switch-new-super-mario-bros-u-deluxe-i292338164-s484601143.html?search=1"
        },
        ...
    ]
}
As the JSON data shows, you can get each item's name, URL, price, availability, and so on from it.
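The extraction step itself is not shown in the answer; a minimal sketch, assuming the item list is embedded in the page as a schema.org <script type="application/ld+json"> block (which is what the structure above suggests), could look like this:

import json
import re

def extract_items(html):
    # scan every JSON-LD <script> block and return the ItemList entries, if any
    for m in re.finditer(
        r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>',
        html,
        re.DOTALL,
    ):
        try:
            data = json.loads(m.group(1))
        except ValueError:
            continue
        if isinstance(data, dict) and data.get("@type") == "ItemList":
            return data["itemListElement"]
    return []

# ret is the response object from the requests.get() call above
for item in extract_items(ret.text):
    print(item["name"], item["offers"]["price"], item["url"])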
Answer 1 (score: 0)
Try a different parser. I recommend lxml.

The line where you create page_soup would then be:
page_soup = soup(page_html, 'lxml')
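Note that lxml has to be installed separately (for example with pip install lxml); a fuller version of that snippet, assuming the standard bs4 import and the variables from the question, might be:

from bs4 import BeautifulSoup as soup

page_soup = soup(page_html, 'lxml')  # use the lxml parser instead of html.parser
containers = page_soup.findAll("div", {"class": "c3e8SH"})
print(len(containers))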
Answer 2 (score: 0)
I tried to find the class c3e8SH in the document you suggested using a regex, but I could not find any such class name. Please check your file again.