I am trying to scrape an e-commerce webpage. I have identified the class name of the containers (the cells holding each product) as 'c3e8SH'.
Then I scraped all of the containers on that page with the code below. Afterwards, I used

len(containers)

to check the number of containers on the page. However, it returns 0. Can someone point out what I am doing wrong? Thank you very much!
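The scraping code itself is not shown above; a minimal sketch of what it presumably looked like, assuming the usual urllib + BeautifulSoup setup and taking the URL from the answer below (everything else here is an assumption), is:

from urllib.request import urlopen
from bs4 import BeautifulSoup as soup

# hypothetical reconstruction of the question's scraping code
my_url = 'https://www.lazada.sg/catalog/?spm=a2o42.home.search.1.488d46b5mJGzEu&q=switch%20games&_keyori=ss&from=search_history&sugg=switch%20games_0_1'

page_html = urlopen(my_url).read()
page_soup = soup(page_html, "html.parser")

# look up the product cells by the class name identified in the browser
containers = page_soup.findAll("div", {"class": "c3e8SH"})
print(len(containers))  # prints 0 here, as described in the question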
Answer 0 (score: 2)
Cookies are needed. If you request the link without cookies, you only get the validation page.

The items are loaded dynamically, which is why the content you see in a web browser is different from what you get through code.

For convenience, I prefer to use the requests module:
import requests

my_url = 'https://www.lazada.sg/catalog/?spm=a2o42.home.search.1.488d46b5mJGzEu&q=switch%20games&_keyori=ss&from=search_history&sugg=switch%20games_0_1'

# cookies copied from a browser session; without them the request
# only returns the validation page instead of the search results
cookies = {
    "Hm_lvt_7cd4710f721b473263eed1f0840391b4": "1548133175,1548135160,1548135844",
    "Hm_lpvt_7cd4710f721b473263eed1f0840391b4": "1548135844",
    "x5sec": "7b22617365727665722d6c617a6164613b32223a223862623264333633343063393330376262313364633537653564393939303732434c50706d754946454e2b4b356f7231764b4c643841453d227d",
}

ret = requests.get(my_url, cookies=cookies)
print("New Super Mario Bros" in ret.text)  # True
# ret.text now contains the shop items as JSON-style data
A shop item in that data looks like this, for example:
item_json = {
    "@context": "https://schema.org",
    "@type": "ItemList",
    "itemListElement": [
        {
            "offers": {
                "priceCurrency": "SGD",
                "@type": "Offer",
                "price": "72.90",
                "availability": "https://schema.org/InStock"
            },
            "image": "https://sg-test-11.slatic.net/p/ae0494e8a5eb7412830ac9822984f67a.jpg",
            "@type": "Product",
            "name": "Nintendo Switch New Super Mario Bros U Deluxe",  # item name
            "url": "https://www.lazada.sg/products/nintendo-switch-new-super-mario-bros-u-deluxe-i292338164-s484601143.html?search=1"
        },
        ...
    ]
}
As the JSON data shows, you can get each item's name, URL, price, availability, and so on from it.
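The extraction step itself is not shown in the answer; a minimal sketch, assuming the item list is embedded in the page as a schema.org <script type="application/ld+json"> block (which is what the structure above suggests), could look like this:

import json
import re

def extract_items(html):
    # scan every JSON-LD <script> block and return the ItemList entries, if any
    for m in re.finditer(
        r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>',
        html,
        re.DOTALL,
    ):
        try:
            data = json.loads(m.group(1))
        except ValueError:
            continue
        if isinstance(data, dict) and data.get("@type") == "ItemList":
            return data["itemListElement"]
    return []

# ret is the response object from the requests.get() call above
for item in extract_items(ret.text):
    print(item["name"], item["offers"]["price"], item["url"])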
Answer 1 (score: 0)
Try a different parser. I recommend lxml.

The line where you create page_soup would then be:
page_soup = soup(page_html, 'lxml')
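Note that lxml has to be installed separately (for example with pip install lxml); a fuller version of that snippet, assuming the standard bs4 import and the variables from the question, might be:

from bs4 import BeautifulSoup as soup

page_soup = soup(page_html, 'lxml')  # use the lxml parser instead of html.parser
containers = page_soup.findAll("div", {"class": "c3e8SH"})
print(len(containers))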
Answer 2 (score: 0)
I tried to find the class c3e8SH in the document you suggested using a regex, but I could not find any such class name. Please check your file again.