Question

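I want to fetch several "pages" of a website, but for some reason the correct URLs do not return the expected results.

I checked the URL that should be used and it works fine, so I tried generating it with a variable.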

import requests
from lxml import html

MLinks = []  # URLs of the pages
PLinks = []  # Links of products

for i in range(1, 100):
    MLinks.append("https://#p" + str(i))

for url in MLinks:
    MainR = requests.get(url)

    SMHTree = html.fromstring(MainR.content)
    MainData = SMHTree.xpath('//@*')  # every attribute value in the page
    for value in MainData:
        if 'something' in value:
            PLinks.append(value)  # Links of products

I expect to get every page, but when I read the content I always get the content of the first page.

Answer 1

I assume the URLs you want to request look like this:

https://somehost.com/products/#p1
https://somehost.com/products/#p2
https://somehost.com/products/#p3
...

That is, the MLinks.append line in your code is actually

MLinks.append("https://somehost.com/products/#p" + str(i))

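When the request is made, the server never sees the part after the # (this part is called the fragment, or anchor). The server therefore just receives 99 requests for "https://somehost.com/products/", and they all return the same result. MDN explains this further: https://developer.mozilla.org/en-US/docs/Learn/Common_questions/What_is_a_URL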
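
To see this without sending any requests, here is a small sketch using only the standard library. It assumes the page URLs have the "https://somehost.com/products/#pN" shape guessed above; urldefrag splits each URL into the part that goes to the server and the fragment that stays in the client.

```python
from urllib.parse import urldefrag

# Same URLs the loop in the question builds (the host name is an assumption).
urls = ["https://somehost.com/products/#p" + str(i) for i in range(1, 100)]

# urldefrag() returns the URL without its fragment, plus the fragment itself.
sent_to_server = {urldefrag(u).url for u in urls}
fragments = [urldefrag(u).fragment for u in urls]

print(len(urls))        # 99 different strings on the client side...
print(sent_to_server)   # ...but only one distinct URL once the fragment is dropped
print(fragments[:3])    # ['p1', 'p2', 'p3'] never leave the client
```

All 99 strings collapse to the single URL "https://somehost.com/products/", which is why every response looks like the first page.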

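Client-side JavaScript sometimes uses the fragment to load pages dynamically. This means that if you open "https://somehost.com/products/" and navigate to "https://somehost.com/products/#p5", the client-side JavaScript notices the change and (usually) sends a request to some other URL to load the products of page 5. That other URL will not be "https://somehost.com/products/#p5"! To find out what it is, open your browser's developer tools and watch which network requests are made when you browse to a different product page.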
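
Once the developer tools have shown you the real URL, the loop could look roughly like the sketch below. The endpoint "https://somehost.com/api/products" and the "page" query parameter are made up for illustration; replace both with whatever the Network tab shows for your site. The session argument is anything with a requests-style get() method, such as requests.Session().

```python
from urllib.parse import urlencode

# Hypothetical endpoint and parameter name, NOT taken from the real site:
# use the values you find in the browser's Network tab instead.
API_URL = "https://somehost.com/api/products"
PAGE_PARAM = "page"


def page_url(page):
    """Build the URL for one page; the page number goes in the query string,
    which (unlike a fragment) is sent to the server."""
    return API_URL + "?" + urlencode({PAGE_PARAM: page})


def fetch_pages(session, pages):
    """Fetch each page with session.get() and return the response bodies."""
    bodies = []
    for page in pages:
        response = session.get(page_url(page))
        response.raise_for_status()  # fail on HTTP errors instead of parsing an error page
        bodies.append(response.content)
    return bodies


# Usage (needs the requests package and a real endpoint):
#   import requests
#   bodies = fetch_pages(requests.Session(), range(1, 100))
```

Each body can then go through html.fromstring() as in your original loop, assuming the endpoint returns HTML; if it returns JSON, parse it with response.json() instead.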
