Question

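I am trying to scrape content from the Financial Times search page.

Using Requests, I can easily scrape the articles' titles and hyperlinks.

I would like to get the hyperlink to the next page, but unlike the articles' titles and hyperlinks, I cannot find it in the Requests response.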

from bs4 import BeautifulSoup
import requests

url = 'http://search.ft.com/search?q=SABMiller+PLC&t=all&rpp=100&fa=people%2Corganisations%2Cregions%2Csections%2Ctopics%2Ccategory%2Cbrand&s=-lastPublishDateTime&f=lastPublishDateTime[2000-01-01T00%3A00%3A00%2C2016-01-01T23%3A59%3A59]&curations=ARTICLES%2CBLOGS%2CVIDEOS%2CPODCASTS&highlight=true&p=1et'

response = requests.get(url, auth=('username', 'password'))  # placeholders: substitute your own login credentials

soup = BeautifulSoup(response.text, "lxml")

def get_titles_and_links():
    titles = soup.find_all('a')
    for ref in titles:
        if ref.get('title') and ref.get('onclick'):
            print(ref.get('href'))
            print(ref.get('title'))

The get_titles_and_links() function gives me the titles and links of all the articles.

However, a similar function for the next page returns nothing:

def get_next_page():
    next_page = soup.find_all("li", class_="page next")
    return next_page

Or:

def get_next_page():
    next_page = soup.find_all('li')
    for ref in next_page:
        if ref.get('page next'):
            print(ref.get('page next'))

Answer 1

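If you can see the links you want in the page source in your browser but cannot get them through requests or urllib, it can mean one of two things:

1. There is something wrong with your logic. Let's assume that is not the case.
2. What remains is Ajax: the parts of the page you are looking for are loaded by JavaScript after document.onload fires. So you cannot get something that is not there in the first place.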
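
One way to tell the two cases apart is to inspect the raw HTML that the server returned, before any JavaScript has run. The following is a minimal sketch using only the standard library; `classes_of` is a hypothetical helper name, and the sample markup is made up for illustration rather than taken from the FT page.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collect the class attribute of every start tag with a given name."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.classes = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            # attrs is a list of (name, value) pairs; a missing class becomes "".
            self.classes.append(dict(attrs).get("class") or "")


def classes_of(html, tag):
    """Return the class strings of all <tag> elements found in the static HTML."""
    collector = ClassCollector(tag)
    collector.feed(html)
    return collector.classes


# Illustrative markup only: a static response that lacks the pagination item.
sample = '<ul><li class="result">a</li><li>b</li></ul>'
has_next = "page next" in classes_of(sample, "li")
```

With a real response you would call `classes_of(response.text, 'li')`. If `'page next'` never shows up in the result while the browser's inspector does show it, the element is most likely being added by JavaScript after the page loads.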

My solutions (more like suggestions):

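Reverse engineer the network requests. Difficult, but universally applicable. This is what I do personally. You might want to use the re module.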
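
As a rough sketch of this approach: open the browser's developer tools, watch the Network tab while clicking through to the next page, and note which request carries the page number. The code below assumes, purely for illustration, that the page number travels in a query parameter called `page` and that URLs containing it appear as quoted strings in the page's scripts. The parameter name and the helper names `find_paged_urls` and `with_page` are hypothetical, and the real site may work differently.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def find_paged_urls(text, param="page"):
    """Return quoted URLs in text that carry a numeric `param` query value."""
    pattern = re.compile(
        r"""["']([^"']*[?&]""" + re.escape(param) + r"""=\d+[^"']*)["']"""
    )
    return pattern.findall(text)


def with_page(url, page, param="page"):
    """Return url with its `param` query value set to the given page number."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    query[param] = str(page)
    return urlunsplit(parts._replace(query=urlencode(query)))


# Made-up inline script, standing in for what a real page might embed.
script = 'var next = "/search?q=x&page=2";'
candidates = find_paged_urls(script)
following = [with_page(u, 3) for u in candidates]
```

Each URL built this way can then be fetched with `requests.get` exactly like the first page, so no browser is needed once the request pattern is known.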

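Find something that renders JavaScript. That is to say, simulate web browsing. You might want to look at the webdriver component of selenium, Qt, and so on. This is easier, but rather memory hungry, and it consumes a lot more network resources than the first approach.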
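
A minimal sketch of the browser-rendering route follows. It relies only on two things a Selenium WebDriver provides, the `get()` method and the `page_source` attribute, and polls until a marker string shows up in the rendered HTML. The helper name `fetch_rendered_html` and the marker text are assumptions for illustration; Selenium also ships its own `WebDriverWait` for waiting on elements, which is likely the better choice in real code.

```python
import time


def fetch_rendered_html(driver, url, marker, timeout=10.0, poll=0.5):
    """Load url in a browser driver and return the HTML once marker appears.

    `driver` is any object with get(url) and page_source, such as a
    Selenium WebDriver. Raises TimeoutError if marker never shows up.
    """
    driver.get(url)
    deadline = time.monotonic() + timeout
    while True:
        html = driver.page_source  # HTML after JavaScript has modified the DOM
        if marker in html:
            return html
        if time.monotonic() >= deadline:
            raise TimeoutError("marker %r not found in rendered page" % marker)
        time.sleep(poll)


# Intended use with Selenium (third-party, needs a browser and its driver):
#
#   from selenium import webdriver
#   driver = webdriver.Firefox()
#   try:
#       html = fetch_rendered_html(driver, url, "page next")
#   finally:
#       driver.quit()
```

The returned string can be passed to `BeautifulSoup(html, "lxml")` in place of `response.text`, so the rest of the scraping code stays the same.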

Python requests: can't scrape all the HTML code from a page
