Question

I want to download data from a set of web pages.

Here is an example URL:

http://www.signalpeptide.de/index.php?sess=&m=listspdb_mammalia&s=details&id=3&listname=

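My problems are:

The 'id=' number in the URL changes from page to page.
I want to loop through and retrieve every page in the database.
Some ids will be missing (e.g. there may be pages for id=3 and id=6 but none for id=4 and id=5).
I don't know what the final id is (e.g. the last page in the database could be id=100000 or id=1000000000; I have no idea).

I know that the few lines of code I need will generally create a list of numbers somehow, then loop over those numbers with the code below to pull down the text of each page (parsing the text itself is a job for another day):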

import urllib2
from bs4 import BeautifulSoup

id_name = "3"  # example id; must be a string for the concatenation below
web_page = "http://www.signalpeptide.de/index.php?sess=&m=listspdb_mammalia&s=details&id=" + id_name + "&listname="
page = urllib2.urlopen(web_page)
soup = BeautifulSoup(page, 'html.parser')

Can anyone suggest the best way to say "fetch all pages" that copes with the missing pages, given that I don't know what the last page is?

Answer 1

To get the possible pages, you can do something like this (my example is Python 3):

import re
from urllib.request import urlopen
from lxml import html

ITEMS_PER_PAGE = 50

base_url = 'http://www.signalpeptide.de/index.php'
url_params = '?sess=&m=listspdb_mammalia&start={}&orderby=id&sortdir=asc'


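def get_pages(total):
    # Values for the "start" URL parameter: 50, 100, ... up to the record count.
    pages = list(range(ITEMS_PER_PAGE, total, ITEMS_PER_PAGE))
    # range() stops short of total, so append it as the final start value.
    # The "not pages" check avoids an IndexError when total <= ITEMS_PER_PAGE.
    if not pages or pages[-1] < total:
        pages.append(total)
    return pages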

def generate_links():
    # Fetch one listing page to read the total record count from its pagination text.
    start_url = base_url + url_params.format(ITEMS_PER_PAGE)
    page = urlopen(start_url).read()
    dom = html.fromstring(page)
    xpath = '//div[@class="content"]/table[1]//tr[1]/td[3]/text()'
    pagination_text = dom.xpath(xpath)[0]
    # The record count is the token that follows "of" in the pagination text.
    total = int(re.findall(r'of\s(\w+)', pagination_text)[0])
    print(f'Number of records to scrape: {total}')
    pages = get_pages(total)
    # Lazily build one listing URL per start value.
    links = (base_url + url_params.format(i) for i in pages)
    return links

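Basically, it fetches the first page and reads the number of records from it. Assuming there are 50 records per page, the get_pages() function computes the values to pass to the start parameter, and from those all the pagination URLs are generated. You then need to fetch each of those pages, iterate over the table of proteins on it, and go to each details page to get the information you need, using BeautifulSoup or lxml with XPath. I tried fetching all of these pages concurrently with asyncio and the server timed out :). Hope my functions help!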
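
The "iterate over the table and go to the details page" step could be sketched roughly as follows, using only the standard library. This is a sketch under assumptions, not something checked against the live site: it assumes each row of a listing page links to its details page with an href containing s=details&id=N (the same shape as the URL in the question). The names DetailLinkParser, extract_detail_ids, fetch_detail_pages and DETAILS_URL are made up for illustration. Because only ids that actually appear in the listings are collected, missing ids are skipped without having to guess the last one.

```python
import time
from html.parser import HTMLParser
from urllib.parse import parse_qs, urlparse
from urllib.request import urlopen

# Details URL pattern taken from the question; {} is the record id.
DETAILS_URL = ('http://www.signalpeptide.de/index.php'
               '?sess=&m=listspdb_mammalia&s=details&id={}&listname=')


class DetailLinkParser(HTMLParser):
    """Collect the id of every <a> whose query string has s=details (assumed link shape)."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        href = dict(attrs).get('href')
        if not href:
            return
        query = parse_qs(urlparse(href).query)
        record_id = query.get('id', [''])[0]
        if query.get('s') == ['details'] and record_id.isdigit():
            self.ids.append(int(record_id))


def extract_detail_ids(listing_html):
    """Return the distinct record ids linked from one listing page, in page order."""
    parser = DetailLinkParser()
    parser.feed(listing_html)
    return list(dict.fromkeys(parser.ids))  # drop duplicates, keep order


def fetch_detail_pages(ids, delay=1.0):
    """Yield (id, raw_html) one page at a time, pausing between requests."""
    for record_id in ids:
        with urlopen(DETAILS_URL.format(record_id)) as response:
            yield record_id, response.read()
        time.sleep(delay)
```

Fetching sequentially with a short pause is a deliberate choice here, since requesting everything at once made the server time out. The delay of one second is an arbitrary starting value.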
