How can I scrape the data on the next pages as I did on the home page?

Time: 2018-10-12 05:07:03

Tags: python python-3.x web-scraping beautifulsoup

I have the following code:

from bs4 import BeautifulSoup
import requests
import csv

url = "https://coingecko.com/en"
base_url = "https://coingecko.com"
page = requests.get(url)
soup = BeautifulSoup(page.content, "html.parser")
# collect each coin's name and detail-page link from the listing page
names = [div.a.span.text for div in soup.find_all("div", attrs={"class": "coin-content center"})]
links = [base_url + div.a["href"] for div in soup.find_all("div", attrs={"class": "coin-content center"})]
for link in links:
    inner_page = requests.get(link)
    inner_soup = BeautifulSoup(inner_page.content, "html.parser")
    # the coin's description sits in the first div with class "py-2"
    indent = inner_soup.find("div", attrs={"class": "py-2"})
    content = indent.div.next_siblings
    all_content = [sibling for sibling in content if sibling.string is not None]
    print(all_content)

I have successfully reached the inner pages and scraped the information for every coin listed on the first page. But there are further pages, e.g. 1, 2, 3, 4, 5, 6, 7, 8, 9 and so on. How can I visit all of the following pages and do the same thing there?

Also, the output of my code contains a lot of \n characters and extra whitespace. How can I fix that?

2 Answers:

Answer 0 (score: 1):

You need to generate the URL of every page, request each one in turn, and parse it with bs4:

from bs4 import BeautifulSoup
import requests

req = requests.get('https://www.coingecko.com/en')
soup = BeautifulSoup(req.content, 'html.parser')
# the last pagination link's href carries the highest page number, e.g. "?page=94"
last_page = soup.select('ul.pagination li:nth-of-type(8) > a:nth-of-type(1)')[0]['href']
last_page_number = int(last_page.split('=')[-1])
for page in range(1, last_page_number + 1):
    url = 'https://www.coingecko.com/en?page=' + str(page)
    print(url)
    requests.get(url)  # request each page one by one up to the last page
    # parse your fields here using bs4, as in the sketch below
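
For instance, the loop body could reuse the selectors from the question to pull each coin's name and detail link on every page. This is only a sketch: the "coin-content center" class and the pagination selector come from the snippets above and will break if CoinGecko changes its markup.

from bs4 import BeautifulSoup
import requests

base_url = 'https://www.coingecko.com'

req = requests.get(base_url + '/en')
soup = BeautifulSoup(req.content, 'html.parser')
# highest page number, taken from the last pagination link as above
last_page = soup.select('ul.pagination li:nth-of-type(8) > a:nth-of-type(1)')[0]['href']

for page in range(1, int(last_page.split('=')[-1]) + 1):
    listing = requests.get(base_url + '/en?page=' + str(page))
    page_soup = BeautifulSoup(listing.content, 'html.parser')
    # same selectors the question uses on the first page
    for div in page_soup.find_all('div', attrs={'class': 'coin-content center'}):
        print(div.a.span.text, base_url + div.a['href'])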

Answer 1 (score: 0):

The way you have written your script looks messy. Try using .select() to make it concise and less fragile. Although I could not find any further use of names in your script, I kept it as is. Here is how you can get all the available links across multiple pages:

from bs4 import BeautifulSoup
from urllib.parse import urljoin
import requests

url = "https://coingecko.com/en"

while True:
    page = requests.get(url)
    soup = BeautifulSoup(page.text, "lxml")
    names = [item.text for item in soup.select("span.d-lg-block")]
    for link in [urljoin(url, item["href"]) for item in soup.select(".coin-content a")]:
        inner_page = requests.get(link)
        inner_soup = BeautifulSoup(inner_page.text, "lxml")
        # strip=True drops the surrounding newlines and whitespace
        desc = [item.get_text(strip=True) for item in inner_soup.select(".py-2 p") if item.text]
        print(desc)

    # follow the rel="next" pagination link; on the last page select_one()
    # returns None, the ['href'] lookup raises TypeError, and the loop ends
    try:
        url = urljoin(url, soup.select_one(".pagination a[rel='next']")['href'])
    except TypeError:
        break

By the way, using .get_text(strip=True) also takes care of the whitespace problem.
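
For a quick, self-contained illustration of the difference:

from bs4 import BeautifulSoup

snippet = BeautifulSoup("<p>\n   Bitcoin is a cryptocurrency.  \n</p>", "html.parser")
print(repr(snippet.p.text))                  # '\n   Bitcoin is a cryptocurrency.  \n'
print(repr(snippet.p.get_text(strip=True)))  # 'Bitcoin is a cryptocurrency.'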