Iterating over pages in Python using requests and BeautifulSoup

Time: 2017-05-03 09:44:14

Tags: python web-scraping beautifulsoup

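I am trying to extract links from a website. The results span multiple pages, so I use a loop to go through them. The problem is that the contents of soup and new_links just repeat. The URL passed to requests.get does change, and I double-checked the links to make sure the content at each URL differs, which it does.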

new_links stays the same regardless of the loop iteration.

Can anyone explain how I might fix this?

def get_links(root_url):

    list_of_links = []

    # how many pages should we scroll through ? currently set to 20
    for i in range(1,3):
        r = requests.get(root_url+"&page={}.".format(i))
        soup = BeautifulSoup(r.content, 'html.parser')
        new_links = soup.find_all("li", {"class": "padding-all"})
        list_of_links.extend(new_links)

    print(list_of_links)

    return list_of_links

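1 answer:

Answer 0 (score: 0)

You need to enumerate the links inside the li elements you are looking for. It is best to add each one to a set() to remove duplicates, which can then be converted to a sorted list for the return value. Note that the format string in the question, "&page={}.", ends with a stray period, which the version below drops; that may be why every request returned the same page.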

from bs4 import BeautifulSoup
import requests

def get_links(root_url):
    set_of_links = set()

    # Pages to request: range(1, 3) fetches pages 1 and 2
    for i in range(1, 3):
        r = requests.get(root_url+"&page={}".format(i))
        soup = BeautifulSoup(r.content, 'html.parser')

        for li in soup.find_all("li", {"class": "padding-all"}):
            for a in li.find_all('a', href=True):
                set_of_links.add(a['href'])

    return sorted(set_of_links)

for index, link in enumerate(get_links("http://borsen.dk/soegning.html?query=iot"), start=1):
    print(index, link)

This gives you:

1 http://borsen.dk/nyheder/avisen/artikel/11/102926/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6ODtzOjM6IklPVCI7fQ,,
2 http://borsen.dk/nyheder/avisen/artikel/11/111767/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6ODtzOjM6IklPVCI7fQ,,
3 http://borsen.dk/nyheder/avisen/artikel/11/111771/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6ODtzOjM6IklPVCI7fQ,,
4 http://borsen.dk/nyheder/avisen/artikel/11/111776/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6ODtzOjM6IklPVCI7fQ,,
5 http://borsen.dk/nyheder/avisen/artikel/11/111789/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6ODtzOjM6IklPVCI7fQ,,
6 http://borsen.dk/nyheder/avisen/artikel/11/114652/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6ODtzOjM6IklPVCI7fQ,,
7 http://borsen.dk/nyheder/avisen/artikel/11/114677/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6ODtzOjM6IklPVCI7fQ,,
8 http://borsen.dk/nyheder/avisen/artikel/11/117729/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6NjtzOjM6IklPVCI7fQ,,
9 http://borsen.dk/nyheder/avisen/artikel/11/122984/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6NjtzOjM6IklPVCI7fQ,,
10 http://borsen.dk/nyheder/avisen/artikel/11/124160/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6NjtzOjM6IklPVCI7fQ,,
11 http://borsen.dk/nyheder/avisen/artikel/11/130267/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6NjtzOjM6IklPVCI7fQ,,
12 http://borsen.dk/nyheder/avisen/artikel/11/130268/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6NjtzOjM6IklPVCI7fQ,,
13 http://borsen.dk/nyheder/avisen/artikel/11/130272/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6NjtzOjM6IklPVCI7fQ,,
14 http://borsen.dk/nyheder/avisen/artikel/11/130882/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6NjtzOjM6IklPVCI7fQ,,
15 http://borsen.dk/nyheder/avisen/artikel/11/132641/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6NjtzOjM6IklPVCI7fQ,,
16 http://borsen.dk/nyheder/avisen/artikel/11/145430/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6NjtzOjM6IklPVCI7fQ,,
17 http://borsen.dk/nyheder/avisen/artikel/11/149967/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6NjtzOjM6IklPVCI7fQ,,
18 http://borsen.dk/nyheder/avisen/artikel/11/151618/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6NjtzOjM6IklPVCI7fQ,,
19 http://borsen.dk/nyheder/avisen/artikel/11/158183/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6NjtzOjM6IklPVCI7fQ,,
20 http://borsen.dk/nyheder/avisen/artikel/11/158769/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6NjtzOjM6IklPVCI7fQ,,
21 http://borsen.dk/nyheder/avisen/artikel/11/44962/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6ODtzOjM6IklPVCI7fQ,,
22 http://borsen.dk/nyheder/avisen/artikel/11/93884/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6ODtzOjM6IklPVCI7fQ,,
23 http://borsen.dk/nyheder/avisen/artikel/11/93890/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6ODtzOjM6IklPVCI7fQ,,
24 http://borsen.dk/nyheder/avisen/artikel/11/93896/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6ODtzOjM6IklPVCI7fQ,,
25 http://borsen.dk/nyheder/executive/artikel/11/161556/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6NjtzOjM6IklPVCI7fQ,,
26 http://borsen.dk/nyheder/virksomheder/artikel/1/315489/rapport_digitale_tiltag_kan_transformere_danske_selskaber.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6NjtzOjM6IklPVCI7fQ,,
27 http://borsen.dk/nyheder/virksomheder/artikel/1/337498/danske_virksomheder_overser_den_digitale_revolution.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6NjtzOjM6IklPVCI7fQ,,
28 http://borsen.dk/opinion/blogs/view/17/3614/tingenes_internet__hvornaar_bliver_det_til_virkelighed.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6NjtzOjM6IklPVCI7fQ,,
29 http://borsen.dk/opinion/blogs/view/17/4235/digitalisering_og_nye_forretningsmodeller.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6NjtzOjM6IklPVCI7fQ,,
30 http://ledelse.borsen.dk/artikel/1/323424/burde_digitalisering_vaere_hoejere_paa_listen_over_foretrukne_ledelsesvaerktoejer.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6NjtzOjM6IklPVCI7fQ,,
31 http://pleasure.borsen.dk/gadget/artikel/1/305849/digital_butler_styrer_din_kommende_bolig.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6NjtzOjM6IklPVCI7fQ,,
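
Because the page number is spliced into the URL by string formatting, a single stray character is enough to change what the server sees. One way to rule that out is to build the query string with the standard library instead. The following is only a minimal sketch: the helper name page_url is hypothetical, and it assumes the site reads the page number from a query parameter called page, as the URLs in the question suggest.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def page_url(root_url, page):
    """Return root_url with its `page` query parameter set to `page`.

    Hypothetical helper: assumes the site paginates via a `page` parameter.
    """
    parts = urlsplit(root_url)
    # Keep every existing parameter except an earlier `page` value.
    query = [(key, value)
             for key, value in parse_qsl(parts.query, keep_blank_values=True)
             if key != "page"]
    query.append(("page", str(page)))
    return urlunsplit(parts._replace(query=urlencode(query)))


for i in range(1, 3):
    print(page_url("http://borsen.dk/soegning.html?query=iot", i))
```

Printing the generated URLs before requesting them makes it easy to confirm that each iteration really targets a different page.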

It might make more sense to look for the link in the next page button rather than guessing how many pages to iterate over, for example:

from bs4 import BeautifulSoup
import requests

def get_links(root_url):
    links = []

    while True:
        print(root_url)
        r = requests.get(root_url)
        soup = BeautifulSoup(r.content, 'html.parser')

        for li in soup.find_all("li", {"class": "padding-all"}):
            for a in li.find_all('a', href=True)[:1]:
                links.append(a['href'])

        next_page = soup.find("div", {"class": "next-container"})

        if next_page:
            next_url = next_page.find("a", href=True)

            if next_url:
                root_url = next_url['href']
            else:
                break
        else:
            break

    return links
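
The loop above follows whatever href the next page button carries. If that href were relative, or if the last page linked back to a page already visited, the loop would fail or never end. The sketch below shows one possible guard, under those assumptions only; the helper name next_page_url and the seen set are hypothetical and not part of the original answer.

```python
from urllib.parse import urljoin


def next_page_url(current_url, href, seen):
    """Resolve href against current_url.

    Returns None when there is no href or the resolved URL was already
    visited, so the caller can stop paginating. Hypothetical helper.
    """
    if not href:
        return None
    absolute = urljoin(current_url, href)  # also handles absolute hrefs
    if absolute in seen:
        return None
    seen.add(absolute)
    return absolute


start = "http://borsen.dk/soegning.html?query=iot"
seen = {start}
print(next_page_url(start, "soegning.html?query=iot&page=2", seen))
```

Inside the while loop, root_url = next_url['href'] could then be replaced by a call to this helper, breaking out of the loop when it returns None.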