BeautifulSoup can't find more than 24 classes with find_all

Time: 2018-08-09 21:23:51

Tags: python html web-scraping beautifulsoup html-parsing

I'm trying to scrape data from a page that stores all of its items like this:

Diss p p'

There are hundreds of them, but when I try to add them to an array, only 24 are saved:

Diss q q'

Is there a problem with re.compile? How else can I search for the class?

Thanks for your help.

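1 answer:

Answer 0: (score: 0)

In this case, only 24 items are loaded into the DOM when you visit the page, so find_all only ever sees those 24. Two options come to mind: 1) use a headless browser to click the "Load more" button and load more items into the DOM, or 2) build a simple pagination scheme and loop over the pages.

Here is an example of the second option: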

import requests
from bs4 import BeautifulSoup as soup

for page in range(0, 10):
    print("Trying page # {}".format(page))
    # The first page has no suffix; later pages use a "-p{n}" suffix.
    if page == 0:
        my_url = 'https://www.alza.co.uk/tablets/18852388.html'
    else:
        my_url = 'https://www.alza.co.uk/tablets/18852388-p{}.html'.format(page)

    page_html = requests.get(my_url)
    page_soup = soup(page_html.content, "lxml")
    items = page_soup.find_all('div', {"class": "browsingitem"})
    print("Found a total of {}".format(len(items)))
    for item in items:
        # Search inside the current item rather than the whole page.
        title = item.find('a', 'browsinglink')

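As you can see, the pagination is built into the URL, so all you need to do is work out how many pages to scrape and then save all of that information. The output is: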

Trying page # 0
Found a total of 24
Trying page # 1
Found a total of 24
Trying page # 2
Found a total of 24
Trying page # 3
Found a total of 24
Trying page # 4
Found a total of 24
Trying page # 5
Found a total of 24
Trying page # 6
Found a total of 24
Trying page # 7
Found a total of 24
Trying page # 8
Found a total of 17
Trying page # 9
Found a total of 0
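
Since the last page in that output comes back with 0 items, one way to avoid hard-coding the page count might be to stop at the first empty page. The following is only a minimal sketch of that idea: `page_url`, `collect_until_empty`, `fetch_items`, `max_pages`, and `fake_fetch` are hypothetical names that are not part of the original answer, and the fake fetcher simply replays the item counts shown above so the loop can be tried without touching the network.

```python
from typing import Callable, List

BASE = 'https://www.alza.co.uk/tablets/18852388'


def page_url(page: int) -> str:
    """Build the URL for a zero-based page index, using the answer's scheme."""
    if page == 0:
        return BASE + '.html'
    return '{}-p{}.html'.format(BASE, page)


def collect_until_empty(fetch_items: Callable[[str], List],
                        max_pages: int = 50) -> List:
    """Fetch page after page and stop at the first page with no items.

    fetch_items is a hypothetical callable that takes a URL and returns
    the list of items found there. max_pages is an assumed safety cap so
    the loop cannot run forever if a site never returns an empty page.
    """
    collected = []
    for page in range(max_pages):
        items = fetch_items(page_url(page))
        if not items:
            break
        collected.extend(items)
    return collected


# Stand-in fetcher that replays the counts from the output above:
# eight pages of 24 items, one page of 17, then nothing.
counts = [24] * 8 + [17]
fake_pages = {page_url(i): ['item'] * n for i, n in enumerate(counts)}


def fake_fetch(url: str) -> List:
    return fake_pages.get(url, [])


all_items = collect_until_empty(fake_fetch)
print(len(all_items))  # 209
```

In real use, `fetch_items` would presumably wrap the `requests.get` and `find_all` calls from the loop above and return the list of `browsingitem` divs for one URL.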