BeautifulSoup findAll() doesn't show every tag

Time: 2019-07-11 09:00:04

Tags: python html web-scraping beautifulsoup

For educational purposes, I am trying to scrape every image on the 9gag.com/hot page; I am learning Python and web scraping. This is my code, in a very basic form:

import requests, os, bs4

url = 'https://9gag.com/hot'            
os.makedirs('9gag', exist_ok=True)   

print('Downloading page %s...' % url)

res = requests.get(url)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')
find = soup.findAll("img")

print(find)

This is the HTML I am working with:

printscreen

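I am having trouble understanding how findAll() and the other soup methods work: when I run this code, no tags are found, even though the page has plenty of them. I don't know how I should look things up: by tag, by tag and class, by parent, or some other way?

2 answers:

Answer 0: (score: 0)

You don't see any <img> tags because the page loads its content dynamically via AJAX. If you open the developer tools in Firefox or Chrome, you will see that the main content is loaded as JSON from a different URL: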

import requests, json

url = 'https://9gag.com/v1/featured-posts'

print('Downloading page %s...' % url)
res = requests.get(url)
res.raise_for_status()
data = res.json()
print(json.dumps(data, indent=4))

Prints:

Downloading page https://9gag.com/v1/featured-posts...
{
    "meta": {
        "timestamp": 1562836411,
        "status": "Success",
        "sid": "9gVQ01EVjlHTUVkMMRVT1wEVFVTTn1TY"
    },
    "data": {
        "items": [
            {
                "itemId": "27568",
                "title": "The Corgi Who Plays Cheddar On Brooklyn Nine-Nine Has Passed Away",
                "url": "https://9gag.com/gag/adLm8rZ",
                "imageURL": "https://miscmedia-9gag-fun.9cache.com/images/featured/1562834921.0526_hYra9u_300.jpg",
                "upVoteCount": 19,
                "commentsCount": 12
            },
            {
                "itemId": "27566",
                "title": "Star Wars Reveals First Look At Sith Trooper For 'The Rise Of Skywalker'",
                "url": "https://9gag.com/gag/aZLGyEW",
                "imageURL": "https://miscmedia-9gag-fun.9cache.com/images/featured/1562833129.2422_NUTeny_300.jpg",
                "upVoteCount": 21,
                "commentsCount": 26
            },

... and so on.
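
Building on the JSON above, a minimal sketch of pulling out the image URLs and saving them could look like the following. It assumes the response keeps the data -> items -> imageURL layout shown in the output. The helper names extract_image_urls and download_images are made up for illustration, and the endpoint may have changed since this was written.

```python
import os


def extract_image_urls(payload):
    """Return the imageURL of every item in a featured-posts style payload."""
    items = payload.get("data", {}).get("items", [])
    return [item["imageURL"] for item in items if "imageURL" in item]


def download_images(urls, folder="9gag"):
    """Save each URL into `folder`, named after the last path segment."""
    import requests  # imported here so the parsing helper works without it

    os.makedirs(folder, exist_ok=True)
    for url in urls:
        res = requests.get(url)
        res.raise_for_status()
        filename = os.path.join(folder, os.path.basename(url))
        with open(filename, "wb") as fh:
            fh.write(res.content)


# Sample trimmed from the output above, so this part runs offline.
sample = {
    "data": {
        "items": [
            {"itemId": "27568",
             "imageURL": "https://miscmedia-9gag-fun.9cache.com/images/featured/1562834921.0526_hYra9u_300.jpg"},
            {"itemId": "27566",
             "imageURL": "https://miscmedia-9gag-fun.9cache.com/images/featured/1562833129.2422_NUTeny_300.jpg"},
        ]
    }
}

urls = extract_image_urls(sample)
print(urls)
```

To fetch for real, pass the result of res.json() from the request above to extract_image_urls and hand the returned list to download_images.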

Answer 1: (score: 0)

As already mentioned, the content is loaded dynamically. You can use requests_html, which supports JavaScript, instead of requests.
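
A rough sketch of that approach, assuming requests_html is installed (pip install requests-html); the first call to render() downloads a Chromium build, so it can take a while. The function names below are hypothetical, and whether the 9gag images actually appear after rendering is an assumption that has not been checked here.

```python
def collect_img_srcs(elements):
    """Return the src attribute of each element that has one."""
    return [el.attrs["src"] for el in elements if "src" in el.attrs]


def rendered_img_srcs(url):
    """Load `url`, run its JavaScript, and list the <img> sources."""
    from requests_html import HTMLSession  # third-party, imported lazily

    session = HTMLSession()
    r = session.get(url)
    r.html.render()  # executes the page's JavaScript in headless Chromium
    return collect_img_srcs(r.html.find("img"))
```

Calling rendered_img_srcs('https://9gag.com/hot') would then return whatever <img> sources exist once the scripts have run, which is the part that plain requests plus BeautifulSoup cannot see.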
