Question

For educational purposes, I am trying to scrape every image on the 9gag.com/hot page; I am learning Python and web scraping. Here is my code in its very basic form:

import requests, os, bs4

url = 'https://9gag.com/hot'            
os.makedirs('9gag', exist_ok=True)   

print('Downloading page %s...' % url)

res = requests.get(url)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')
find = soup.findAll("img")

print(find)

Here is the HTML file I am working with:

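I am having trouble understanding how findAll() and the other soup methods work: when I run this code, no <img> tags are found even though the page has plenty of them. I do not know how I am supposed to find things: by tag, by tag and its class, by parent, or some other way?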

Answer 1

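You do not see any <img> tags because the page loads its content dynamically via AJAX. If you open the developer tools in Firefox or Chrome, you will see that the main content is loaded as JSON from a different URL: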

import requests, json

url = 'https://9gag.com/v1/featured-posts'

print('Downloading page %s...' % url)
res = requests.get(url)
res.raise_for_status()
data = res.json()
print(json.dumps(data, indent=4))

Prints:

Downloading page https://9gag.com/v1/featured-posts...
{
    "meta": {
        "timestamp": 1562836411,
        "status": "Success",
        "sid": "9gVQ01EVjlHTUVkMMRVT1wEVFVTTn1TY"
    },
    "data": {
        "items": [
            {
                "itemId": "27568",
                "title": "The Corgi Who Plays Cheddar On Brooklyn Nine-Nine Has Passed Away",
                "url": "https://9gag.com/gag/adLm8rZ",
                "imageURL": "https://miscmedia-9gag-fun.9cache.com/images/featured/1562834921.0526_hYra9u_300.jpg",
                "upVoteCount": 19,
                "commentsCount": 12
            },
            {
                "itemId": "27566",
                "title": "Star Wars Reveals First Look At Sith Trooper For 'The Rise Of Skywalker'",
                "url": "https://9gag.com/gag/aZLGyEW",
                "imageURL": "https://miscmedia-9gag-fun.9cache.com/images/featured/1562833129.2422_NUTeny_300.jpg",
                "upVoteCount": 21,
                "commentsCount": 26
            },

... and so on.
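
To get from that JSON to image files, something along these lines might work. It is only a sketch: it assumes the payload keeps the shape printed above (a "data" object holding "items", each with an "imageURL"), and the helper names extract_image_urls, filename_for and download_images are made up for illustration, not part of requests or 9gag's API.

```python
import os
from urllib.parse import urlparse


def extract_image_urls(payload):
    """Return the imageURL of every item in a featured-posts style payload."""
    # Assumed shape: {"data": {"items": [{"imageURL": "..."}, ...]}}
    items = payload.get("data", {}).get("items", [])
    return [item["imageURL"] for item in items if item.get("imageURL")]


def filename_for(image_url):
    """Derive a local file name from the last path segment of the URL."""
    return os.path.basename(urlparse(image_url).path)


def download_images(payload, folder="9gag"):
    """Download every image referenced in the payload into `folder`."""
    import requests  # imported here so the helpers above work without it

    os.makedirs(folder, exist_ok=True)
    saved = []
    for image_url in extract_image_urls(payload):
        res = requests.get(image_url)
        res.raise_for_status()
        path = os.path.join(folder, filename_for(image_url))
        # Image data is binary, so write res.content rather than res.text.
        with open(path, "wb") as fh:
            fh.write(res.content)
        saved.append(path)
    return saved
```

With the data dictionary from the code above, download_images(data) would save the files into the 9gag folder that the question's script creates.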

Answer 2

As mentioned, the content is loaded dynamically. Instead of requests, you can use requests_html, which supports JavaScript.
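
A minimal sketch of that approach, assuming requests_html is installed (pip install requests-html) and that the rendered page exposes its images as ordinary <img> tags with a src attribute. The function names image_sources and rendered_image_sources are hypothetical, and the sleep value is a guess rather than a tested setting.

```python
def image_sources(elements):
    """Collect the src attribute of each element that has one."""
    return [el.attrs["src"] for el in elements if "src" in el.attrs]


def rendered_image_sources(url="https://9gag.com/hot"):
    """Fetch `url`, run its JavaScript, and return the <img> src values."""
    from requests_html import HTMLSession  # third-party package

    session = HTMLSession()
    res = session.get(url)
    res.raise_for_status()
    # render() executes the page's JavaScript in headless Chromium
    # (downloaded on first use); sleep gives AJAX content time to arrive.
    res.html.render(sleep=2)
    return image_sources(res.html.find("img"))
```

Calling rendered_image_sources() should return a list of image URLs once the scripts have run, though the first call can be slow because Chromium has to be downloaded.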



BeautifulSoup findAll() does not show every tag
