Why doesn't this web scraping script work?

Asked: 2020-04-24 23:58:58

Tags: python web-scraping beautifulsoup

Brief explanation

I am trying to use bs4 to search for posts on Instagram by a given hashtag.
The element I am looking for is div class="v1Nh3":

What I did

import requests
from bs4 import BeautifulSoup

target = "https://www.instagram.com/explore/tags/test"
html = requests.get(target)
soup = BeautifulSoup(html.content, "html.parser")
root = soup.find(id="react-root")
posts = soup.find_all("div", class_="v1Nh3")

However, if I print the variable posts, I get an empty list. Printing root gives a strange result:

<div id="react-root">
    <span><svg height="50" style="position:absolute;top:50%;left:50%;margin:-25px 0 0 -25px;fill:#c7c7c7" viewbox="0 0 50 50" width="50"><path d="M25 1c-6.52 0-7.34.03-9.9.14-2.55.12-4.3.53-5.82 1.12a11.76 11.76 0 0 0-4.25 2.77 11.76 11.76 0 0 0-2.77 4.25c-.6 1.52-1 3.27-1.12 5.82C1.03 17.66 1 18.48 1 25c0 6.5.03 7.33.14 9.88.12 2.56.53 4.3 1.12 5.83a11.76 11.76 0 0 0 2.77 4.25 11.76 11.76 0 0 0 4.25 2.77c1.52.59 3.27 1 5.82 1.11 2.56.12 3.38.14 9.9.14 6.5 0 7.33-.02 9.88-.14 2.56-.12 4.3-.52 5.83-1.11a11.76 11.76 0 0 0 4.25-2.77 11.76 11.76 0 0 0 2.77-4.25c.59-1.53 1-3.27 1.11-5.83.12-2.55.14-3.37.14-9.89 0-6.51-.02-7.33-.14-9.89-.12-2.55-.52-4.3-1.11-5.82a11.76 11.76 0 0 0-2.77-4.25 11.76 11.76 0 0 0-4.25-2.77c-1.53-.6-3.27-1-5.83-1.12A170.2 170.2 0 0 0 25 1zm0 4.32c6.4 0 7.16.03 9.69.14 2.34.11 3.6.5 4.45.83 1.12.43 1.92.95 2.76 1.8a7.43 7.43 0 0 1 1.8 2.75c.32.85.72 2.12.82 4.46.12 2.53.14 3.29.14 9.7 0 6.4-.02 7.16-.14 9.69-.1 2.34-.5 3.6-.82 4.45a7.43 7.43 0 0 1-1.8 2.76 7.43 7.43 0 0 1-2.76 1.8c-.84.32-2.11.72-4.45.82-2.53.12-3.3.14-9.7.14-6.4 0-7.16-.02-9.7-.14-2.33-.1-3.6-.5-4.45-.82a7.43 7.43 0 0 1-2.76-1.8 7.43 7.43 0 0 1-1.8-2.76c-.32-.84-.71-2.11-.82-4.45a166.5 166.5 0 0 1-.14-9.7c0-6.4.03-7.16.14-9.7.11-2.33.5-3.6.83-4.45a7.43 7.43 0 0 1 1.8-2.76 7.43 7.43 0 0 1 2.75-1.8c.85-.32 2.12-.71 4.46-.82 2.53-.11 3.29-.14 9.7-.14zm0 7.35a12.32 12.32 0 1 0 0 24.64 12.32 12.32 0 0 0 0-24.64zM25 33a8 8 0 1 1 0-16 8 8 0 0 1 0 16zm15.68-20.8a2.88 2.88 0 1 0-5.76 0 2.88 2.88 0 0 0 5.76 0z"></path></svg></span>
</div>

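I guess this behavior has something to do with React, but I'm not sure. So my questions are:
- Why does this happen?
- Is it possible to do this with bs4, and if so, how?
- If not, can it be done with selenium or another tool?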

1 answer:

Answer 0 (score: 1)

The post data for the tag is contained in a JSON object in the page source, i.e.:

import requests, json, re

u = "https://www.instagram.com/explore/tags/test/"
html = requests.get(u).text

matches = re.findall(r"window\._sharedData = (\{.*:false\});</script>", html, re.IGNORECASE | re.MULTILINE)
if matches:
    test = json.loads(matches[0])
    # browse the json object at https://jsoneditoronline.org/#left=cloud.5931e80efee541f69a856daf31a96d1b

    for n in test['entry_data']['TagPage'][0]['graphql']['hashtag']['edge_hashtag_to_media']['edges']:
        shortcode = n['node']['shortcode']
        display_url = n['node']['display_url']
        thumbnail_src = n['node']['thumbnail_src']
        is_video = n['node']['is_video']
        accessibility_caption = n['node']['accessibility_caption']
        taken_at_timestamp = n['node']['taken_at_timestamp']
        owner = n['node']['owner']['id']
        edge_liked_by = n['node']['edge_liked_by']['count']
        # ...

        print(shortcode, display_url, edge_liked_by, owner)

        if n['node']['edge_media_to_caption']['edges']:
            for tags in n['node']['edge_media_to_caption']['edges']:
                post_text = tags['node']['text']
                print(post_text)

Output:

B_YtTM0nAe3 https://scontent-iad3-1.cdninstagram.com/v/t51.2885-15/e35/94292033_2909314699138403_547642880292448809_n.jpg?_nc_ht=scontent-iad3-1.cdninstagram.com&_nc_cat=111&_nc_ohc=LPorzz52Fe8AX-80CFG&oh=7a0379f33ec4f42ad36506da32bb40ae&oe=5ECB8677 1 5912200464
#mcq #602 #commercenewsguruji #bestoftheday #instadaily #instalike #igdaily #igers  #instalove #instagood #instadaily #dailymcq #mcqswithanswers #assessment #assessmenttest #assessmentmcq #test #practice
...
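
On the "why": requests only downloads the initial HTML and does not run JavaScript, so any elements the page builds in the browser (such as the v1Nh3 divs) are not in the response, and bs4 has nothing to find. The regex above also depends on the JSON ending in ":false}", which may not always hold. The following is a minimal sketch of a less brittle extraction, assuming the page still embeds "window._sharedData = {...};" in a script tag (Instagram may have changed this since the answer was written). The names extract_shared_data, iter_posts and SAMPLE_HTML are hypothetical, and the sample data is made up so the snippet runs offline.

```python
import json

MARKER = "window._sharedData = "


def extract_shared_data(html):
    """Return the object assigned to window._sharedData, or None if absent."""
    start = html.find(MARKER)
    if start == -1:
        return None
    try:
        # raw_decode parses exactly one JSON value starting at the given index
        # and ignores whatever follows it (here the trailing ";</script>").
        data, _end = json.JSONDecoder().raw_decode(html, start + len(MARKER))
    except json.JSONDecodeError:
        return None
    return data


def iter_posts(shared_data):
    """Yield (shortcode, like_count, caption) for each post node found."""
    try:
        edges = (shared_data["entry_data"]["TagPage"][0]["graphql"]
                 ["hashtag"]["edge_hashtag_to_media"]["edges"])
    except (KeyError, IndexError, TypeError):
        return  # the structure is not what we assumed; yield nothing
    for edge in edges:
        node = edge.get("node", {})
        caption_edges = node.get("edge_media_to_caption", {}).get("edges", [])
        caption = caption_edges[0]["node"]["text"] if caption_edges else ""
        likes = node.get("edge_liked_by", {}).get("count")
        yield node.get("shortcode"), likes, caption


# Made-up sample that mimics the assumed page layout (not real Instagram data).
_sample = {"entry_data": {"TagPage": [{"graphql": {"hashtag": {
    "edge_hashtag_to_media": {"edges": [{"node": {
        "shortcode": "abc123",
        "edge_liked_by": {"count": 7},
        "edge_media_to_caption": {"edges": [{"node": {"text": "#test"}}]},
    }}]}}}}]}}
SAMPLE_HTML = ('<div id="react-root"></div><script>' + MARKER
               + json.dumps(_sample) + ';</script>')

if __name__ == "__main__":
    for post in iter_posts(extract_shared_data(SAMPLE_HTML)):
        print(post)
```

To use it on a live page, pass requests.get(u).text to extract_shared_data instead of SAMPLE_HTML. Reading optional keys with .get avoids a KeyError when a post lacks a field.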