How to scrape hashtags from Instagram using Python and Beautiful Soup

Time: 2020-05-25 15:36:46

Tags: javascript python html beautifulsoup

I am trying to find the related tags for the most popular hashtags on Instagram, but I get nothing back when I use BeautifulSoup.

import requests
import html5lib
import csv
from bs4 import BeautifulSoup

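def list_of_tags(tags):
    related_tags = []
    tmp = []
    # for el in tags:
    # Hard-coded to a single tag while debugging
    url = "https://www.instagram.com/explore/tags/love/"
    req = requests.get(url)
    # Parse the raw response body; no JavaScript is executed here
    soup = BeautifulSoup(req.content, 'html5lib')
    print(soup)
    # Look for the div that holds the related tags in the browser view
    r_tag = soup.find('div', attrs={'class': 'WSpok'})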

I have scraped other websites with similar code and it worked. However, when I try it on Instagram, I do not get any of the page content in the HTML:

<!DOCTYPE html>
<html class="no-js not-logged-in client-root" lang="en"><head>
        <meta charset="utf-8"/>
        <meta content="IE=edge" http-equiv="X-UA-Compatible"/>

        <title>
#love hashtag on Instagram • Photos and Videos
</title>


        <meta content="noimageindex, noarchive" name="robots"/>
        <meta content="default" name="apple-mobile-web-app-status-bar-style"/>
        <meta content="yes" name="mobile-web-app-capable"/>
        <meta content="#ffffff" name="theme-color"/>
        <meta content="width=device-width, initial-scale=1, minimum-scale=1, maximum-scale=1, viewport-fit=cover" id="viewport" name="viewport"/>
        <link href="/data/manifest.json" rel="manifest"/>

        <link as="style" crossorigin="anonymous" href="/static/bundles/metro/ConsumerUICommons.css/0d73027e4285.css" rel="preload" type="text/css"/>
<link as="style" crossorigin="anonymous" href="/static/bundles/metro/ConsumerAsyncCommons.css/638f1bd337c8.css" rel="preload" type="text/css"/>
<link as="style" crossorigin="anonymous" href="/static/bundles/metro/Consumer.css/3e0c88f3bf5f.css" rel="preload" type="text/css"/>
<link as="style" crossorigin="anonymous" href="/static/bundles/metro/TagPageContainer.css/47d968faa0fd.css" rel="preload" type="text/css"/>
<link as="script" crossorigin="anonymous" href="/static/bundles/metro/Vendor.js/5a56d51ae30f.js" rel="preload" type="text/javascript"/>
<link as="script" crossorigin="anonymous" href="/static/bundles/metro/en_US.js/d9caef98221d.js" rel="preload" type="text/javascript"/>
<link as="script" crossorigin="anonymous" href="/static/bundles/metro/ConsumerLibCommons.js/e38c6c343804.js" rel="preload" type="text/javascript"/>
<link as="script" crossorigin="anonymous" href="/static/bundles/metro/ConsumerUICommons.js/7906f44838ea.js" rel="preload" type="text/javascript"/>
<link as="script" crossorigin="anonymous" href="/static/bundles/metro/ConsumerAsyncCommons.js/2196e3e614ee.js" rel="preload" type="text/javascript"/>
<link as="script" crossorigin="anonymous" href="/static/bundles/metro/Consumer.js/624d9b8ef745.js" rel="preload" type="text/javascript"/>
<link as="script" crossorigin="anonymous" href="/static/bundles/metro/TagPageContainer.js/63ead1147672.js" rel="preload" type="text/javascript"/>



        <script type="text/javascript">
        (function() {
  var docElement = document.documentElement;
  var classRE = new RegExp('(^|\\s)no-js(\\s|$)');
  var className = docElement.className;
  docElement.className = className.replace(classRE, '$1js$2');
})();
</script>
        <script type="text/javascript">
(function() {
  if ('PerformanceObserver' in window && 'PerformancePaintTiming' in window) {
    window.__bufferedPerformance = [];
    var ob = new PerformanceObserver(function(e) {
      window.__bufferedPerformance.push.apply(window.__bufferedPerformance,e.getEntries());
    });
    ob.observe({entryTypes:['paint']});
  }

  window.__bufferedErrors = [];
  window.onerror = function(message, url, line, column, error) {
    window.__bufferedErrors.push({
      message: message,
      url: url,
      line: line,
      column: column,
      error: error
    });
    return false;
  };
  window.__initialData = {
    pending: true,
    waiting: []
  };
  function asyncFetchSharedData(extra) {
    var sharedDataReq = new XMLHttpRequest();
    sharedDataReq.onreadystatechange = function() {
          if (sharedDataReq.readyState === 4) {
            if(sharedDataReq.status === 200){
              var sharedData = JSON.parse(sharedDataReq.responseText);
              window.__initialDataLoaded(sharedData, extra);
            }
          }
        }
    sharedDataReq.open('GET', '/data/shared_data/', true);
    sharedDataReq.send(null);
  }
  function notifyLoaded(item, data) {
    item.pending = false;
    item.data = data;
    for (var i = 0;i < item.waiting.length; ++i) {
      item.waiting[i].resolve(item.data);
    }
    item.waiting = [];
  }
  function notifyError(item, msg) {
    item.pending = false;
    item.error = new Error(msg);
    for (var i = 0;i < item.waiting.length; ++i) {
      item.waiting[i].reject(item.error);
    }
    item.waiting = [];
  }
  window.__initialDataLoaded = function(initialData, extraData) {
    if (extraData) {
      for (var key in extraData) {
        initialData[key] = extraData[key];
      }
    }
    notifyLoaded(window.__initialData, initialData);
  };
  window.__initialDataError = function(msg) {
    notifyError(window.__initialData, msg);
  };
  window.__additionalData = {};
  window.__pendingAdditionalData = function(paths) {
    for (var i = 0;i < paths.length; ++i) {
      window.__additionalData[paths[i]] = {
        pending: true,
        waiting: []
      };
    }
  };
  window.__additionalDataLoaded = function(path, data) {
    if (path in window.__additionalData) {
      notifyLoaded(window.__additionalData[path], data);
    } else {
      console.error('Unexpected additional data loaded "' + path + '"');
    }
  };
  window.__additionalDataError = function(path, msg) {
    if (path in window.__additionalData) {
      notifyError(window.__additionalData[path], msg);
    } else {
      console.error('Unexpected additional data encountered an error "' + path + '": ' + msg);
    }
  };

})();
</script><script type="text/javascript">

/*
 Copyright 2018 Google Inc. All Rights Reserved.
 Licensed under the Apache License, Version 2.0 (the "License");
 you may not use this file except in compliance with the License.
 You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
*/

(function(){function g(a,c){b||(b=a,f=c,h.forEach(function(a){removeEventListener(a,l,e)}),m())}function m(){b&&f&&0<d.length&&(d.forEach(function(a){a(b,f)}),d=[])}function n(a,c){function k(){g(a,c);d()}function b(){d()}function d(){removeEventListener("pointerup",k,e);removeEventListener("pointercancel",b,e)}addEventListener("pointerup",k,e);addEventListener("pointercancel",b,e)}function l(a){if(a.cancelable){var c=performance.now(),b=a.timeStamp;b>c&&(c=+new Date);c-=b;"pointerdown"==a.type?n(c,
a):g(c,a)}}var e={passive:!0,capture:!0},h=["click","mousedown","keydown","touchstart","pointerdown"],b,f,d=[];h.forEach(function(a){addEventListener(a,l,e)});window.perfMetrics=window.perfMetrics||{};window.perfMetrics.onFirstInputDelay=function(a){d.push(a);m()}})();
</script>

                <link href="/static/images/ico/apple-touch-icon-76x76-precomposed.png/666282be8229.png" rel="apple-touch-icon-precomposed" sizes="76x76"/>
                <link href="/static/images/ico/apple-touch-icon-120x120-precomposed.png/8a5bd3f267b1.png" rel="apple-touch-icon-precomposed" sizes="120x120"/>
                <link href="/static/images/ico/apple-touch-icon-152x152-precomposed.png/68193576ffc5.png" rel="apple-touch-icon-precomposed" sizes="152x152"/>
                <link href="/static/images/ico/apple-touch-icon-167x167-precomposed.png/4985e31c9100.png" rel="apple-touch-icon-precomposed" sizes="167x167"/>
                <link href="/static/images/ico/apple-touch-icon-180x180-precomposed.png/c06fdb2357bd.png" rel="apple-touch-icon-precomposed" sizes="180x180"/>

                    <link href="/static/images/ico/favicon-192.png/68d99ba29cc8.png" rel="icon" sizes="192x192"/>



                    <link color="#262626" href="/static/images/ico/favicon.svg/fc72dd4bfde8.svg" rel="mask-icon"/>

                  <link href="/static/images/ico/favicon.ico/36b3ee2d91ed.ico" rel="shortcut icon" type="image/x-icon"/>






</head>
    <body class="" style="
    background: white;
">



</body></html>

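I tried selecting a specific div, but that did not work. I also have another approach that uses JSON requests, but I would like to know how to improve this version. Thanks in advance.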
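
The dump above hints at why the lookup fails: the `<body>` is empty and the inline scripts appear to load the data afterwards, so there is no `div` for a parser to find. Below is a minimal sketch of a check for that situation. It uses the standard library's `html.parser` as a stand-in for BeautifulSoup, and `BodyTagCounter`, `body_is_empty` and both sample strings are hypothetical names and data made up for illustration.

```python
from html.parser import HTMLParser


class BodyTagCounter(HTMLParser):
    """Count the element tags that appear inside <body>."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.tags_in_body = 0

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif self.in_body:
            self.tags_in_body += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False


def body_is_empty(html_text):
    """Return True when the <body> of html_text contains no elements."""
    counter = BodyTagCounter()
    counter.feed(html_text)
    counter.close()
    return counter.tags_in_body == 0


# Hypothetical samples: the first mimics the empty body shown above,
# the second is a page whose markup is already present in the response.
js_rendered = (
    '<html><head><title>t</title></head>'
    '<body class="" style="background: white;">\n</body></html>'
)
static_page = '<html><body><div class="WSpok">related</div></body></html>'

print(body_is_empty(js_rendered))
print(body_is_empty(static_page))
```

If a check like this reports an empty body for `req.text`, searching the parsed tree for a class name cannot succeed, whichever parser is used.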

1 Answer:

Answer 0: (score: 0)

I finally got it working with JSON:

import json

import requests


def list_of_tags(tags):
    """Collect the names of the hashtags related to each tag in `tags`."""
    related_tags = []
    for el in tags:
        # Appending ?__a=1 asks for the page data as JSON instead of HTML
        url = "https://www.instagram.com/explore/tags/" + el + "/?__a=1"
        req = requests.get(url)
        data = json.loads(req.text)
        # Related tags are nested under graphql -> hashtag in the response
        edges = data['graphql']['hashtag']['edge_hashtag_to_related_tags']['edges']
        for item in edges:
            related_tags.append(item['node']['name'])
    print(related_tags)
    return related_tags

It will give you all the tags related to the ones you are looking for.
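
The snippet above assumes the response is always JSON with exactly that nesting. Here is a rough sketch of a more defensive version of the parsing step, assuming the endpoint might sometimes return something else (for example an HTML page) or a payload with missing keys. `extract_related_tags`, `parse_response_text` and the sample payload are hypothetical and made up for illustration; only the key path comes from the code above.

```python
import json


def extract_related_tags(payload):
    """Return related tag names from a decoded payload, or [] if the shape differs."""
    try:
        edges = payload["graphql"]["hashtag"]["edge_hashtag_to_related_tags"]["edges"]
    except (KeyError, TypeError):
        return []
    if not isinstance(edges, list):
        return []
    names = []
    for edge in edges:
        try:
            names.append(edge["node"]["name"])
        except (KeyError, TypeError):
            # Skip entries that do not look like {"node": {"name": ...}}
            continue
    return names


def parse_response_text(text):
    """Decode a response body and extract tag names; return [] if it is not JSON."""
    try:
        payload = json.loads(text)
    except json.JSONDecodeError:
        return []
    return extract_related_tags(payload)


# Made-up sample data that only mirrors the key path used in the answer
sample_text = json.dumps({
    "graphql": {
        "hashtag": {
            "edge_hashtag_to_related_tags": {
                "edges": [
                    {"node": {"name": "life"}},
                    {"node": {"name": "happy"}},
                ]
            }
        }
    }
})

print(parse_response_text(sample_text))
print(parse_response_text("<html></html>"))
```

In `list_of_tags`, `related_tags.extend(parse_response_text(req.text))` could then replace the `json.loads` call and the inner loop, so an unexpected response for one tag yields no tags for it instead of raising an exception.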

Hope this helps someone.