Scraping a web page

Date: 2017-10-14 18:38:14

Tags: python web-scraping python-requests

I am trying to write a Python script to scrape data from this webpage. I want the data in the second table ('class': 'char-pico-table') and am using this script to get it:

import requests

def getPICO(url):
    # Fetch the page and print the raw response body (bytes)
    r = requests.get(url)
    print(r.content)

However, this prints:

b'<!DOCTYPE html>\n<html class="view">\n  <head>\n    <title>RobotReviewer: Automating evidence synthesis</title>\n    <meta charset="utf-8">\n    <meta name="viewport" content="width=device-width, initial-scale=1.0">\n    <meta name="google" content="notranslate">\n\n    <link rel="stylesheet" type="text/css" href="//maxcdn.bootstrapcdn.com/font-awesome/4.3.0/css/font-awesome.min.css">\n    <link rel="stylesheet" type="text/css" href="/css/main.css">\n    <link rel="stylesheet alternative prefetch" type=text/css href="/css/report.css">\n\n    <!-- Preload examples -->\n    <link rel="prefetch" href="/report_view/Tvg0-pHV2QBsYpJxE2KW-/html">\n    <link rel="prefetch" href="/report_view/_fzGUEvWAeRsqYSmNQbBq/html">\n    <link rel="prefetch" href="/report_view/HBkzX1I3Uz_kZEQYeqXJf/html">\n\n    <!-- / Preload examples -->\n\n\n    <script src="/scripts/modernizr.js"></script>\n    <script src="/scripts/spa/scripts/vendor/pdfjs/pdf.js"></script>\n    <script src="/scripts/spa/scripts/vendor/compatibility.js"></script>\n    <script data-main="/scripts/main" src="/scripts/require.js"></script>\n\n    <script>\n     PDFJS.disableWebGL = false;\n     CSRF_TOKEN = "1508009356##6a03b1bf519972b27a0d871ae4823eb3a3366c0c";\n    </script>\n  </head>\n\n  <body>\n    <nav id="top-bar" class="top-bar" data-topbar role="navigation">\n      <div>\n        <ul class="title-area">\n          <li class="name">\n            <h1><a href="/"><img src="/img/logo.svg" width="190px"></a></h1>\n          </li>\n        </ul>\n\n        <section class="top-bar-section">\n          <ul class="right">\n            <li><a href="http://www.robotreviewer.net">About</a></li>\n          </ul>\n        </section>\n      </div>\n    </nav>\n\n    <div id="breadcrumbs"></div>\n\n    <main id="main"></main>\n\n\n  </body>\n</html>'

This is not the output I see when I view the page in a browser: it does not contain the data I want to scrape. Why is that?

When I view the page in a web browser, it looks like this:

Expected Output

1 answer:

Answer 0 (score: 1)
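
The response printed in the question contains an empty `<main id="main"></main>` and a require.js script tag pointing at `/scripts/main`, which suggests the page fills in its content with JavaScript after it loads. `requests` only downloads the HTML and does not execute scripts, so the table never appears in `r.content`. A minimal sketch to confirm that the table is missing from the static HTML, using only the standard library; `ClassFinder` and `has_class` are illustrative names, not part of the original post:

```python
from html.parser import HTMLParser


class ClassFinder(HTMLParser):
    """Collect every CSS class name that appears in an HTML document."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())


def has_class(html, class_name):
    """Return True if any tag in `html` carries the given class."""
    finder = ClassFinder()
    finder.feed(html)
    return class_name in finder.classes


# Shortened excerpt of the body that requests returned in the question.
static_html = (
    '<html class="view"><body><div id="breadcrumbs"></div>'
    '<main id="main"></main></body></html>'
)
print(has_class(static_html, "char-pico-table"))  # False
```

If the check returns False for the downloaded HTML but the table is visible in the browser, the data is most likely being fetched by a separate request that shows up in the browser's network tab.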

Following @Shahin's comment, I wrote code that retrieves the data in JSON format, from which I can easily extract what I need.
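
The answer's own code is not reproduced here. As a minimal sketch of the approach it describes, assuming the page's JavaScript loads its data from a JSON endpoint that can be found in the browser's network tab, the same request can be made directly with `requests`. The function name `fetch_json`, its `session` parameter, and the URL in the usage comment are hypothetical and not taken from the original post:

```python
def fetch_json(url, session=None):
    """Request `url` and return the decoded JSON body.

    `session` is anything with a requests-style `get` method; it defaults
    to a new requests.Session so that a stub can be passed in for testing.
    """
    if session is None:
        import requests  # imported lazily so a stub session needs no network
        session = requests.Session()
    r = session.get(url, headers={"Accept": "application/json"}, timeout=10)
    r.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
    return r.json()


# Usage (hypothetical URL; replace it with the endpoint seen in the network tab):
# data = fetch_json("https://example.invalid/path/to/report/json")
# print(data.keys())
```

The returned value is ordinary Python data (dicts and lists), so the table rows can be read by key instead of being parsed out of HTML.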
