Question

I have this URL: https://www.ft.com/content/87d644fc-73a4-11e7-aca6-c6bd07df1a3c

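It points to an article that requires registration. I registered and can view the content in my browser. However, when I use this code with the URL above: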

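import urllib2  # Python 2 only
from bs4 import BeautifulSoup

url = 'https://www.ft.com/content/87d644fc-73a4-11e7-aca6-c6bd07df1a3c'
soup = BeautifulSoup(urllib2.urlopen(url), 'lxml')
with open('ctp_output.txt', 'w') as f:
    for tag in soup.find_all('p'):
        f.write(tag.text.encode('utf-8') + '\n')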

it gets redirected to the registration page. Is there a way to log in while scraping so that I can access the article?

Answer 1

Here are the basics.

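Go to the login page. If you are using Chrome, you can right-click the email input field (on Windows) and choose "Inspect" from the context menu to reveal the form element that is used to submit the email address. It looks like this.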

<form name="enter-email-form" action="/login/submitEmail" class="js-email-lookup-form" method="POST" data-test-id="enter-email-form" novalidate="true">
        <input type="hidden" name="location" value="https://www.ft.com/content/87d644fc-73a4-11e7-aca6-c6bd07df1a3c">
        <input type="hidden" name="continueUrl" value="">
        <input type="hidden" name="readerId" value="">
        <input type="hidden" name="loginUrl" value="/login?location=https%3A%2F%2Fwww.ft.com%2Fcontent%2F87d644fc-73a4-11e7-aca6-c6bd07df1a3c">
        <div class="lgn-box__title">
            <h1 class="lgn-heading--alpha">Sign in</h1>
        </div>
        <div class="o-forms-group">
            <label for="email" class="o-forms-label">Email address</label>
            <input type="email" id="email" class="o-forms-text js-email" name="email" maxlength="64" autocomplete="off" autofocus="" required="">
            <input type="password" id="password" name="password" style="display:none">
            <label for="password">
        </label></div>
        <div class="o-forms-group">
            <button class="o-buttons o-buttons--standout o-buttons--big" type="submit" name="Next">Next</button>
        </div>
    </form>

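You need to collect the action attribute from the form element, and all of the name-value pairs from the input elements. You can then use these in a POST request made with the requests library.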
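As a rough illustration of the collecting step, here is a minimal Python 3 sketch that pulls the action URL and the input name-value pairs out of a form like the one above. It uses only the standard library's html.parser (BeautifulSoup would do the same job); the names FormFieldCollector and extract_form, and the example.com login page URL, are hypothetical and not part of the site. The submit button is not an input element, so it is not collected.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FormFieldCollector(HTMLParser):
    """Collect the action and the input name/value pairs of one named form."""

    def __init__(self, form_name):
        super().__init__()
        self.form_name = form_name
        self.action = None
        self.fields = {}
        self._inside = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and attrs.get("name") == self.form_name:
            self._inside = True
            self.action = attrs.get("action")
        elif tag == "input" and self._inside and attrs.get("name"):
            # Inputs without a value attribute are recorded as empty strings.
            self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form":
            self._inside = False


def extract_form(html, form_name, page_url):
    """Return (absolute action URL, dict mapping input names to values)."""
    collector = FormFieldCollector(form_name)
    collector.feed(html)
    if collector.action is None:
        raise ValueError("form %r not found" % form_name)
    # The action is relative, so resolve it against the page it came from.
    return urljoin(page_url, collector.action), collector.fields


# Trimmed copy of the form above; the page URL is a placeholder (assumption).
sample = """
<form name="enter-email-form" action="/login/submitEmail" method="POST">
  <input type="hidden" name="location" value="https://www.ft.com/content/87d644fc-73a4-11e7-aca6-c6bd07df1a3c">
  <input type="email" id="email" name="email" maxlength="64">
</form>
"""
action, fields = extract_form(sample, "enter-email-form", "https://example.com/login")
print(action)
print(fields)
```

Before posting, you would fill in fields["email"] with your own address.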

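You just need to do this once for your email address and once for your password. After that, you should be able to issue a GET for the URL with requests.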
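The sequence might be sketched like this in Python 3. It is untested against the real site and rests on several assumptions: that both steps are plain form POSTs, that the login cookies are kept by reusing one requests.Session, and that you have looked up the real action URLs and field names yourself. The function names, the example.com URLs and the password step's action are all hypothetical placeholders.

```python
def login_and_get(session, steps, article_url):
    """POST each (action_url, payload) login step in order, then GET the article.

    `session` should behave like requests.Session, so that cookies set by the
    login responses are sent along with the final GET.
    """
    for action_url, payload in steps:
        response = session.post(action_url, data=payload)
        response.raise_for_status()
    response = session.get(article_url)
    response.raise_for_status()
    return response.text


def example():
    """Usage sketch; not called here because it needs network access."""
    import requests  # third-party; imported lazily on purpose

    article = "https://www.ft.com/content/87d644fc-73a4-11e7-aca6-c6bd07df1a3c"
    steps = [
        # Step 1: email form (host is a placeholder, path taken from the form).
        ("https://example.com/login/submitEmail",
         {"location": article, "email": "you@example.com"}),
        # Step 2: password form (URL and field name are assumptions).
        ("https://example.com/login/submitPassword",
         {"password": "your-password"}),
    ]
    with requests.Session() as session:
        return login_and_get(session, steps, article)
```

The returned HTML can then be handed to BeautifulSoup exactly as in the question.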

I must warn you that I have not actually tried this on this particular site.

Answer 2

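If you want to scrape websites with BeautifulSoup, I suggest using the MechanicalSoup library. It is a very thin layer on top of BeautifulSoup (for parsing HTML) and requests (for fetching pages), but it takes care of things for you such as filling in forms correctly (which is what you need here), following relative links, ...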

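MechanicalSoup is also limited in the sense that it cannot interpret JavaScript code, so it will not work on sites that rely on JavaScript, but it reduces the manual work compared with using BeautifulSoup together with urllib or requests directly.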

(Note: I am one of the authors of MechanicalSoup)
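For illustration only, a two-step login with MechanicalSoup might look roughly like the Python 3 sketch below. It has not been run against ft.com. The login page URL is a placeholder, and the second step assumes that the page after the email form contains a single form with a "password" field, which may not match the real site. fetch_paragraphs and example are hypothetical names; the browser methods used (open, select_form, item assignment, submit_selected, page) are those of mechanicalsoup.StatefulBrowser.

```python
def fetch_paragraphs(browser, login_url, article_url, email, password):
    """Log in through two forms, then return the text of every <p> in the article."""
    browser.open(login_url)

    # Step 1: the email form, selected by its name attribute.
    browser.select_form('form[name="enter-email-form"]')
    browser["email"] = email
    browser.submit_selected()

    # Step 2 (assumption): the next page has one form with a "password" field.
    browser.select_form("form")
    browser["password"] = password
    browser.submit_selected()

    # The browser keeps the session cookies, so the article should now load.
    browser.open(article_url)
    return [p.get_text() for p in browser.page.find_all("p")]


def example():
    """Usage sketch; not called here because it needs network access."""
    import mechanicalsoup  # third-party; imported lazily on purpose

    browser = mechanicalsoup.StatefulBrowser()
    return fetch_paragraphs(
        browser,
        "https://example.com/login",  # placeholder login page (assumption)
        "https://www.ft.com/content/87d644fc-73a4-11e7-aca6-c6bd07df1a3c",
        "you@example.com",
        "your-password",
    )
```

If the login flow depends on JavaScript, this will fail for the reason given above.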

Log in and scrape a website like ft.com with BeautifulSoup
