I want to extract data from https://www.similarweb.com/, but when I run my code it prints the following (the HTML output, converted to text):
Pardon Our Interruption [stylesheet: http://cdn.distilnetworks.com/css/distil.css] [image: http://cdn.distilnetworks.com/images/anomaly-detected.png]
Pardon Our Interruption...
As you were browsing www.similarweb.com something about your browser made us think you were a bot. There are a few reasons this might happen:
You're a power user moving through this website with super-human speed.
You've disabled JavaScript in your web browser.
A third-party browser plugin, such as Ghostery or NoScript, is preventing JavaScript from running. Additional information is available in this support article.
After completing the CAPTCHA below, you will immediately regain access to www.similarweb.com.
if (!RecaptchaOptions){ var RecaptchaOptions = { theme : 'blackglass' }; }
You reached this page when attempting to access https://www.similarweb.com/ from 14.139.82.6 on 2017-05-22 12:02:37 UTC.
Trace: 9d8ae335-8bf6-4218-968d-eadddd0276d6 via 536302e7-b583-4c1f-b4f6-9d7c4c20aed2
I wrote the following code:
import urllib
from BeautifulSoup import *
url = "https://www.similarweb.com/"
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)
print (soup.prettify())
# tags = soup('a')
# for tag in tags:
#     print 'TAG:',tag
#     print tag.get('href', None)
#     print 'Contents:',tag.contents[0]
#     print 'Attrs:',tag.attrs
Can anyone help me figure out how to extract the information?
Answer 0 (score: 1)
I tried requests; it failed. selenium seems to work.
>>> from selenium import webdriver
>>> driver = webdriver.Chrome()
>>> driver.get('https://www.similarweb.com/')
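Once the page has loaded in the browser, the rendered HTML is available as driver.page_source, and it can be parsed much like the string that urllib returned in the question. The following is a minimal sketch in Python 3, not a guaranteed way past the bot check: it assumes Chrome and a matching driver are installed, and the names BLOCK_MARKER, LinkCollector, looks_blocked, extract_links, fetch_with_selenium and main are hypothetical helpers made up for this example. It uses the standard-library HTMLParser so that the parsing part has no extra dependencies; BeautifulSoup could be fed the same string instead.

```python
from html.parser import HTMLParser

# Title of the block page quoted in the question; used as a simple heuristic.
BLOCK_MARKER = "Pardon Our Interruption"


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


def looks_blocked(html):
    """Return True if the HTML looks like the bot-detection page."""
    return BLOCK_MARKER in html


def extract_links(html):
    """Return the href values of all links in the HTML, in document order."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


def fetch_with_selenium(url):
    """Load the URL in a real Chrome browser and return the rendered HTML."""
    # Imported lazily so the helpers above work without selenium installed.
    from selenium import webdriver

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even if loading fails


def main():
    html = fetch_with_selenium("https://www.similarweb.com/")
    if looks_blocked(html):
        print("Still served the bot-detection page")
    else:
        for href in extract_links(html):
            print(href)
```

Calling main() opens a browser window, so it is left uncalled here. Checking for the block page first avoids silently parsing the CAPTCHA page as if it were real content.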