Scraping website traffic from semrush with Beautiful Soup / Python

Date: 2018-08-15 20:19:52

Tags: python web-scraping beautifulsoup

I'm trying to scrape website traffic from semrush.com.

My current code, using BeautifulSoup, is:

from bs4 import BeautifulSoup
import urllib.request

req = urllib.request.Request('https://www.semrush.com/info/burton.com',
                             headers={'User-Agent': 'Magic Browser'})
response = urllib.request.urlopen(req)
raw_data = response.read()
response.close()

soup = BeautifulSoup(raw_data, 'html.parser')

I have been trying data = soup.findAll("a", {"href": "/info/burton.com+(by+organic)"}) and data = soup.findAll("span", {"class": "sem-report-counter"}), but with no luck.

I can see the numbers I want on the web page. Is there a way to extract that information? I can't find it in the HTML I pull down.
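(For context: the find_all selectors above are written correctly, and they do match static HTML containing that markup. The catch is that on the live page those numbers are most likely rendered by JavaScript after page load, so they never appear in the HTML urllib fetches. A minimal offline sketch, with a made-up snippet imitating the markup the selectors target:)

```python
from bs4 import BeautifulSoup

# Hypothetical static snippet imitating the markup on the semrush page;
# the link text and the counter value are made up for illustration.
html = '''
<a href="/info/burton.com+(by+organic)">Organic Search Traffic</a>
<span class="sem-report-counter">1,234</span>
'''
soup = BeautifulSoup(html, 'html.parser')

links = soup.find_all('a', {'href': '/info/burton.com+(by+organic)'})
counters = soup.find_all('span', {'class': 'sem-report-counter'})

print(links[0].get_text())     # -> Organic Search Traffic
print(counters[0].get_text())  # -> 1,234
```

Running the same selectors against the HTML actually returned by urllib.request would come back empty, which is the symptom described above.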

1 Answer:

Answer 0 (score: 1)

I put in some more effort and built a working example of how to scrape that page using selenium. Install selenium and give it a try!


from selenium import webdriver
from selenium.webdriver.chrome.options import Options

url = 'https://www.semrush.com/info/burton.com'  # your url
options = Options()  # set up options
options.add_argument('--headless')  # run Chrome in headless mode
driver = webdriver.Chrome(executable_path='/opt/ChromeDriver/chromedriver',
                          chrome_options=options)

# note: executable_path depends on where your chromedriver binary is located
# (on Selenium 4+, executable_path and chrome_options are deprecated in
# favour of the service= and options= keyword arguments)

driver.get(url)  # load the page
driver.implicitly_wait(1)  # wait for content to load
elements = driver.find_elements_by_xpath('//a[@href="/info/burton.com+(by+organic)"]')  # grab the links you wanted

for e in elements:
    print(e.get_attribute('text').strip())  # print their text

driver.quit()  # close the driver when you're done
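As a side note, the XPath used above can be sanity-checked offline against a small fragment with the standard library's ElementTree, which supports the same simple attribute-predicate syntax. A minimal sketch (the fragment and its link texts are made up for illustration):

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment imitating the anchors the XPath targets.
doc = ET.fromstring(
    '<div>'
    '<a href="/info/burton.com+(by+organic)">Organic Search Traffic</a>'
    '<a href="/info/burton.com+(by+paid)">Paid Search Traffic</a>'
    '</div>'
)

# Same attribute predicate as in the selenium call above.
matches = doc.findall(".//a[@href='/info/burton.com+(by+organic)']")
for a in matches:
    print(a.text)  # -> Organic Search Traffic
```

This is only for checking the expression itself; against the real page you still need selenium, because the content is not in the static HTML.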