I am trying to scrape website traffic figures from semrush.com.
My current code, using BeautifulSoup, is:
from bs4 import BeautifulSoup
import urllib.request  # a bare `import urllib` does not load the request submodule

# Send a browser-like User-Agent header with the request
req = urllib.request.Request('https://www.semrush.com/info/burton.com', headers={'User-Agent': 'Magic Browser'})
response = urllib.request.urlopen(req)
raw_data = response.read()
response.close()
soup = BeautifulSoup(raw_data, 'html.parser')  # name the parser explicitly
I have been trying data = soup.findAll("a", {"href":"/info/burton.com+(by+organic)"})
or data = soup.findAll("span", {"class":"sem-report-counter"})
, but with no luck.
I can see the number I want on the web page. Is there a way to extract this information? I cannot find it in the HTML I pull down.
Answer 0 (score: 1)
I put in a bit more effort and built a working example of how to use selenium to scrape that page. Install selenium and give it a try!
The script below prints the text of each matching link to the terminal:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Selenium 4 API; Selenium 3 used executable_path=, chrome_options= and find_elements_by_xpath()
url = 'https://www.semrush.com/info/burton.com'  # your url
options = Options()  # set up options
options.add_argument('--headless')  # run Chrome without a visible window
# note: the chromedriver path depends on where chromedriver is installed on your machine
driver = webdriver.Chrome(service=Service('/opt/ChromeDriver/chromedriver'), options=options)
driver.get(url)  # load the page
driver.implicitly_wait(1)  # let element lookups poll for up to 1 second
elements = driver.find_elements(By.XPATH, '//a[@href="/info/burton.com+(by+organic)"]')  # the link from the question
for e in elements:
    print(e.get_attribute('text').strip())  # print the link text
driver.quit()  # close the driver when you're done
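
If the list comes back empty, the content may simply not have loaded yet. implicitly_wait only sets how long element lookups keep polling, so an explicit wait on the link itself may be more reliable. Below is a rough sketch of that idea, not a tested recipe for this site: the helper names organic_link_xpath and scrape_organic_text and the 10-second default timeout are my own assumptions, and the XPath pattern is just the one from the question with the domain substituted in.

```python
def organic_link_xpath(domain):
    """Build the XPath of the '(by organic)' link for a domain.

    The href pattern is copied from the question; it is an assumption
    that other domains follow the same pattern.
    """
    return '//a[@href="/info/{}+(by+organic)"]'.format(domain)


def scrape_organic_text(driver, domain, timeout=10):
    """Load the info page and return the stripped text of every matching link."""
    # Imported here so organic_link_xpath stays usable without selenium installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver.get('https://www.semrush.com/info/' + domain)
    locator = (By.XPATH, organic_link_xpath(domain))
    # Polls for up to `timeout` seconds and raises TimeoutException
    # if no matching element ever shows up in the DOM.
    elements = WebDriverWait(driver, timeout).until(
        EC.presence_of_all_elements_located(locator)
    )
    # get_attribute can return None, so fall back to an empty string.
    return [(e.get_attribute('text') or '').strip() for e in elements]
```

With a driver created as in the script above, you would call scrape_organic_text(driver, 'burton.com') and then driver.quit() once you are done.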