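I'm trying to retrieve values from a website, but I don't get any of the values between the tags (except for the one labelled Avg Played). I've tried both Scrapy and Beautiful Soup, to no avail! Here is my BeautifulSoup / urllib2 code: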
import urllib2
from bs4 import BeautifulSoup
site = "http://www.lolking.net/champions/singed?#/overview"
request = urllib2.Request(site, headers={'User-Agent': 'Chrome/44.0.2403.107'})
response = urllib2.urlopen(request)
html = response.read()
soup = BeautifulSoup(html, 'lxml')
champ_stats = soup.findAll('div', attrs={"class": "champ-stats"})
champ_stats2 = soup.findAll('strong', attrs={"class": "champ-stats"})
for x in champ_stats:
    print x.text, x
print '\n now showing more specifically: \n'
for x in champ_stats2:
    print x.text, x
I also built a scraper with Scrapy (and got the same result):
import scrapy
class StatsSpider(scrapy.Spider):
    name = "stat_spider"
    start_urls = ["http://www.lolking.net/champions/singed?#/overview"]
    def parse(self, response):
        selector = '.champ-stats'
        for stats in response.css(selector):
            stat_selector = 'strong ::text'
            name_selector = 'span ::text'
            yield {
                'stat': stats.css(stat_selector).extract_first(),
                'name': stats.css(name_selector).extract_first()
            }
This is what the HTML looks like in the browser (the content I want to retrieve):
html = """ <div class="champ-stats">
<strong id="winrate">48.3</strong><small>%</small>
<span>Win Rate</span>
</div>
<div class="divider"></div>
<div class="champ-stats">
<strong id="popularity">0.8</strong><small>%</small>
<span>Popularity</span>
</div>
<div class="divider"></div>
<div class="champ-stats">
<strong id="banrate">0.5</strong><small>%</small>
<span>Ban Rate</span>
</div>
<div class="divider"></div>
<div class="champ-stats">
<strong>10.2</strong>
<span>Avg Played</span>
</div>
</div> """
I'm guessing the site has some way of preventing people from scraping this data? If so, is there a way around it?
Answer 0 (score: 1)
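You're better off using the requests module rather than urllib2; it is simpler and easier to use. I should mention, though, that BeautifulSoup alone may not be enough to scrape this page completely, depending on what you want. You may need to resort to Selenium or Scrapy.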
>>> import requests
>>> page = requests.get('http://www.lolking.net/champions/singed?#/overview').content
>>> import bs4
>>> soup = bs4.BeautifulSoup(page, 'lxml')
>>> champ_stats = soup.findAll('div', attrs={"class" : "champ-stats"})
>>> for x in champ_stats:
...     x.text, x
...
('\n%\nWin Rate\n', <div class="champ-stats">
<strong id="winrate"></strong><small>%</small>
<span>Win Rate</span>
</div>)
('\n%\nPopularity\n', <div class="champ-stats">
<strong id="popularity"></strong><small>%</small>
<span>Popularity</span>
</div>)
('\n%\nBan Rate\n', <div class="champ-stats">
<strong id="banrate"></strong><small>%</small>
<span>Ban Rate</span>
</div>)
('\n10.2\nAvg Played\n', <div class="champ-stats">
<strong>10.2</strong>
<span>Avg Played</span>
</div>)
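The output above hints at why the question's code comes back empty: in the HTML the server sends, the `<strong>` elements that carry an id have no text at all. That most likely means the numbers are filled in by JavaScript after the page loads, though this is an inference from the output rather than something confirmed by the site. Here is a minimal Python 3 sketch of that check using only the standard library. `StrongCollector` and `empty_strong_ids` are hypothetical names made up for illustration, and `SERVED_HTML` is copied from the output above.

```python
from html.parser import HTMLParser

# Static HTML as returned by the server, copied from the output above.
SERVED_HTML = """
<div class="champ-stats">
<strong id="winrate"></strong><small>%</small>
<span>Win Rate</span>
</div>
<div class="champ-stats">
<strong id="popularity"></strong><small>%</small>
<span>Popularity</span>
</div>
<div class="champ-stats">
<strong id="banrate"></strong><small>%</small>
<span>Ban Rate</span>
</div>
<div class="champ-stats">
<strong>10.2</strong>
<span>Avg Played</span>
</div>
"""


class StrongCollector(HTMLParser):
    """Collect an (id, text) pair for every <strong> element."""

    def __init__(self):
        super().__init__()
        self.results = []     # finished (id or None, text) pairs
        self._current = None  # [id, text parts] while inside a <strong>

    def handle_starttag(self, tag, attrs):
        if tag == "strong":
            self._current = [dict(attrs).get("id"), []]

    def handle_data(self, data):
        # Only keep text that appears inside an open <strong>.
        if self._current is not None:
            self._current[1].append(data)

    def handle_endtag(self, tag):
        if tag == "strong" and self._current is not None:
            elem_id, parts = self._current
            self.results.append((elem_id, "".join(parts).strip()))
            self._current = None


def empty_strong_ids(html):
    """Return the ids of <strong> elements that have no text content."""
    parser = StrongCollector()
    parser.feed(html)
    parser.close()
    return [elem_id for elem_id, text in parser.results if not text]


print(empty_strong_ids(SERVED_HTML))
```

If a value you can see in the browser turns up in this list, no static parser will find it in the raw response; you would need something that runs the page's scripts, or the underlying data source if the site exposes one.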
Edit:
I'm not sure this is entirely what you're after. If I understand correctly, you can use Selenium to scrape these values.
>>> from selenium import webdriver
>>> driver = webdriver.Chrome()
>>> driver.get('http://www.lolking.net/champions/singed?#/overview')
>>> for item in driver.find_elements_by_xpath('.//div[@class="champ-stats"]/strong'):
...     item.text
...
'48.4'
'0.8'
'0.4'
'10.2'
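If the values are indeed filled in by scripts after the page loads, it can help to wait for them explicitly before reading the text. Note also that `find_elements_by_xpath` was removed in Selenium 4.3 in favour of `find_elements(By.XPATH, ...)`. Below is a rough Python 3 sketch under those assumptions. `fetch_champ_stats`, `parse_stats` and the 10-second timeout are illustrative choices of mine, not part of the session above.

```python
def parse_stats(texts):
    """Convert the scraped strings to floats, skipping blank entries."""
    return [float(t) for t in texts if t.strip()]


def fetch_champ_stats(url, timeout=10):
    """Open the page in Chrome, wait for the stats to render, return them."""
    # Imported here so parse_stats stays usable without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    xpath = './/div[@class="champ-stats"]/strong'
    driver = webdriver.Chrome()
    try:
        driver.get(url)

        def stats_rendered(drv):
            items = drv.find_elements(By.XPATH, xpath)
            # Truthy only once every matched <strong> has some text in it.
            if items and all(item.text.strip() for item in items):
                return [item.text for item in items]
            return False

        # until() keeps polling and returns the first truthy result.
        texts = WebDriverWait(driver, timeout).until(stats_rendered)
        return parse_stats(texts)
    finally:
        # Always close the browser, even if the wait times out.
        driver.quit()
```

Calling `fetch_champ_stats('http://www.lolking.net/champions/singed?#/overview')` needs Chrome and a matching driver available locally, and raises Selenium's `TimeoutException` if the stats never appear within the timeout.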