Can't scrape a web page with BeautifulSoup or lxml

Date: 2017-08-07 18:27:24

Tags: python beautifulsoup lxml

I'm very new to programming, so this may be a silly question. I want to learn web scraping, so I learned BeautifulSoup to do it... It worked for a few sites, but I got stuck on this one:

from bs4 import BeautifulSoup
import requests

r  = requests.get("http://www.dlb.today/result/en")
data = r.text
soup = BeautifulSoup(data, "lxml")

# both lookups come back empty on this page
data = soup.find_all("tbody", {"id": "pageData1"})
data2 = soup.find_all("ul", {"class": "res_allnumber"})
print(data)
print(data2)
# no point going further if I can't get the raw data, I think

This works fine (a similar site I have scraped before):

r2 = requests.get("http://www.nlb.lk/results-more.php?id=1")
data2 = r2.text
soup2 = BeautifulSoup(data2, "lxml")
news2 = soup2.find_all("a", {"class": "lottery-numbers"})
# print(news2)  # get the raw HTML for checking
for draw_number in news2:
    print(draw_number.contents[0])

I couldn't scrape the table I want, so I tried lxml instead... still no luck.

# lxml
import requests
import lxml.html as LH

r = requests.get("http://www.dlb.today/result/en")
data = r.text
# print(data)

root = LH.fromstring(data)
for tag1 in root.xpath('//tbody[@class="pageData1"]//li'):
    print(tag1.text_content())

I don't know where my mistake is or what to do next... If anyone can point me in the right direction, I would appreciate it!

2 Answers:

Answer 0 (score: 1)

I tried to reproduce your use case. It seems the data has not yet been loaded into the page at the moment the Python code makes its request. As a result, the "tbody" and its contents are empty.

I confirmed this by saving the downloaded HTML to a file:

# save the fetched HTML to disk so it can be inspected in a browser/editor
fh = open('sample.html', 'w')
fh.write(data)
fh.close()

A few solutions to this problem are mentioned online:

  1. Use a Python library called dryscrape. See Web-scraping JavaScript page with Python for details (a minimal sketch follows this list).

  2. Use Selenium:

    from selenium import webdriver
    import time
    driver = webdriver.Firefox(executable_path='geckodriver.exe')
    driver.get("http://www.dlb.today/result/en")
    time.sleep(5)  # give the JavaScript time to load the data
    htmlSource = driver.page_source


    Download geckodriver from here. You can then use htmlSource as the input to BeautifulSoup (see the sketch below).
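
For option 1, a minimal sketch of the dryscrape route, assuming dryscrape (and its webkit-server backend, which generally requires Linux) is installed; the URL and selector are the ones from the question:

import dryscrape
from bs4 import BeautifulSoup

# render the page in a headless WebKit session so its JavaScript runs
session = dryscrape.Session()
session.visit("http://www.dlb.today/result/en")

# session.body() returns the rendered HTML, which BeautifulSoup can parse
soup = BeautifulSoup(session.body(), "lxml")
print(soup.find_all("ul", {"class": "res_allnumber"}))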
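
For option 2, feeding htmlSource into BeautifulSoup could then look like the following sketch; it only combines the Selenium snippet above with the selector from the question:

from bs4 import BeautifulSoup
from selenium import webdriver
import time

driver = webdriver.Firefox(executable_path='geckodriver.exe')
driver.get("http://www.dlb.today/result/en")
time.sleep(5)  # let the JavaScript populate the table
htmlSource = driver.page_source
driver.quit()

# parse the rendered source instead of the raw requests response
soup = BeautifulSoup(htmlSource, "lxml")
for row in soup.find_all("ul", {"class": "res_allnumber"}):
    print(row.text)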

Answer 1 (score: 1)

JavaScript is involved in loading the data shown on this page. Fortunately, the JavaScript loads another HTML page from the URL

http://www.dlb.today/result/pagination_re

You can access this URL directly with a POST request:

import requests
from bs4 import BeautifulSoup

url = "http://www.dlb.today/result/pagination_re"
data = {"pageId": "0", "resultID": "1001", "lotteryID": "1", "lastsegment": "en"}
page = requests.post(url, data)
soup = BeautifulSoup(page.content,'html.parser')
for data in soup.find_all("ul", {"class": "res_allnumber"}):
    print (data)

You may need to experiment with the values in data to get the content you want (see the sketch after the sample output below)!

The output is:

<ul class="res_allnumber"><li class="res_number">04</li><li class="res_number">30</li><li class="res_number">44</li><li class="res_number">56</li><li class="res_number" style="background-color: #971B7E; color: #fff;">29</li><li class="res_eng_letter">V</li></ul>
<ul class="res_allnumber"><li class="res_number">15</li><li class="res_number">41</li><li class="res_number">43</li><li class="res_number">47</li><li class="res_number" style="background-color: #016B21; color: #fff;">69</li><li class="res_eng_letter">Z</li></ul>
<ul class="res_allnumber"><li class="res_number">09</li><li class="res_number">13</li><li class="res_number">17</li><li class="res_number">48</li><li class="res_number" style="background-color: #267FFF; color: #fff;">73</li><li class="res_eng_letter">D</li></ul>
<ul class="res_allnumber"><li class="res_number">31</li><li class="res_number">41</li><li class="res_number">43</li><li class="res_number">55</li><li class="res_number" style="background-color: #971B7E; color: #fff;">52</li><li class="res_eng_letter">U</li></ul>
<ul class="res_allnumber"><li class="res_number">03</li><li class="res_number">09</li><li class="res_number">19</li><li class="res_number">73</li><li class="res_number" style="background-color: #016B21; color: #fff;">67</li><li class="res_eng_letter">E</li></ul>
<ul class="res_allnumber"><li class="res_number">17</li><li class="res_number">22</li><li class="res_number">35</li><li class="res_number">39</li><li class="res_number" style="background-color: #267FFF; color: #fff;">59</li><li class="res_eng_letter">Z</li></ul>
<ul class="res_allnumber"><li class="res_number">08</li><li class="res_number">15</li><li class="res_number">30</li><li class="res_number">55</li><li class="res_number" style="background-color: #971B7E; color: #fff;">71</li><li class="res_eng_letter">I</li></ul>
<ul class="res_allnumber"><li class="res_number">11</li><li class="res_number">16</li><li class="res_number">50</li><li class="res_number">57</li><li class="res_number" style="background-color: #016B21; color: #fff;">75</li><li class="res_eng_letter">Q</li></ul>
<ul class="res_allnumber"><li class="res_number">27</li><li class="res_number">30</li><li class="res_number">43</li><li class="res_number">71</li><li class="res_number" style="background-color: #267FFF; color: #fff;">63</li><li class="res_eng_letter">E</li></ul>
<ul class="res_allnumber"><li class="res_number">19</li><li class="res_number">20</li><li class="res_number">31</li><li class="res_number">43</li><li class="res_number" style="background-color: #971B7E; color: #fff;">61</li><li class="res_eng_letter">I</li></ul>
<ul class="res_allnumber"><li class="res_number">24</li><li class="res_number">41</li><li class="res_number">47</li><li class="res_number">72</li><li class="res_number" style="background-color: #016B21; color: #fff;">32</li><li class="res_eng_letter">K</li></ul>
<ul class="res_allnumber"><li class="res_number">13</li><li class="res_number">51</li><li class="res_number">61</li><li class="res_number">65</li><li class="res_number" style="background-color: #267FFF; color: #fff;">48</li><li class="res_eng_letter">E</li></ul>