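I'm new to programming, so this may be a silly question. I want to learn web scraping, so I learned BeautifulSoup to do it. It worked for a few websites, but I got stuck on the next page: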
from bs4 import BeautifulSoup
import requests
r = requests.get("http://www.dlb.today/result/en")
data = r.text
soup = BeautifulSoup(data, "lxml")
data = soup.find_all("tbody", {"id": "pageData1"})
data2 = soup.find_all("ul", {"class": "res_allnumber"})
print(data)
print(data2)
# No point going further if I can't get the raw data, I think
This works fine (a similar site that I scraped):
r2 = requests.get("http://www.nlb.lk/results-more.php?id=1")
data2 = r2.text
soup2 = BeautifulSoup(data2, "lxml")
news2 = soup2.find_all("a", {"class": "lottery-numbers"})
# print(news2)  # (get the raw HTML for checking)
for draw_number in news2:
    print(draw_number.contents[0])
I can't scrape the table I want. So I tried lxml instead... still no luck:
#lxml
import requests
r = requests.get("http://www.dlb.today/result/en")
data = r.text
#print data
import lxml.html as LH
content = data
root = LH.fromstring(content)
for tag1 in root.xpath('//tbody[@class="pageData1"]//li'):
    print(tag1.text_content())
I don't know where my mistake is or what to do next. If someone could point me in the right direction, I'd appreciate it!
Answer 0 (score: 1)
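I tried to reproduce your use case. It seems the data has not yet been loaded into the page at the moment the Python code makes its request. As a result, the "tbody" element is empty.
I confirmed this by saving the downloaded HTML to a file: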
with open('sample.html', 'w') as fh:
    fh.write(data)
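Instead of opening the saved file by hand, the same check can be done in code. This is only a minimal sketch: the helper name `tbody_is_empty` is made up for illustration, and the id `pageData1` is taken from the question's own selector.

```python
from bs4 import BeautifulSoup


def tbody_is_empty(html, tbody_id="pageData1"):
    """Return True if the tbody is missing or contains no child tags.

    `tbody_is_empty` is a hypothetical helper; the id comes from the question.
    """
    soup = BeautifulSoup(html, "html.parser")
    tbody = soup.find("tbody", {"id": tbody_id})
    if tbody is None:
        return True
    # find(True) returns the first tag inside the tbody, or None when the
    # tbody holds only whitespace or plain text.
    return tbody.find(True) is None
```

Calling `tbody_is_empty(r.text)` on the response from the question should return True if the rows really are filled in later by JavaScript.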
A few solutions to this problem are mentioned online:
Use a Python library called dryscrape. For details, see "Web-scraping JavaScript page with Python".
Use Selenium:
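from selenium import webdriver
import time

# Start Firefox through the geckodriver executable
driver = webdriver.Firefox(executable_path='geckodriver.exe')
driver.get("http://www.dlb.today/result/en")
# Give the page's JavaScript a few seconds to load the results
time.sleep(5)
htmlSource = driver.page_source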
Download geckodriver from here. You can then use htmlSource as the input to BeautifulSoup.
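Handing the rendered source to BeautifulSoup could look roughly like this. It is a sketch, not tested against the live site: the function name `rows_from_page_source` is hypothetical, and it assumes the rendered page contains the `ul` elements with class `res_allnumber` that the question is looking for.

```python
from bs4 import BeautifulSoup


def rows_from_page_source(html_source):
    """Return the text of each <ul class="res_allnumber">, one string per row.

    Hypothetical helper; html_source is the string from driver.page_source.
    """
    soup = BeautifulSoup(html_source, "html.parser")
    return [
        # Join the text of the <li> children with single spaces
        ul.get_text(" ", strip=True)
        for ul in soup.find_all("ul", {"class": "res_allnumber"})
    ]
```

With the Selenium snippet above you would call `rows_from_page_source(htmlSource)` after the page has finished loading.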
Answer 1 (score: 1)
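JavaScript is involved in loading the data shown on this page. Fortunately, the JavaScript loads another HTML page from this URL:
http://www.dlb.today/result/pagination_re
You can access this URL directly with a POST request: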
import requests
from bs4 import BeautifulSoup
url = "http://www.dlb.today/result/pagination_re"
data = {"pageId": "0", "resultID": "1001", "lotteryID": "1", "lastsegment": "en"}
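# Send the form fields as the POST body
page = requests.post(url, data=data)
soup = BeautifulSoup(page.content, 'html.parser')
# Use a separate loop variable so the "data" dict is not overwritten
for ul in soup.find_all("ul", {"class": "res_allnumber"}):
    print(ul)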
You may need to experiment with the values in "data" to get what you want!
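One way to experiment is to wrap the request in a small function and change one field at a time. This is only a sketch under assumptions: the names `build_payload` and `fetch_rows` are made up, and I am guessing that `pageId` selects which page of results comes back, which the site does not document. The other fields are the values from the request above.

```python
import requests
from bs4 import BeautifulSoup

URL = "http://www.dlb.today/result/pagination_re"


def build_payload(page_id, result_id="1001", lottery_id="1", lastsegment="en"):
    """Build the form fields for one request (hypothetical helper).

    Assumption: pageId picks the result page; the defaults are the values
    used in the answer's request.
    """
    return {
        "pageId": str(page_id),
        "resultID": result_id,
        "lotteryID": lottery_id,
        "lastsegment": lastsegment,
    }


def fetch_rows(page_id, session=None):
    """POST one payload and return the matching <ul> elements."""
    http = session or requests
    response = http.post(URL, data=build_payload(page_id), timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.content, "html.parser")
    return soup.find_all("ul", {"class": "res_allnumber"})
```

Calling `fetch_rows(0)`, `fetch_rows(1)` and so on and comparing the results would show whether the guess about `pageId` holds. The listing that follows in this answer is the output of the original request with pageId "0", not of this sketch.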
The output is:
<ul class="res_allnumber"><li class="res_number">04</li><li class="res_number">30</li><li class="res_number">44</li><li class="res_number">56</li><li class="res_number" style="background-color: #971B7E; color: #fff;">29</li><li class="res_eng_letter">V</li></ul>
<ul class="res_allnumber"><li class="res_number">15</li><li class="res_number">41</li><li class="res_number">43</li><li class="res_number">47</li><li class="res_number" style="background-color: #016B21; color: #fff;">69</li><li class="res_eng_letter">Z</li></ul>
<ul class="res_allnumber"><li class="res_number">09</li><li class="res_number">13</li><li class="res_number">17</li><li class="res_number">48</li><li class="res_number" style="background-color: #267FFF; color: #fff;">73</li><li class="res_eng_letter">D</li></ul>
<ul class="res_allnumber"><li class="res_number">31</li><li class="res_number">41</li><li class="res_number">43</li><li class="res_number">55</li><li class="res_number" style="background-color: #971B7E; color: #fff;">52</li><li class="res_eng_letter">U</li></ul>
<ul class="res_allnumber"><li class="res_number">03</li><li class="res_number">09</li><li class="res_number">19</li><li class="res_number">73</li><li class="res_number" style="background-color: #016B21; color: #fff;">67</li><li class="res_eng_letter">E</li></ul>
<ul class="res_allnumber"><li class="res_number">17</li><li class="res_number">22</li><li class="res_number">35</li><li class="res_number">39</li><li class="res_number" style="background-color: #267FFF; color: #fff;">59</li><li class="res_eng_letter">Z</li></ul>
<ul class="res_allnumber"><li class="res_number">08</li><li class="res_number">15</li><li class="res_number">30</li><li class="res_number">55</li><li class="res_number" style="background-color: #971B7E; color: #fff;">71</li><li class="res_eng_letter">I</li></ul>
<ul class="res_allnumber"><li class="res_number">11</li><li class="res_number">16</li><li class="res_number">50</li><li class="res_number">57</li><li class="res_number" style="background-color: #016B21; color: #fff;">75</li><li class="res_eng_letter">Q</li></ul>
<ul class="res_allnumber"><li class="res_number">27</li><li class="res_number">30</li><li class="res_number">43</li><li class="res_number">71</li><li class="res_number" style="background-color: #267FFF; color: #fff;">63</li><li class="res_eng_letter">E</li></ul>
<ul class="res_allnumber"><li class="res_number">19</li><li class="res_number">20</li><li class="res_number">31</li><li class="res_number">43</li><li class="res_number" style="background-color: #971B7E; color: #fff;">61</li><li class="res_eng_letter">I</li></ul>
<ul class="res_allnumber"><li class="res_number">24</li><li class="res_number">41</li><li class="res_number">47</li><li class="res_number">72</li><li class="res_number" style="background-color: #016B21; color: #fff;">32</li><li class="res_eng_letter">K</li></ul>
<ul class="res_allnumber"><li class="res_number">13</li><li class="res_number">51</li><li class="res_number">61</li><li class="res_number">65</li><li class="res_number" style="background-color: #267FFF; color: #fff;">48</li><li class="res_eng_letter">E</li></ul>
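If you want plain values rather than raw tags, each `ul` can be reduced to its numbers and letter. A minimal sketch, assuming the markup always looks like the output above (`res_number` items holding integers and one `res_eng_letter` item); `parse_draws` is a name I made up for illustration.

```python
from bs4 import BeautifulSoup


def parse_draws(html):
    """Turn each <ul class="res_allnumber"> into a dict (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    draws = []
    for ul in soup.find_all("ul", {"class": "res_allnumber"}):
        # Every <li class="res_number"> holds one number, e.g. "04"
        numbers = [
            int(li.get_text(strip=True))
            for li in ul.find_all("li", {"class": "res_number"})
        ]
        # The letter item may be absent, so guard against None
        letter_tag = ul.find("li", {"class": "res_eng_letter"})
        letter = letter_tag.get_text(strip=True) if letter_tag else None
        draws.append({"numbers": numbers, "letter": letter})
    return draws
```

Passing `page.content` from the request above to `parse_draws` should give one dict per row of the output.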