Extracting data from HTML with BeautifulSoup

Time: 2015-02-12 10:43:15

Tags: python xml beautifulsoup

I am trying to extract the lotto numbers from https://www.lotto.de/de/ergebnisse/lotto-6aus49/archiv.html (I know there is an easier way, but this is more for learning).

These are the numbers I want to extract:

I tried the following in Python:

from BeautifulSoup import BeautifulSoup
import urllib2

url="https://www.lotto.de/de/ergebnisse/lotto-6aus49/archiv.html"
page=urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
numbers=soup.findAll('li',{'class':'winning_numbers.boxRow.clearfix'})

for number in numbers:
    print number['li']+","+number.string

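This gives nothing, which is actually what I expected. I have read the tutorial but still don't fully understand the parsing. Could someone give me a hint?

Thank you!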

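1 Answer:

Answer 0 (score: 3)

Because the content is generated dynamically, one of the easier solutions is to use something like Selenium to simulate a browser (I'm using PhantomJS as the webdriver), as follows: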

from BeautifulSoup import BeautifulSoup
from selenium import webdriver

url="https://www.lotto.de/de/ergebnisse/lotto-6aus49/archiv.html"
# I'm using PhantomJS, you may use your own...
driver = webdriver.PhantomJS(executable_path='/usr/local/bin/phantomjs')
driver.get(url)
soup = BeautifulSoup(driver.page_source)
# Walk every div with class 'winning_numbers' and collect the tokens that are
# purely numeric; this leaves out the special number, as in the sample
for div in soup.findAll('div', {'class': 'winning_numbers'}):
    n = ','.join(token for token in div.text.split() if token.isdigit())
    if n:
        print 'number: {}'.format(n)

number: 6,25,26,27,28,47

To grab the special number as well:

for div in soup.findAll('div', {'class': 'winning_numbers'}):
    # keep only the digit characters of each token; you may apply your own logic here
    n = ','.join(''.join(ch for ch in token if ch.isdigit()) for token in div.text.split())
    if n:
        print 'number: {}'.format(n)

number: 6,25,26,27,28,47,5 # with special number
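
The difference between the two loops is only in how each whitespace-separated token is filtered, which can be tried without Selenium or the live page. The following is a minimal sketch under stated assumptions: `sample_text` is a hypothetical stand-in for the text of one 'winning_numbers' div (the real markup may look different), the helper names `plain_numbers` and `numbers_with_special` are made up for illustration, and it is written for Python 3 rather than the Python 2 used above. Unlike the second loop above, the second helper also drops tokens that contain no digits at all.

```python
# Hypothetical text of one 'winning_numbers' div; the real page may differ.
sample_text = "6 25 26 27 28 47 (5)"


def plain_numbers(text):
    """Keep only the tokens that consist entirely of digits."""
    return ','.join(token for token in text.split() if token.isdigit())


def numbers_with_special(text):
    """Strip non-digit characters from each token, then drop empty results."""
    digits = (''.join(ch for ch in token if ch.isdigit())
              for token in text.split())
    return ','.join(d for d in digits if d)


print('number: {}'.format(plain_numbers(sample_text)))
print('number: {}'.format(numbers_with_special(sample_text)))
```

With this sample, the first helper skips the "(5)" token because `str.isdigit()` is false for any string that contains a non-digit character, while the second helper reduces it to "5" and keeps it.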

Hope this helps.