Python: reassembling an XML string

Time: 2015-02-12 16:53:46

Tags: python xml selenium beautifulsoup

I have the following code for extracting lotto numbers from the web:

from BeautifulSoup import BeautifulSoup

from selenium import webdriver

lottonumbers=[]

url="https://www.lotto.de/de/ergebnisse/lotto-6aus49/archiv.html"
driver = webdriver.Firefox()
driver.get(url)
soup = BeautifulSoup(driver.page_source)

for ul in soup.findAll("div", {"class": "winning_numbers boxRow clearfix"}):
    n = ','.join(''.join(_ for _ in li if _.isdigit()) for li in ul.text.split())
    if n:
        print format(n)

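Returns: 625262728475

Expected: 6,25,26,27,28,47,5

The commas are missing. Ideally, each number should be stored in the list lottonumbers. Can anyone help?

2 answers:

Answer 0: (score: 0)

You could initialize lottonumbers as an empty list and append each n inside a nested for loop, like this: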

# ... previous code ...
lottonumbers = []
for ul in soup.findAll("div", {"class": "winning_numbers boxRow clearfix"}):
    for li in ul.text.split():
        n = ''.join(_ for _ in li if _.isdigit())
        if n:
            lottonumbers.append(int(n))
print lottonumbers
# Output: [6, 25, 26, 27, 28, 47, 5]
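
The loop above only separates the numbers if the extracted text contains whitespace between them. The following is a minimal sketch of the same logic on plain strings, written for Python 3; the function name `numbers_from_text` and both sample strings are made up for illustration and do not come from the page itself.

```python
def numbers_from_text(text):
    """Split on whitespace, keep only the digits of each token, convert to int."""
    numbers = []
    for token in text.split():
        digits = ''.join(ch for ch in token if ch.isdigit())
        if digits:
            numbers.append(int(digits))
    return numbers


# Hypothetical inputs: the same draw with and without whitespace between numbers.
spaced = "6 25 26 27 28 47 5"
unspaced = "625262728475"

print(numbers_from_text(spaced))    # [6, 25, 26, 27, 28, 47, 5]
print(numbers_from_text(unspaced))  # [625262728475]
```

When the text arrives as one unbroken run of digits, there is only a single token, so nothing is left to separate. That matches the 625262728475 shown in the question.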

Answer 1: (score: 0)

In the meantime, I got it working with a hack (a regular expression).

from BeautifulSoup import BeautifulSoup
from selenium import webdriver
import re

url="https://www.lotto.de/de/ergebnisse/lotto-6aus49/archiv.html"
driver = webdriver.PhantomJS(executable_path="C://Users//Royskatt//Downloads//phantomjs-2.0.0-windows//bin//phantomjs.exe")
#driver = webdriver.Firefox()
driver.get(url)
soup = BeautifulSoup(driver.page_source)

lottonumbers = []

for ul in soup.findAll("div", {"class": "winning_numbers boxRow clearfix"}):
    for i in re.findall(r'(?<=zahl\([1-6]\)">)\d{1,2}|(?<="last">)\d', str(ul)):
        lottonumbers.append(i)

print lottonumbers
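
To see what the pattern does without launching a browser, here is a small sketch for Python 3 that runs the same regular expression against a hand-written string. The markup in `SAMPLE_HTML` is hypothetical: it is not copied from lotto.de and was shaped only so that the two lookbehinds (`zahl(N)">` and `"last">`) have something to match.

```python
import re

# Same pattern as in the answer: one or two digits preceded by zahl(1..6)">,
# or a single digit preceded by "last">.
PATTERN = re.compile(r'(?<=zahl\([1-6]\)">)\d{1,2}|(?<="last">)\d')

# Hypothetical markup, written on one line with no whitespace between elements.
SAMPLE_HTML = (
    '<div class="winning_numbers boxRow clearfix">'
    '<span class="zahl(1)">6</span>'
    '<span class="zahl(2)">25</span>'
    '<span class="zahl(3)">26</span>'
    '<span class="zahl(4)">27</span>'
    '<span class="zahl(5)">28</span>'
    '<span class="zahl(6)">47</span>'
    '<span class="last">5</span>'
    '</div>'
)

matches = PATTERN.findall(SAMPLE_HTML)   # list of strings
lottonumbers = [int(m) for m in matches]

print(matches)       # ['6', '25', '26', '27', '28', '47', '5']
print(lottonumbers)  # [6, 25, 26, 27, 28, 47, 5]
```

Because the pattern keys on the exact characters that precede each number, any change in how the browser serializes those attributes can make a match disappear.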

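I found that using different web drivers leads to subtle differences in the dynamically generated HTML. For example, with PhantomJS I get the following result:
['6', '25', '26', '27', '28', '47', '5']
whereas webdriver.Firefox gives me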

['6', '25', '26', '27', '28', '47']

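The different output we get from the same code may be caused by different versions of the webdrivers. In my case, for example, the relevant lotto numbers are written on one very long line with no line breaks.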
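
One way to become less sensitive to such layout differences is to read each text node separately instead of splitting the combined text on whitespace. The sketch below does that with the Python 3 standard library; it is only an illustration under assumptions, not the method used in either answer. The class `NumberCollector`, the helper `collect_numbers`, and the simplified `<span>` markup are all hypothetical.

```python
from html.parser import HTMLParser


class NumberCollector(HTMLParser):
    """Collect every text node that consists only of digits as an int."""

    def __init__(self):
        super().__init__()
        self.numbers = []

    def handle_data(self, data):
        token = data.strip()
        if token.isdigit():
            self.numbers.append(int(token))


def collect_numbers(html):
    """Parse an HTML fragment and return the digit-only text nodes in order."""
    parser = NumberCollector()
    parser.feed(html)
    parser.close()
    return parser.numbers


# Hypothetical markup: the same draw serialized on one line and on several lines.
DRAW = [6, 25, 26, 27, 28, 47, 5]
ONE_LINE = (
    '<div class="winning_numbers">'
    + ''.join('<span>%d</span>' % n for n in DRAW)
    + '</div>'
)
MULTI_LINE = ONE_LINE.replace('</span>', '</span>\n  ')

print(collect_numbers(ONE_LINE))    # [6, 25, 26, 27, 28, 47, 5]
print(collect_numbers(MULTI_LINE))  # [6, 25, 26, 27, 28, 47, 5]
```

Both serializations give the same list here because the element boundaries, not the whitespace, separate the numbers.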