Improving and simplifying Python BeautifulSoup code

Date: 2017-01-05 03:24:57

Tags: python python-3.x beautifulsoup simplify

I have this code that uses BeautifulSoup to collect some data from a website:

import requests
from bs4 import BeautifulSoup

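url = "http://hearthstone.gamepedia.com/Patches"
page = requests.get(url)
soup = BeautifulSoup(page.content, "html.parser")

# Narrow the search step by step down to the link for the latest patch
variable = soup.find('div', {"id": "mw-content-text"})  # main content container
variable = variable.find_all('ul')[2]                   # third <ul> inside it
variable = variable.find('li')                          # first item of that list
variable = variable.find_all('a')[1]                    # second link in that item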

print(variable.text)

The output should be:

Patch 7.0.0.15590

Following these steps in order, I am able to find the exact tag I want.

To simplify it, how could I write this as a single line of code?

Variable = soup.find('div', {"id": "mw-content-text"}).find_all('ul')[2].find('li').find_all('a')[1]

I would like to achieve something like this, but it seems to behave the same way.

1 answer:

Answer 0 (score: 0)

import re

# Match every tag whose href attribute contains "/Patch_"
soup.find_all(href=re.compile(r'/Patch_'))

Out:

[<a href="/Patch_7.0.0.15590" title="Patch 7.0.0.15590">Patch 7.0.0.15590</a>,
 <a href="/Patch_6.2.0.15300" title="Patch 6.2.0.15300">Patch 6.2.0.15300</a>,
 <a href="/Patch_6.2.0.15181" title="Patch 6.2.0.15181">Patch 6.2.0.15181</a>,
 <a href="/Patch_6.1.3.14830" title="Patch 6.1.3.14830">Patch 6.1.3.14830</a>,
 <a href="/Patch_6.1.1.14406" title="Patch 6.1.1.14406">Patch 6.1.1.14406</a>,
 <a href="/Patch_6.0.0.13921" title="Patch 6.0.0.13921">Patch 6.0.0.13921</a>,
 <a href="/Patch_5.2.2.13807" title="Patch 5.2.2.13807">Patch 5.2.2.13807</a>,
 <a href="/Patch_5.2.0.13740" title="Patch 5.2.0.13740">Patch 5.2.0.13740</a>,
 <a href="/Patch_5.2.0.13714" title="Patch 5.2.0.13714">Patch 5.2.0.13714</a>,
 <a href="/Patch_5.2.0.13619" title="Patch 5.2.0.13619">Patch 5.2.0.13619</a>,
 <a href="/Patch_5.0.0.13030" title="Patch 5.0.0.13030">Patch 5.0.0.13030</a>,
 <a href="/Patch_5.0.0.12574" title="Patch 5.0.0.12574">Patch 5.0.0.12574</a>,
 <a href="/Patch_4.3.0.12266" title="Patch 4.3.0.12266">Patch 4.3.0.12266</a>,
 <a href="/Patch_4.2.0.12051" title="Patch 4.2.0.12051">Patch 4.2.0.12051</a>,
 <a href="/Patch_4.1.0.10956" title="Patch 4.1.0.10956">Patch 4.1.0.10956</a>,
 <a href="/Patch_4.0.0.10833" title="Patch 4.0.0.10833">Patch 4.0.0.10833 - The League of Explorers</a>,
 <a href="/Patch_3.2.0.10604" title="Patch 3.2.0.10604">Patch 3.2.0.10604</a>,
 <a href="/Patch_3.1.0.10357" title="Patch 3.1.0.10357">Patch 3.1.0.10357</a>,
 <a href="/Patch_3.0.0.9786" title="Patch 3.0.0.9786">Patch 3.0.0.9786 - The Grand Tournament Draws Near</a>,
 <a href="/Patch_2.8.0.9554" title="Patch 2.8.0.9554">Patch 2.8.0.9554</a>,
 <a href="/Patch_2.7.0.9166" title="Patch 2.7.0.9166">Patch 2.7.0.9166</a>,
 <a href="/Patch_2.6.0.8834" title="Patch 2.6.0.8834">Patch 2.6.0.8834</a>,

Use re to match the tags you want.
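Since the question asks for only the latest patch, the same regular-expression filter can be passed to find(), which returns the first match instead of a list. The following is a minimal sketch, not a tested scraper: the HTML string is a hypothetical, trimmed-down stand-in for the wiki page, and it assumes the newest patch link appears first in the document, as in the output above.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down stand-in for the real page markup.
html = """
<div id="mw-content-text">
  <ul><li><a href="/Hearthstone">Hearthstone</a></li></ul>
  <ul>
    <li><a href="/Patch_7.0.0.15590" title="Patch 7.0.0.15590">Patch 7.0.0.15590</a></li>
    <li><a href="/Patch_6.2.0.15300" title="Patch 6.2.0.15300">Patch 6.2.0.15300</a></li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find() stops at the first <a> whose href matches the pattern.
latest = soup.find("a", href=re.compile(r"^/Patch_"))

# find() returns None when nothing matches, so guard before reading .text.
latest_text = latest.text if latest is not None else None
print(latest_text)
```

Matching on the href pattern avoids depending on positional indexes such as [2] and [1], which tend to break when the page layout changes.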

Five kinds of filters can be used in find() and find_all():

  1. A string
  2. A regular expression
  3. A list
  4. A function
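
As a rough illustration of these filter types, the sketch below applies each one to a small, made-up HTML fragment (the markup and the helper name has_id are assumptions for the example only). The Beautiful Soup documentation also describes the value True as a filter that matches every tag, so it is included as well.

```python
import re
from bs4 import BeautifulSoup

# Made-up fragment used only to demonstrate the filters.
html = '<div><p id="intro">Hi</p><a href="/Patch_1">one</a><b>bold</b></div>'
soup = BeautifulSoup(html, "html.parser")

# 1. A string: matches tags with exactly this name.
by_string = [t.name for t in soup.find_all("a")]

# 2. A regular expression: matches tag names the pattern finds.
by_regex = [t.name for t in soup.find_all(re.compile(r"^b"))]

# 3. A list: matches any tag whose name is in the list.
by_list = [t.name for t in soup.find_all(["a", "b"])]

# 4. A function: called with each tag, keeps those where it returns True.
def has_id(tag):
    return tag.has_attr("id")

by_function = [t.name for t in soup.find_all(has_id)]

# True: matches every tag, but not text strings.
by_true = [t.name for t in soup.find_all(True)]

print(by_string, by_regex, by_list, by_function, by_true)
```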