Getting a column from a Wikipedia table using BeautifulSoup

Asked: 2014-11-06 20:43:59

Tags: python python-3.x beautifulsoup html-parsing

import requests
from bs4 import BeautifulSoup

source_code = requests.get('http://en.wikipedia.org/wiki/Taylor_Swift_discography')
soup = BeautifulSoup(source_code.text)
tables = soup.find_all("table")

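I am trying to get the list of song names from the "List of singles" table on Taylor Swift's discography page.

The table has no unique class or id. The only distinguishing feature I can think of is the caption tag wrapping the text "List of singles...":

List of singles as lead artist, with selected chart positions, sales figures and certifications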

I tried:

table = soup.find_all("caption")

But it returned nothing. Is caption perhaps not a tag that bs4 recognizes?

2 answers:

Answer 0: (score: 3)

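Actually, this has nothing to do with findAll() versus find_all(). findAll() was the name used in BeautifulSoup3 and was kept in BeautifulSoup4 for compatibility reasons. Quoting the bs4 source code: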

def find_all(self, name=None, attrs={}, recursive=True, text=None,
             limit=None, **kwargs):
    generator = self.descendants
    if not recursive:
        generator = self.children
    return self._find_all(name, attrs, text, limit, generator, **kwargs)

findAll = find_all       # BS3
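
To see the alias in action without touching the network, here is a minimal sketch run against a hypothetical, hand-written HTML fragment (not the real Wikipedia markup). It assumes the standard-library "html.parser" backend. It also shows that caption is matched like any other tag name. Newer bs4 releases may emit a deprecation warning for the old spelling, so the sketch silences warnings around that call.

```python
import warnings

from bs4 import BeautifulSoup

# Hypothetical miniature of the page: one table with a caption.
HTML = """
<table>
  <caption>List of singles as lead artist</caption>
  <tr><th scope="row">"Tim McGraw"</th></tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")

# New-style name.
new_style = soup.find_all("caption")

# Old-style (BS3) name; ignore any deprecation warning it may raise.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    old_style = soup.findAll("caption")

# Both calls find the same elements.
captions_match = new_style == old_style
caption_texts = [c.get_text(strip=True) for c in new_style]
print(captions_match, caption_texts)
```

If a caption search on the live page comes back empty, the more likely culprits are the downloaded HTML or the parser in use rather than the method name.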

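There is also a better way to get the list of singles: rely on the span element with id="Singles", which marks the start of the Singles section. Use find_next_sibling() to get the first table after the span's parent, then collect all th elements with scope="row":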

from bs4 import BeautifulSoup
import requests


source_code = requests.get('http://en.wikipedia.org/wiki/Taylor_Swift_discography')
soup = BeautifulSoup(source_code.content)

table = soup.find('span', id='Singles').parent.find_next_sibling('table')
for single in table.find_all('th', scope='row'):
    print(single.text)

Prints:

"Tim McGraw"
"Teardrops on My Guitar"
"Our Song"
"Picture to Burn"
"Should've Said No"
"Change"
"Love Story"
"White Horse"
"You Belong with Me"
"Fifteen"
"Fearless"
"Today Was a Fairytale"
"Mine"
"Back to December"
"Mean"
"The Story of Us"
"Sparks Fly"
"Ours"
"Safe & Sound"
(featuring The Civil Wars)
"Long Live"
(featuring Paula Fernandes)
"Eyes Open"
"We Are Never Ever Getting Back Together"
"Ronan"
"Begin Again"
"I Knew You Were Trouble"
"22"
"Highway Don't Care"
(with Tim McGraw)
"Red"
"Everything Has Changed"
(featuring Ed Sheeran)
"Sweeter Than Fiction"
"The Last Time"
(featuring Gary Lightbody)
"Shake It Off"
"Blank Space"
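
The same navigation can be tried offline. The sketch below is an illustration only: the HTML fragment and the helper name singles_after_heading are made up for this example, and the live page's markup may differ from it. It adds guards so that a missing heading or table yields an empty list instead of an AttributeError.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating the heading/table layout described above.
HTML = """
<h3><span id="Albums">Albums</span></h3>
<table><tr><th scope="row">Fearless</th></tr></table>
<h3><span id="Singles">Singles</span></h3>
<p>Intro text.</p>
<table>
  <tr><th scope="col">Title</th><th scope="col">Year</th></tr>
  <tr><th scope="row">"Tim McGraw"</th><td>2006</td></tr>
  <tr><th scope="row">"Our Song"</th><td>2007</td></tr>
</table>
"""


def singles_after_heading(html, heading_id="Singles"):
    """Return row-header texts from the first table after the given heading."""
    soup = BeautifulSoup(html, "html.parser")
    anchor = soup.find("span", id=heading_id)
    if anchor is None:
        return []
    # The span sits inside the heading; the table is a later sibling of it.
    table = anchor.parent.find_next_sibling("table")
    if table is None:
        return []
    return [th.get_text(strip=True) for th in table.find_all("th", scope="row")]


titles = singles_after_heading(HTML)
print(titles)
```

Because find_next_sibling("table") skips non-table siblings, the intro paragraph between the heading and the table does not get in the way.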

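Answer 1: (score: 1)

Here is a complete example that solves the "Taylor Swift problem". First find the caption that contains the text "List of singles" and move up to its parent object. Then iterate over the items that hold the text you are looking for: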

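# Find the caption that labels the singles table.
for caption in soup.findAll("caption"):
    if "List of singles" in caption.text:
        break

# The caption's parent is the <table> element itself.
table = caption.parent
for item in table.findAll("th", {"scope": "row"}):
    print(item.text)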

This gives:

"Tim McGraw"
"Teardrops on My Guitar"
"Our Song"
"Picture to Burn"
"Should've Said No"
"Change"
"Love Story"
"White Horse"
"You Belong with Me"
"Fifteen"
"Fearless"
"Today Was a Fairytale"
...
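
One caveat with the loop-and-break pattern: if no caption matches, the loop variable is left pointing at the last caption on the page, so the wrong table is read silently. A hedged variant is sketched below on a hypothetical HTML fragment. The helper name rows_for_caption is invented for this illustration, and it returns None when nothing matches.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with two captioned tables.
HTML = """
<table>
  <caption>List of studio albums</caption>
  <tr><th scope="row">Fearless</th></tr>
</table>
<table>
  <caption>List of singles as lead artist</caption>
  <tr><th scope="row">"Tim McGraw"</th></tr>
  <tr><th scope="row">"Our Song"</th></tr>
</table>
"""


def rows_for_caption(html, needle):
    """Return row-header texts of the first table whose caption contains needle."""
    soup = BeautifulSoup(html, "html.parser")
    for caption in soup.find_all("caption"):
        if needle in caption.get_text():
            table = caption.find_parent("table")
            return [th.get_text(strip=True)
                    for th in table.find_all("th", scope="row")]
    return None  # no caption matched


singles = rows_for_caption(HTML, "List of singles")
missing = rows_for_caption(HTML, "List of tours")
print(singles, missing)
```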