Pointing BeautifulSoup at a specific class

Time: 2015-10-12 13:48:43

Tags: python html beautifulsoup html-parsing

I'm trying to get BeautifulSoup to grab the numbers in the "Units Sold" column:

from bs4 import BeautifulSoup
from urllib import urlopen


html = urlopen('http://www.the-numbers.com/home-market/dvd-sales/2007')
soup = BeautifulSoup(html.read(), 'lxml')
units = soup.find_all("td", {"class": "data"})
print(units)

This prints all the information from every column, so I'm getting closer! How do I narrow the results down to just the "Units Sold" column?

1 Answer:

Answer 0: (score: 2)

How about iterating over the rows of the table and getting the text of the third cell:

for row in soup.select("div#page_filling_chart table tr")[1:]:
    cells = row('td')
    print cells[1].get_text(strip=True), cells[2].get_text(strip=True)

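Here div#page_filling_chart table tr is a CSS selector (https://gist.github.com/anonymous/74f7d66daba5920149e4) that matches tr elements inside the table element inside the div element with id="page_filling_chart". The [1:] slice skips the header row, and row('td') is shorthand for row.find_all('td').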

This prints the contents of the "Title" and "Units Sold" columns:

Pirates of the Caribbean - At World's End 13,699,490
Transformers 13,251,378
...
Halloween (2007) 1,172,994
Music and Lyrics 1,158,903
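
To try the selector and the cell indexing without fetching the live page, here is a minimal sketch in Python 3 syntax. The SAMPLE_HTML markup and the units_sold helper are hypothetical. They assume the same div id and column order (rank, title, units sold) that the answer relies on, and the real page may differ. The sketch also converts the comma-formatted text to integers, which the original answer does not do.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page: assumes the same div id and
# the column order (rank, title, units sold) used in the answer above.
SAMPLE_HTML = """
<div id="page_filling_chart">
  <table>
    <tr><th>Rank</th><th>Title</th><th>Units Sold</th></tr>
    <tr><td class="data">1</td><td>Pirates of the Caribbean - At World's End</td><td class="data">13,699,490</td></tr>
    <tr><td class="data">2</td><td>Transformers</td><td class="data">13,251,378</td></tr>
  </table>
</div>
"""


def units_sold(html):
    """Return (title, units) pairs from the chart table (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    # Skip the first row, which holds the column headers.
    for row in soup.select("div#page_filling_chart table tr")[1:]:
        cells = row.find_all("td")
        if len(cells) < 3:
            continue  # ignore rows that do not have all three columns
        title = cells[1].get_text(strip=True)
        # Remove thousands separators before converting to an integer.
        units = int(cells[2].get_text(strip=True).replace(",", ""))
        results.append((title, units))
    return results


for title, units in units_sold(SAMPLE_HTML):
    print(title, units)
```

Selecting rows first and indexing cells by position keeps each title paired with its number, whereas find_all("td", {"class": "data"}) returns the cells of every matching column in one flat list.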