Displaying web-scraped content

Time: 2016-02-05 16:53:17

Tags: python html beautifulsoup

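The code below prints all of the fields to the screen. Is there a way to get the fields "side by side", as they would appear in a database or spreadsheet? In the page source, the fields track, date, datetime, grade, distance and prizes are found in the resultsBlockHeader div class, while Fin (finishing position), Greyhound, Trap, SP, timeSec and timeDistance are found in the resultsBlock div. I am trying to get them displayed like this: track, date, datetime, grade, distance, prizes, fin, greyhound, trap, sp, timeSec, timeDistance, all on one line. Any help appreciated.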

from urllib.request import urlopen  # Python 3; on Python 2 use: from urllib import urlopen

from bs4 import BeautifulSoup

html = urlopen("http://www.gbgb.org.uk/resultsMeeting.aspx?id=135754")

bsObj = BeautifulSoup(html, 'lxml')

# Race header fields (inside the resultsBlockHeader div)
nameList = bsObj.findAll("div", {"class": "track"})
for name in nameList:
    print(name.get_text())

nameList = bsObj.findAll("div", {"class": "date"})
for name in nameList:
    print(name.get_text())

nameList = bsObj.findAll("div", {"class": "datetime"})
for name in nameList:
    print(name.get_text())

nameList = bsObj.findAll("div", {"class": "grade"})
for name in nameList:
    print(name.get_text())

nameList = bsObj.findAll("div", {"class": "distance"})
for name in nameList:
    print(name.get_text())

nameList = bsObj.findAll("div", {"class": "prizes"})
for name in nameList:
    print(name.get_text())

# Per-greyhound result fields (inside the resultsBlock div)
nameList = bsObj.findAll("li", {"class": "first essential fin"})
for name in nameList:
    print(name.get_text())

nameList = bsObj.findAll("li", {"class": "essential greyhound"})
for name in nameList:
    print(name.get_text())

nameList = bsObj.findAll("li", {"class": "trap"})
for name in nameList:
    print(name.get_text())

nameList = bsObj.findAll("li", {"class": "sp"})
for name in nameList:
    print(name.get_text())

nameList = bsObj.findAll("li", {"class": "timeSec"})
for name in nameList:
    print(name.get_text())

nameList = bsObj.findAll("li", {"class": "timeDistance"})
for name in nameList:
    print(name.get_text())

nameList = bsObj.findAll("li", {"class": "essential trainer"})
for name in nameList:
    print(name.get_text())

nameList = bsObj.findAll("li", {"class": "first essential comment"})
for name in nameList:
    print(name.get_text())

nameList = bsObj.findAll("div", {"class": "resultsBlockFooter"})
for name in nameList:
    print(name.get_text())

nameList = bsObj.findAll("li", {"class": "first essential"})
for name in nameList:
    print(name.get_text())

1 answer:

Answer 0 (score: 0)

First, make sure you are not violating the website's Terms of Use; stay on the legal side.

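The markup is not very easy to scrape, but what I would do is iterate over the race headers and, for each header, get the desired information about the race. Then, get the sibling results block and extract the rows. Sample code to get you started, extracting the track and the greyhound: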

from pprint import pprint
from urllib.request import urlopen  # Python 3; on Python 2 use: from urllib2 import urlopen

from bs4 import BeautifulSoup


html = urlopen("http://www.gbgb.org.uk/resultsMeeting.aspx?id=135754")
soup = BeautifulSoup(html, 'lxml')

rows = []
# Each race starts with a header div that holds the race-level fields.
for header in soup.find_all("div", class_="resultsBlockHeader"):
    track = header.find("div", class_="track").get_text(strip=True)

    # The results for this race live in the sibling block that follows the header.
    results = header.find_next_sibling("div", class_="resultsBlock").find_all("ul", class_="line1")
    for result in results:
        greyhound = result.find("li", class_="greyhound").get_text(strip=True)

        # One dict per greyhound, with the race-level value repeated on every row.
        rows.append({
            "track": track,
            "greyhound": greyhound
        })

pprint(rows)
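
Since the question asks for the fields side by side, as in a spreadsheet, a list of dictionaries shaped like rows maps naturally onto CSV. The following is a minimal sketch using the standard library's csv.DictWriter. The helper name rows_to_csv and the sample values are made up for illustration and do not come from the site:

```python
import csv
import io


def rows_to_csv(rows, fieldnames):
    """Serialize a list of dicts to CSV text, one dict per line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up sample data in the same shape as the `rows` list built by the scraper.
sample_rows = [
    {"track": "Example Track", "greyhound": "Example Dog A"},
    {"track": "Example Track", "greyhound": "Example Dog B"},
]

csv_text = rows_to_csv(sample_rows, ["track", "greyhound"])
print(csv_text)
```

To write to a file instead, the same writer can be pointed at open("results.csv", "w", newline="") rather than an in-memory buffer; the file name here is only an example.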

Note that every row you see in the table is actually represented by 3 rows in the markup:

<ul class="contents line1">
   ...
</ul>
<ul class="contents line2">
   ...
</ul>
<ul class="contents line3">
   ...
</ul>

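The greyhound value is located inside the first ul (the one with the line1 class); you may need to get the line2 and line3 elements using result.find_next_sibling("ul", class_="line2") and result.find_next_sibling("ul", class_="line3").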
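
Putting the pieces together, here is a rough sketch of how the header fields and the three ul lines could be merged into one row per greyhound. It runs against a small inline HTML sample rather than the live page. The sample markup, the sample values, the helper names text_of and extract_rows, and in particular the assumption about which li sits in which of line1, line2 and line3 are all hypothetical, so the field lists would need adjusting against the real page source:

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the structure described above.
# Which <li> lives in which line is an assumption, not taken from the site.
SAMPLE_HTML = """
<div class="resultsBlockHeader">
  <div class="track">Example Track</div>
  <div class="date">01/01/16</div>
</div>
<div class="resultsBlock">
  <ul class="contents line1">
    <li class="first essential fin">1</li>
    <li class="essential greyhound">Example Dog A</li>
    <li class="trap">3</li>
  </ul>
  <ul class="contents line2">
    <li class="essential trainer">A Trainer</li>
  </ul>
  <ul class="contents line3">
    <li class="first essential comment">Led throughout</li>
  </ul>
</div>
"""

HEADER_FIELDS = ["track", "date"]
# Assumed mapping from each line to the fields it holds.
LINE_FIELDS = {
    "line1": ["fin", "greyhound", "trap"],
    "line2": ["trainer"],
    "line3": ["comment"],
}


def text_of(parent, tag, css_class):
    """Return the stripped text of the first matching child, or '' if absent."""
    node = parent.find(tag, class_=css_class) if parent is not None else None
    return node.get_text(strip=True) if node is not None else ""


def extract_rows(html):
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for header in soup.find_all("div", class_="resultsBlockHeader"):
        # Race-level values, repeated on every row of this race.
        race = {name: text_of(header, "div", name) for name in HEADER_FIELDS}
        block = header.find_next_sibling("div", class_="resultsBlock")
        if block is None:
            continue
        for line1 in block.find_all("ul", class_="line1"):
            lines = {
                "line1": line1,
                "line2": line1.find_next_sibling("ul", class_="line2"),
                "line3": line1.find_next_sibling("ul", class_="line3"),
            }
            row = dict(race)
            for line_name, fields in LINE_FIELDS.items():
                for field in fields:
                    row[field] = text_of(lines[line_name], "li", field)
            rows.append(row)
    return rows


rows = extract_rows(SAMPLE_HTML)
for row in rows:
    # All fields of one greyhound on a single line.
    print(", ".join(row.values()))
```

Missing elements come back as empty strings instead of raising AttributeError, which seems safer for markup that is not fully regular. Note that find_next_sibling returns the next matching ul, so if a result ever lacked its own line2 or line3, the lookup would pick up the one belonging to the following result.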