Question

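The code below prints all the fields to the screen. Is there a way to get the fields "side by side", as they would appear in a database or spreadsheet? In the page source, the fields track, date, datetime, grade, distance and prizes are found in the resultsBlockHeader div class, while Fin (finishing position), Greyhound, Trap, SP, timeSec and Time Distance are found in the resultsBlock div. I am trying to get them displayed like this: track, date, datetime, grade, distance, prizes, fin, greyhound, trap, sp, timeSec, timeDistance, all on one line. Any help appreciated.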

from urllib.request import urlopen  # Python 3; on Python 2 use: from urllib import urlopen

from bs4 import BeautifulSoup

html = urlopen("http://www.gbgb.org.uk/resultsMeeting.aspx?id=135754")

bsObj = BeautifulSoup(html, 'lxml')

# Race header fields (inside the resultsBlockHeader div)
nameList = bsObj.findAll("div", {"class": "track"})
for name in nameList:
    print(name.get_text())

nameList = bsObj.findAll("div", {"class": "date"})
for name in nameList:
    print(name.get_text())

nameList = bsObj.findAll("div", {"class": "datetime"})
for name in nameList:
    print(name.get_text())

nameList = bsObj.findAll("div", {"class": "grade"})
for name in nameList:
    print(name.get_text())

nameList = bsObj.findAll("div", {"class": "distance"})
for name in nameList:
    print(name.get_text())

nameList = bsObj.findAll("div", {"class": "prizes"})
for name in nameList:
    print(name.get_text())

# Per-runner fields (inside the resultsBlock div)
nameList = bsObj.findAll("li", {"class": "first essential fin"})
for name in nameList:
    print(name.get_text())

nameList = bsObj.findAll("li", {"class": "essential greyhound"})
for name in nameList:
    print(name.get_text())

nameList = bsObj.findAll("li", {"class": "trap"})
for name in nameList:
    print(name.get_text())

nameList = bsObj.findAll("li", {"class": "sp"})
for name in nameList:
    print(name.get_text())

nameList = bsObj.findAll("li", {"class": "timeSec"})
for name in nameList:
    print(name.get_text())

nameList = bsObj.findAll("li", {"class": "timeDistance"})
for name in nameList:
    print(name.get_text())

nameList = bsObj.findAll("li", {"class": "essential trainer"})
for name in nameList:
    print(name.get_text())

nameList = bsObj.findAll("li", {"class": "first essential comment"})
for name in nameList:
    print(name.get_text())

nameList = bsObj.findAll("div", {"class": "resultsBlockFooter"})
for name in nameList:
    print(name.get_text())

nameList = bsObj.findAll("li", {"class": "first essential"})
for name in nameList:
    print(name.get_text())

Answer 1

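First, make sure you are not violating the website's Terms of Use; stay on the legal side.

The markup is not very easy to scrape, but what I would do is iterate over the race headers and, for each header, get the desired information about the race. Then get the sibling results block and extract the rows. Sample code to get you started, extracting the track and the greyhound: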

from pprint import pprint
from urllib.request import urlopen  # Python 3; on Python 2 use: from urllib2 import urlopen

from bs4 import BeautifulSoup


html = urlopen("http://www.gbgb.org.uk/resultsMeeting.aspx?id=135754")
soup = BeautifulSoup(html, 'lxml')

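rows = []
# Each race starts with a header div that carries the race-level fields
for header in soup.find_all("div", class_="resultsBlockHeader"):
    track = header.find("div", class_="track").get_text(strip=True)

    # The runners of this race live in the results block that follows the header
    results = header.find_next_sibling("div", class_="resultsBlock").find_all("ul", class_="line1")
    for result in results:
        greyhound = result.find("li", class_="greyhound").get_text(strip=True)

        # One dict per runner, repeating the race-level fields on every row
        rows.append({
            "track": track,
            "greyhound": greyhound
        })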

pprint(rows)
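
Since the goal is to get the fields side by side as in a spreadsheet, a list of dicts like rows can be written out as CSV, one runner per line. The following is only a minimal sketch using the standard csv module: the sample values are placeholders rather than real results, and the helper name rows_to_csv is made up for illustration.

```python
import csv
import io

# Hypothetical sample rows in the same shape as the `rows` list built above;
# the values are placeholders, not real race results.
rows = [
    {"track": "Track A", "greyhound": "Dog One"},
    {"track": "Track A", "greyhound": "Dog Two"},
]


def rows_to_csv(rows, fieldnames):
    """Return the rows as CSV text: a header line, then one line per row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = rows_to_csv(rows, ["track", "greyhound"])
print(csv_text)
```

To save to a file instead, the same DictWriter can be pointed at open("results.csv", "w", newline="") (the file name is just an example). Adding more keys to each dict (date, grade, trap and so on) and to the fieldnames list extends the columns.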

Note that every row you see in the table is actually represented by 3 lines in the markup:

<ul class="contents line1">
   ...
</ul>
<ul class="contents line2">
   ...
</ul>
<ul class="contents line3">
   ...
</ul>

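The greyhound value is located inside the first ul (the one with the line1 class); to reach the line2 and line3 lists you may need result.find_next_sibling("ul", class_="line2") and, likewise, result.find_next_sibling("ul", class_="line3").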
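
As a rough illustration of walking from a line1 list to its line2 and line3 siblings, here is a sketch that runs against a small inline HTML sample instead of the live page. The sample markup is hypothetical: the li class names come from the question, but which ul each li sits in on the real page is an assumption that should be checked against the actual source. The helper li_text is also made up for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup. Placing "trainer" in line2 and "comment"
# in line3 is an assumption, not something confirmed from the real page.
SAMPLE_HTML = """
<div class="resultsBlock">
  <ul class="contents line1"><li class="essential greyhound">Dog One</li></ul>
  <ul class="contents line2"><li class="essential trainer">Trainer One</li></ul>
  <ul class="contents line3"><li class="first essential comment">Comment one</li></ul>
  <ul class="contents line1"><li class="essential greyhound">Dog Two</li></ul>
  <ul class="contents line2"><li class="essential trainer">Trainer Two</li></ul>
  <ul class="contents line3"><li class="first essential comment">Comment two</li></ul>
</div>
"""


def li_text(ul, css_class):
    """Return the stripped text of the first matching <li>, or "" if absent."""
    li = ul.find("li", class_=css_class) if ul is not None else None
    return li.get_text(strip=True) if li is not None else ""


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

rows = []
for line1 in soup.find_all("ul", class_="line1"):
    # Nearest following siblings that carry the line2 / line3 class
    line2 = line1.find_next_sibling("ul", class_="line2")
    line3 = line1.find_next_sibling("ul", class_="line3")
    rows.append({
        "greyhound": li_text(line1, "greyhound"),
        "trainer": li_text(line2, "trainer"),
        "comment": li_text(line3, "comment"),
    })

for row in rows:
    print(row)
```

One caveat: find_next_sibling returns the nearest following match, so if a runner had no line2 list of its own, the call would pick up the next runner's. It is worth verifying on the real markup that all three lists are always present.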
