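The code below prints all of the fields to the screen. Is there a way to get the fields "side by side", as they would appear in a database or spreadsheet? In the page source, the fields track, date, datetime, grade, distance and prizes are found in the resultsBlockHeader div class, while Fin (finishing position), Greyhound, Trap, SP, timeSec and Time Distance are found in the resultsBlock div. I am trying to display them like this: track, date, datetime, grade, distance, prizes, fin, greyhound, trap, sp, timeSec, timeDistance, all on one line. Any help appreciated.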
from urllib.request import urlopen  # Python 3 (on Python 2: from urllib import urlopen)

from bs4 import BeautifulSoup

html = urlopen("http://www.gbgb.org.uk/resultsMeeting.aspx?id=135754")
bsObj = BeautifulSoup(html, 'lxml')

# Race header fields (inside the resultsBlockHeader div)
nameList = bsObj.findAll("div", {"class": "track"})
for name in nameList:
    print(name.get_text())
nameList = bsObj.findAll("div", {"class": "date"})
for name in nameList:
    print(name.get_text())
nameList = bsObj.findAll("div", {"class": "datetime"})
for name in nameList:
    print(name.get_text())
nameList = bsObj.findAll("div", {"class": "grade"})
for name in nameList:
    print(name.get_text())
nameList = bsObj.findAll("div", {"class": "distance"})
for name in nameList:
    print(name.get_text())
nameList = bsObj.findAll("div", {"class": "prizes"})
for name in nameList:
    print(name.get_text())

# Per-runner fields (inside the resultsBlock div)
nameList = bsObj.findAll("li", {"class": "first essential fin"})
for name in nameList:
    print(name.get_text())
nameList = bsObj.findAll("li", {"class": "essential greyhound"})
for name in nameList:
    print(name.get_text())
nameList = bsObj.findAll("li", {"class": "trap"})
for name in nameList:
    print(name.get_text())
nameList = bsObj.findAll("li", {"class": "sp"})
for name in nameList:
    print(name.get_text())
nameList = bsObj.findAll("li", {"class": "timeSec"})
for name in nameList:
    print(name.get_text())
nameList = bsObj.findAll("li", {"class": "timeDistance"})
for name in nameList:
    print(name.get_text())
nameList = bsObj.findAll("li", {"class": "essential trainer"})
for name in nameList:
    print(name.get_text())
nameList = bsObj.findAll("li", {"class": "first essential comment"})
for name in nameList:
    print(name.get_text())
nameList = bsObj.findAll("div", {"class": "resultsBlockFooter"})
for name in nameList:
    print(name.get_text())
nameList = bsObj.findAll("li", {"class": "first essential"})
for name in nameList:
    print(name.get_text())
Answer 0 (score: 0)
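First, make sure you are not violating the website's Terms of Use; stay on the legal side.
The markup is not very easy to scrape, but what I would do is iterate over the race headers and, for each header, get the desired information about the race. Then get the sibling results block and extract the rows. Sample code to get you started, extracting the track and the greyhound: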
from pprint import pprint
from urllib.request import urlopen  # Python 3 (on Python 2: from urllib2 import urlopen)

from bs4 import BeautifulSoup

html = urlopen("http://www.gbgb.org.uk/resultsMeeting.aspx?id=135754")
soup = BeautifulSoup(html, 'lxml')

rows = []
# Each race starts with a header div; its results live in the sibling resultsBlock div.
for header in soup.find_all("div", class_="resultsBlockHeader"):
    track = header.find("div", class_="track").get_text(strip=True)
    results = header.find_next_sibling("div", class_="resultsBlock").find_all("ul", class_="line1")
    for result in results:
        greyhound = result.find("li", class_="greyhound").get_text(strip=True)
        rows.append({
            "track": track,
            "greyhound": greyhound
        })

pprint(rows)
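The same pattern can probably be extended to all twelve fields from the question and flattened into one line per runner. The following is only a sketch: SAMPLE_HTML is made-up markup built from the class names mentioned in the question (the real page may be structured differently), the values in it are placeholders, and the helper names text_of, extract_rows and rows_to_csv are hypothetical. It uses the built-in "html.parser" so that it runs without lxml or a network connection.

```python
import csv
import io

from bs4 import BeautifulSoup

# Made-up markup for illustration only; the class names come from the question,
# the nesting and the values are assumptions.
SAMPLE_HTML = """
<div class="resultsBlockHeader">
  <div class="track">Track A</div>
  <div class="date">01/01/16</div>
  <div class="datetime">18:47</div>
  <div class="grade">A5</div>
  <div class="distance">400m</div>
  <div class="prizes">1st 100</div>
</div>
<div class="resultsBlock">
  <ul class="contents line1">
    <li class="first essential fin">1</li>
    <li class="essential greyhound">Dog One</li>
    <li class="trap">3</li>
    <li class="sp">2/1</li>
    <li class="timeSec">24.50</li>
    <li class="timeDistance">24.50 (1)</li>
  </ul>
  <ul class="contents line1">
    <li class="first essential fin">2</li>
    <li class="essential greyhound">Dog Two</li>
    <li class="trap">5</li>
    <li class="sp">3/1</li>
    <li class="timeSec">24.70</li>
    <li class="timeDistance">24.70 (2)</li>
  </ul>
</div>
"""

HEADER_FIELDS = ["track", "date", "datetime", "grade", "distance", "prizes"]
RESULT_FIELDS = ["fin", "greyhound", "trap", "sp", "timeSec", "timeDistance"]


def text_of(parent, tag, css_class):
    """Return the stripped text of the first matching descendant, or '' if absent."""
    node = parent.find(tag, class_=css_class)
    return node.get_text(strip=True) if node else ""


def extract_rows(html):
    """Build one dict per runner, repeating the race header fields on every row."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for header in soup.find_all("div", class_="resultsBlockHeader"):
        race = {name: text_of(header, "div", name) for name in HEADER_FIELDS}
        block = header.find_next_sibling("div", class_="resultsBlock")
        if block is None:
            continue  # header without a results block: nothing to emit
        for line in block.find_all("ul", class_="line1"):
            row = dict(race)
            for name in RESULT_FIELDS:
                row[name] = text_of(line, "li", name)
            rows.append(row)
    return rows


def rows_to_csv(rows):
    """Serialise the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=HEADER_FIELDS + RESULT_FIELDS, lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = extract_rows(SAMPLE_HTML)
print(rows_to_csv(rows))
```

To try this against the live page, the html string returned by urlopen could be passed to extract_rows instead of SAMPLE_HTML, and the CSV text written to a file that a spreadsheet can open. Because class_ matches a single class token, class_="date" does not also match the datetime div.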
Note that every row you see in the table is actually represented by 3 lines in the markup:
<ul class="contents line1">
...
</ul>
<ul class="contents line2">
...
</ul>
<ul class="contents line3">
...
</ul>
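The greyhound value is located inside the first ul (the one with the line1 class); to reach line2 and line3 you may need to use result.find_next_sibling("ul", class_="line2") and result.find_next_sibling("ul", class_="line3").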
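A minimal sketch of that sibling lookup, under stated assumptions: SAMPLE_HTML is invented markup, and placing the trainer in line2 and the comment in line3 is a guess based on the class names in the question, not something confirmed from the real page. The helper child_text is hypothetical.

```python
from bs4 import BeautifulSoup

# Invented markup: which fields live in line2 and line3 is an assumption.
SAMPLE_HTML = """
<div class="resultsBlock">
  <ul class="contents line1">
    <li class="essential greyhound">Dog One</li>
  </ul>
  <ul class="contents line2">
    <li class="essential trainer">Trainer One</li>
  </ul>
  <ul class="contents line3">
    <li class="first essential comment">Comment one</li>
  </ul>
  <ul class="contents line1">
    <li class="essential greyhound">Dog Two</li>
  </ul>
  <ul class="contents line2">
    <li class="essential trainer">Trainer Two</li>
  </ul>
  <ul class="contents line3">
    <li class="first essential comment">Comment two</li>
  </ul>
</div>
"""


def child_text(ul, css_class):
    """Return the stripped text of the matching li inside ul, or '' if missing."""
    if ul is None:
        return ""
    li = ul.find("li", class_=css_class)
    return li.get_text(strip=True) if li else ""


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
records = []
for line1 in soup.find_all("ul", class_="line1"):
    # find_next_sibling returns the first later sibling that matches,
    # skipping the whitespace text nodes between the ul elements.
    line2 = line1.find_next_sibling("ul", class_="line2")
    line3 = line1.find_next_sibling("ul", class_="line3")
    records.append({
        "greyhound": child_text(line1, "greyhound"),
        "trainer": child_text(line2, "trainer"),
        "comment": child_text(line3, "comment"),
    })

for record in records:
    print(record)
```

One caveat with this approach: if a runner had no line2 or line3 of its own, find_next_sibling would return the one belonging to a later runner, so it may be worth checking that assumption against the real markup.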