Question

I've included a snippet of the HTML that I want to scrape.

I want to go through each row (tbody) and scrape the relevant data using xml.

The XPath for each row is:

//*[@id="re_"]/table/tbody

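But I'm not sure how to set this up in Python to loop over each tbody. The number of tbody rows isn't fixed, so there could be any number of them. For example: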

 for each tbody:   
      ...get data

Here is the HTML page:

http://www.racingpost.com/horses/result_home.sd?race_id=651402&r_date=2016-06-07&popup=yes#results_top_tabs=re_&results_bottom_tabs=ANALYSIS

Answer 1

Using lxml, you can pull the table directly by its class name and extract all of the tbody tags with the XPath //table[@class="grid resultRaceGrid"]/tbody:

from lxml import html

x = html.parse("http://www.racingpost.com/horses/result_home.sd?race_id=651402&r_date=2016-06-07&popup=yes#results_top_tabs=re_&results_bottom_tabs=ANALYSIS")

tbodys = x.xpath('//table[@class="grid resultRaceGrid"]/tbody')
# iterate over the list of tbody tags
for tbody in tbodys:
    # get all the rows from the tbody
    for row in tbody.xpath("./tr"):
        # extract the tds and do whatever you want.
        tds = row.xpath("./td")
        print(tds)

Obviously you can be more specific: the td tags have class names you can use for extraction, and some of the tr tags have classes too.
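To make "more specific" concrete, here is a minimal sketch that runs against a small inline HTML string rather than the live page. The table class comes from the answer above, but the td class names `horse` and `price`, the sample data, and the helper `extract_rows` are made up for illustration; the real class names would need to be checked in the page source.

```python
from lxml import html

# Hypothetical markup: the table class matches the answer above, but the
# td class names ("horse", "price") and the data are invented for this sketch.
SAMPLE = """
<table class="grid resultRaceGrid">
  <tbody><tr><td class="horse">Alpha</td><td class="price">5/1</td></tr></tbody>
  <tbody><tr><td class="horse">Bravo</td><td class="price">7/2</td></tr></tbody>
</table>
"""


def extract_rows(markup):
    """Return one dict per tbody, using XPath queries relative to each tbody."""
    tree = html.fromstring(markup)
    results = []
    for tbody in tree.xpath('//table[@class="grid resultRaceGrid"]/tbody'):
        # The leading "." keeps each query scoped to the current tbody.
        horse = tbody.xpath('.//td[@class="horse"]/text()')
        price = tbody.xpath('.//td[@class="price"]/text()')
        results.append({
            "horse": horse[0].strip() if horse else None,
            "price": price[0].strip() if price else None,
        })
    return results


rows = extract_rows(SAMPLE)
for row in rows:
    print(row)
```

Because each inner query starts with `.`, it only searches inside the current tbody, so it does not matter how many tbody elements the table contains.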

Answer 2

I think you'd be interested in BeautifulSoup.

Given your data, if you wanted to print all of the comment text, it would look like this:

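from bs4 import BeautifulSoup

# html_doc is assumed to hold the page's HTML as a string
soup = BeautifulSoup(html_doc, 'html.parser')

for tbody in soup.find_all('tbody'):
    # select_one takes a CSS selector; find('.commentText') would look for a tag with that name
    comment = tbody.select_one('.commentText')
    if comment is not None:
        print(comment.get_text())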

You can do much fancier things as well; you can read more in the BeautifulSoup documentation.
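As a self-contained illustration of the same loop, the sketch below parses a small inline HTML string instead of the real page. Only the `commentText` class comes from the answer above; the sample markup, the comment strings, and the helper `collect_comments` are assumptions made for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: only the "commentText" class comes from the answer
# above; the surrounding structure and text are invented for this sketch.
SAMPLE_HTML = """
<table>
  <tbody><tr><td class="commentText"> led, ran on well </td></tr></tbody>
  <tbody><tr><td>no comment cell in this one</td></tr></tbody>
  <tbody><tr><td class="commentText">held up, never nearer</td></tr></tbody>
</table>
"""


def collect_comments(markup):
    """Return the stripped comment text from every tbody that has one."""
    soup = BeautifulSoup(markup, "html.parser")
    comments = []
    for tbody in soup.find_all("tbody"):
        # find(class_=...) matches on the class attribute, not the tag name.
        cell = tbody.find(class_="commentText")
        if cell is None:
            continue  # skip tbody elements without a comment cell
        comments.append(cell.get_text(strip=True))
    return comments


comments = collect_comments(SAMPLE_HTML)
print(comments)
```

Checking for `None` before calling `get_text()` avoids an `AttributeError` on any tbody that has no matching element.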

Python xml - how to loop through and get the data
