Question

I have been scraping this website for a while:

http://directory.ccnecommunity.org/reports/rptAccreditedPrograms_New.asp?sort=institution

I need to extract the data listed under the Master's degree for each university.

As you may have noticed, not every university has Master's data, so I need to keep track of which ones do.

How can I keep track of the data in this situation?

Here is my Python code so far, using XPath:

import __future__
from lxml import html
import requests
from bs4 import BeautifulSoup

page = requests.get('http://directory.ccnecommunity.org/reports/rptAccreditedPrograms_New.asp?sort=institution')

soup = str(BeautifulSoup(page.content, 'html.parser'))

tree = html.fromstring(soup)

for table in tree.xpath('//table[@width="95%" and @align="center" and @class="center"]'):
    print('-- NEW TABLE -- \n')
    tab = table.xpath('.//table[@width="260px"]/tr/td[@style="width: 100%;"]/text()')
    print(tab)

print('Ready !!')

As you can see, it prints -- NEW TABLE --, but the tab variable is an empty list.

For each table, the tab variable should contain the Baccalaureate, Master's, and Doctor of Nursing Practice entries.

Answer 1

Try:

for table in tree.xpath('(//tr[ td[span="Baccalaureate"] or td[contains(span,"Master")] ]/ancestor::tr[1])'):
  print('-- NEW TABLE -- \n')
  tab = table.xpath('.//table[@width="260px"]/tr/td[@style="width: 100%;"]/text()')
  print(tab)

Answer 2

You can use the following XPath to extract the Master's data.

//span[contains(text(),'Master')]/parent::td[1]
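
To keep track of which institutions have no Master's entry, one option is to run that kind of expression relative to each institution's table and record None when nothing matches. The following is only a minimal sketch: the SAMPLE markup, the h3 institution names, the accreditation strings, and the masters_by_institution helper are all made up for illustration, and the real page's structure will differ, so the XPath expressions would need adjusting.

```python
from lxml import html

# Hypothetical, simplified markup (NOT the real page): one outer table per
# institution, with an inner table listing its degree programs.
SAMPLE = """
<html><body>
<table class="center">
  <tr><td><h3>University A</h3>
    <table>
      <tr><td><span>Baccalaureate</span></td><td>Accredited 2010</td></tr>
      <tr><td><span>Master's</span></td><td>Accredited 2012</td></tr>
    </table>
  </td></tr>
</table>
<table class="center">
  <tr><td><h3>University B</h3>
    <table>
      <tr><td><span>Baccalaureate</span></td><td>Accredited 2008</td></tr>
    </table>
  </td></tr>
</table>
</body></html>
"""


def masters_by_institution(doc):
    """Map each institution name to its Master's cell text, or None if absent."""
    results = {}
    for table in doc.xpath('//table[@class="center"]'):
        # string(...) returns the text of the first matching h3 in this table.
        name = table.xpath('string(.//h3)').strip()
        # The leading "." keeps the search inside the current table only.
        cells = table.xpath(
            './/tr[td/span[contains(text(), "Master")]]/td[2]/text()'
        )
        # An empty result list means this institution has no Master's row.
        results[name] = cells[0].strip() if cells else None
    return results


tree = html.fromstring(SAMPLE)
result = masters_by_institution(tree)
print(result)
```

Because every institution gets a key in the dictionary, the ones without a Master's program remain visible as None instead of silently disappearing from the output.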
