Question

I'm writing a spider with Scrapy, and the following line of Python gets me the data I need:

response.css("article.college div.span8.profile > table > tbody > tr").extract()

It returns the following result:

['<tr>\n<th>Institution Name:</th>\n<td>Harvard University</td>\n</tr>',
 '<tr>\n<th>Administration</th>\n<td>Private</td>\n</tr>',
 '<tr>\n<th>State</th>\n<td>\nMassachussets\t\n</td>\n</tr>']

However, I would like to look up a value by its attribute name. I'd like to do something like this:

response.css(<magic containing 'Institution Name'>)

and get back the corresponding value, which in this case would be:

\n<td>Harvard University</td>\n

Can anyone help me with this?

Thanks

Answer 1

You can try XPath:

response.xpath('//tr[th="Institution Name:"]/td/text()').extract()

Answer 2

I rewrote your extractor as an XPath expression:

response.xpath("//table//tbody//tr[contains(., 'Institution Name')]/td/text()").extract()

I just added a condition that matches any tr whose text contains Institution Name (case-sensitive), then selects the td inside that tr.

Answer 3

In this case I use a list comprehension like this:

institution_name = [line.css("td").extract_first() for line in response.css("article.college div.span8.profile > table > tbody > tr") if "Institution Name" in line.extract()]

How do I create a CSS selector that selects the content of a td by the content of its th?
