Question

I am wrapping a library with something along the lines of the following:

import requests
from lxml.html import fromstring

URL = "https://test"
COOKIES = {"test": "AAAAAAAAAAAAA"}
HEADERS = {
    "Connection": "close",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.89 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "en-US,en;q=0.9",
}

response = requests.get(URL, headers=HEADERS, cookies=COOKIES)
source = fromstring(response.content)

table = source.xpath("")

The response contains a lot of content, and I am trying to isolate the items in a table. The relevant part of the response is:

<table border="0" cellpadding="0" cellspacing="0" width="100%" class="dialogHdrTbl" summary="Layout table"><thead><tr align="left"><th class="groupHdr"><div class="groupHdr">View Client List</div></th></tr></thead><tbody><tr><td height="1"></td></tr></tbody></table><table width="100%" cellpadding="0" cellspacing="0" border="0" summary="Data table" class="dialogTbl"><tbody><tr class="altRwFlse"><td height="25" headers="hdr1" class="c1">TEST CLIENT 0</td><td height="25" headers="hdr2"><a class="dialogLnk" href="javascript:opener.document.contactForm.company.value=&quot;TEST CLIENT 1&quot;;self.close();" target="">Select</a></td></tr><tr class="altRwTre"><td height="25" headers="hdr1" class="c1">TEST CLIENT 2</td>

I am trying to output:

TEST CLIENT 0 TEST CLIENT 1 TEST CLIENT 2

I have looked at using XPath (based on this post: How to parse text from a html table element), but I don't quite understand how to form my XPath query. What am I missing here?

Answer 1

You can try the code below to get the required output:

[i.split('value="')[-1].replace('";self.close();', '') for i in source.xpath('//table[@summary="Data table"]//td[not(a)]/text() | //table[@summary="Data table"]//td/a/@href')]

The output should be:

['TEST CLIENT 0', 'TEST CLIENT 1', 'TEST CLIENT 2']
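
To make it easier to see what each part of that one-liner does, here is a self-contained sketch that runs against a trimmed copy of the HTML from the question instead of a live request. The HTML string, the XPATH constant, and the client_name helper are names made up for this illustration, and the truncated last row has been closed by hand so the fragment is well-formed.

```python
from lxml.html import fromstring

# Trimmed copy of the fragment from the question (assumption: the
# truncated final row is closed here so the sample is complete).
HTML = """
<table summary="Layout table" class="dialogHdrTbl"><thead><tr><th class="groupHdr"><div class="groupHdr">View Client List</div></th></tr></thead><tbody><tr><td height="1"></td></tr></tbody></table>
<table summary="Data table" class="dialogTbl"><tbody>
<tr class="altRwFlse"><td headers="hdr1" class="c1">TEST CLIENT 0</td>
<td headers="hdr2"><a class="dialogLnk" href="javascript:opener.document.contactForm.company.value=&quot;TEST CLIENT 1&quot;;self.close();">Select</a></td></tr>
<tr class="altRwTre"><td headers="hdr1" class="c1">TEST CLIENT 2</td></tr>
</tbody></table>
"""

# Two node sets joined with the XPath union operator "|":
#   1. the text of every <td> in the data table that has no <a> child
#   2. the href attribute of every <a> that sits directly inside a <td>
XPATH = (
    '//table[@summary="Data table"]//td[not(a)]/text()'
    ' | //table[@summary="Data table"]//td/a/@href'
)


def client_name(value):
    """Strip the JavaScript wrapper from an href; plain cell text passes through."""
    # For plain text there is no 'value="' marker, so split() returns the
    # whole string as its only element and replace() finds nothing to remove.
    return value.split('value="')[-1].replace('";self.close();', '')


source = fromstring(HTML)
names = [client_name(node) for node in source.xpath(XPATH)]
print(names)
```

The `@summary="Data table"` predicate keeps the empty cell of the layout table out of the result, and the parser has already decoded `&quot;` to `"` by the time the href reaches `client_name`, which is why the split marker is written as `value="`. The string cleanup depends on the exact shape of the href in this page, so it would need adjusting if the JavaScript in the link changes.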
