I am wrapping a library with something along the lines of the following:
import requests
from lxml.html import fromstring
URL = "https://test"
COOKIES = {"test": "AAAAAAAAAAAAA"}
HEADERS = {"Connection": "close", "Upgrade-Insecure-Requests": "1", "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.89 Safari/537.36", "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8", "Accept-Encoding": "gzip, deflate", "Accept-Language": "en-US,en;q=0.9"}
response = requests.get(URL, headers=HEADERS, cookies=COOKIES)
source = fromstring(response.content)
table = source.xpath("")
The response contains a lot of content, and I am trying to isolate the items in a table. The relevant part of the response is:
<table border="0" cellpadding="0" cellspacing="0" width="100%" class="dialogHdrTbl" summary="Layout table"><thead><tr align="left"><th class="groupHdr"><div class="groupHdr">View Client List</div></th></tr></thead><tbody><tr><td height="1"></td></tr></tbody></table><table width="100%" cellpadding="0" cellspacing="0" border="0" summary="Data table" class="dialogTbl"><tbody><tr class="altRwFlse"><td height="25" headers="hdr1" class="c1">TEST CLIENT 0</td><td height="25" headers="hdr2"><a class="dialogLnk" href="javascript:opener.document.contactForm.company.value="TEST CLIENT 1";self.close();" target="">Select</a></td></tr><tr class="altRwTre"><td height="25" headers="hdr1" class="c1">TEST CLIENT 2</td>
I am trying to output:
TEST CLIENT 0 TEST CLIENT 1 TEST CLIENT 2
I have looked into using XPath (based on this post: How to parse text from a html table element), but I don't really understand how to form my XPath query. What am I missing here?
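Answer 0 (score: 0)
You can try the code below to get the required output. It selects the text of every cell in the "Data table" that has no link, plus the href of every link in that table, and then strips the JavaScript wrapper from each href: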
[i.split('value="')[-1].replace('";self.close();', '') for i in source.xpath('//table[@summary="Data table"]//td[not(a)]/text() | //table[@summary="Data table"]//td/a/@href')]
The output should be:
['TEST CLIENT 0', 'TEST CLIENT 1', 'TEST CLIENT 2']
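For anyone who wants to try this without hitting the real URL, here is a minimal sketch of the same idea run against a trimmed copy of the table. A few things here are assumptions and not from the question: the SAMPLE string is a shortened version of the posted HTML, the inner quotes in the href are written as &quot; entities (which is probably how the real page encodes them), and extract_clients is a hypothetical helper name. It also uses a regular expression in place of the split/replace chain, which is just one possible way to pull out the quoted value.

```python
import re
from lxml.html import fromstring

# Trimmed sample of the response (assumption: inner quotes arrive as &quot;).
SAMPLE = (
    '<html><body>'
    '<table summary="Layout table" class="dialogHdrTbl"><tbody>'
    '<tr><td height="1"></td></tr></tbody></table>'
    '<table summary="Data table" class="dialogTbl"><tbody>'
    '<tr class="altRwFlse"><td headers="hdr1" class="c1">TEST CLIENT 0</td>'
    '<td headers="hdr2"><a class="dialogLnk" href="javascript:opener.document.'
    'contactForm.company.value=&quot;TEST CLIENT 1&quot;;self.close();">'
    'Select</a></td></tr>'
    '<tr class="altRwTre"><td headers="hdr1" class="c1">TEST CLIENT 2</td></tr>'
    '</tbody></table>'
    '</body></html>'
)


def extract_clients(source):
    """Hypothetical helper: collect client names from the data table."""
    # Union of two node sets, returned in document order:
    #   1. text of cells that do not contain a link
    #   2. href attribute of each link inside a cell
    nodes = source.xpath(
        '//table[@summary="Data table"]//td[not(a)]/text()'
        ' | //table[@summary="Data table"]//td/a/@href'
    )
    names = []
    for node in nodes:
        # An href looks like ...company.value="NAME";self.close();
        match = re.search(r'value="([^"]*)"', node)
        names.append(match.group(1) if match else node.strip())
    return names


if __name__ == "__main__":
    print(extract_clients(fromstring(SAMPLE)))
```

With the real response you would pass fromstring(response.content) instead of the sample. If the page formats the href differently, the regular expression would need adjusting.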