Please see the HTML table below:
<table width=900 cellspacing=0 border=0 cellpadding=5 style='border-top:1px solid silver;border-left:1px solid silver;border-right:1px solid silver;'>
<tr >
<td style='border-bottom:1px solid silver;background:#ffffff;'>
<input checked type=checkbox name=jobs[] value='610974'>
<table border=0 cellpadding=2 cellspacing=0 style='border:4px #70797a; border-radius: 5px;'>
<tr>
<td style='background:lightgreen;' valign=top>
<img src='../images/checkwhite.png' style='width:30px;'>
</td>
<td style='background:lightgreen;'> 123 Charter Rd Wethersfield CT 06109 </td>
<tr>
<td>Your Input</td>
<td>123 CHARTER RD WETHERSFIELD CT 06109</td>
</tr>
</table>
<br clear=all>
<div style='margin-left:40px;'>09/11/2018
<br>Exterior BPO - Light Photo Set (3 photos*)
<br>$9.00 We found a rep 6.2 miles from job.
<span style='color:silver'> 640x480 Add Datestamp, </span>
<br clear=all>
<div style=float:left;'></div>
</div>
</td>
</td>
<tr >
<td style='border-bottom:1px solid silver;background:#ffffff;'>
<div style='color:red; font-weight:bold; '>Warning... Duplicate Found!</div>
<input checked type=checkbox name=jobs[] value='610975'>
<table border=0 cellpadding=2 cellspacing=0 style='border:4px #70797a; border-radius: 5px;'>
<tr>
<td style='background:lightgreen;' valign=top>
<img src='../images/checkwhite.png' style='width:30px;'>
</td>
<td style='background:lightgreen;'> 123 Charter Rd Wethersfield CT 06109 </td>
<tr>
<td>Your Input</td>
<td>123 CHARTER RD WETHERSFIELD CT 06109</td>
</tr>
</table>
<br clear=all>
<div style='margin-left:40px;'>09/11/2018
<br>Exterior BPO - Light Photo Set (3 photos*)
<br>$9.00 We found a rep 6.2 miles from job.
<span style='color:silver'> 640x480 Add Datestamp, </span>
<br clear=all>
<div style=float:left;'>
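I need the output to be:
id="610974" and Address="123 CHARTER RD WETHERSFIELD CT 06109" [the first checkbox's value is the id, with its corresponding address]
id="610975" and Address="123 CHARTER RD WETHERSFIELD CT 06109" [the checkbox's value is the id, with its corresponding address]
and so on...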
from bs4 import BeautifulSoup

soup = BeautifulSoup(bodystrip, "lxml")
for tr in soup.find_all('tr'):
    tds = tr.find_all('td')
    print(tds[0].text)
    jobid = tds[0].find('input')
    print(jobid)
This returns the wrong address, although the id is retrieved correctly.
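Answer 0 (score: 0)
Using Scrapy:
for input_node in response.xpath('//input[@name="jobs[]"]'):
    # The checkbox value is the job id
    job_id = input_node.xpath('./@value').extract_first()
    # The address is the cell that follows the "Your Input" label in the next table
    address = input_node.xpath('./following-sibling::table[1]//td[.="Your Input"]/following-sibling::td[1]/text()').extract_first()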
Answer 1 (score: 0)
Using BeautifulSoup, this should work:
for job in soup.find_all('input', attrs={"type": "checkbox"}):
    print(job['value'])
    print(job.parent.find_all('td', attrs={'style': True})[1].text)
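For comparison, here is a minimal sketch that uses only Python's standard-library `html.parser` and pairs each `jobs[]` checkbox with the cell that follows the "Your Input" label. This gives the uppercase address asked for in the question. The names `JobAddressParser`, `extract_jobs`, and `SAMPLE` are hypothetical and were made up for this illustration. The sketch assumes every job block has a `jobs[]` checkbox followed by a "Your Input" cell, as in the markup above.

```python
from html.parser import HTMLParser


class JobAddressParser(HTMLParser):
    """Hypothetical helper: collects (id, address) pairs from the job markup."""

    def __init__(self):
        super().__init__()
        self.jobs = []                # finished (id, address) pairs
        self._job_id = None           # value of the most recent jobs[] checkbox
        self._in_td = False           # True while inside the innermost open <td>
        self._text = []               # text fragments of the current <td>
        self._expect_address = False  # True right after the "Your Input" cell

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and attrs.get("name") == "jobs[]":
            self._job_id = attrs.get("value")
        elif tag == "td":
            # A new cell starts: begin collecting its text from scratch.
            self._in_td = True
            self._text = []

    def handle_data(self, data):
        if self._in_td:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag != "td" or not self._in_td:
            return
        self._in_td = False
        # Collapse runs of whitespace so the label comparison is exact.
        text = " ".join("".join(self._text).split())
        if self._expect_address:
            self.jobs.append((self._job_id, text))
            self._expect_address = False
        elif text == "Your Input":
            self._expect_address = True


def extract_jobs(html):
    """Return a list of (id, address) tuples found in the given HTML string."""
    parser = JobAddressParser()
    parser.feed(html)
    parser.close()
    return parser.jobs


# Reduced sample modelled on the markup in the question (assumption: same layout).
SAMPLE = """
<td><input checked type=checkbox name=jobs[] value='610974'>
<table><tr><td> 123 Charter Rd Wethersfield CT 06109 </td>
<tr><td>Your Input</td><td>123 CHARTER RD WETHERSFIELD CT 06109</td></tr>
</table></td>
"""

for job_id, address in extract_jobs(SAMPLE):
    print(f'id="{job_id}" and Address="{address}"')
# prints: id="610974" and Address="123 CHARTER RD WETHERSFIELD CT 06109"
```

Matching on the "Your Input" label rather than on a cell position keeps the lookup independent of how many styled cells come before the address, which may be useful because the markup has unclosed `<tr>` tags.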