Unable to extract the checkbox value from each table tr

Time: 2018-09-10 05:03:10

Tags: python python-3.x web-scraping beautifulsoup scrapy

Please see the HTML table below:

<table width=900 cellspacing=0 border=0 cellpadding=5 style='border-top:1px solid silver;border-left:1px solid silver;border-right:1px solid silver;'>
    <tr >
        <td style='border-bottom:1px solid silver;background:#ffffff;'>
            <input checked type=checkbox name=jobs[] value='610974'>
                <table border=0 cellpadding=2 cellspacing=0 style='border:4px #70797a; border-radius: 5px;'>
                    <tr>
                        <td style='background:lightgreen;' valign=top>
                            <img src='../images/checkwhite.png' style='width:30px;'>
                            </td>
                            <td style='background:lightgreen;'> 123 Charter Rd Wethersfield CT  06109 &nbsp; &nbsp;  </td>
                            <tr>
                                <td>Your Input</td>
                                <td>123 CHARTER RD WETHERSFIELD CT 06109</td>
                            </tr>
                        </table>
                        <br clear=all>
                            <div style='margin-left:40px;'>09/11/2018 &nbsp; &nbsp; &nbsp; 
                                <br>Exterior BPO - Light Photo Set (3 photos*)  &nbsp; &nbsp; &nbsp; 
                                    <br>$9.00 &nbsp; &nbsp; &nbsp; We found a rep 6.2 miles from job.  &nbsp; &nbsp; &nbsp;  &nbsp; &nbsp; &nbsp; 
                                        <span style='color:silver'> 640x480  &nbsp; &nbsp; &nbsp; Add Datestamp, </span>
                                        <br clear=all>
                                            <div style=float:left;'></div>
                                        </div>
                                    </td>
                                </td>
                                <tr >
                                    <td style='border-bottom:1px solid silver;background:#ffffff;'>
                                        <div style='color:red; font-weight:bold; '>Warning... Duplicate Found!</div>
                                        <input checked type=checkbox name=jobs[] value='610975'>
                                            <table border=0 cellpadding=2 cellspacing=0 style='border:4px #70797a; border-radius: 5px;'>
                                                <tr>
                                                    <td style='background:lightgreen;' valign=top>
                                                        <img src='../images/checkwhite.png' style='width:30px;'>
                                                        </td>
                                                        <td style='background:lightgreen;'> 123 Charter Rd Wethersfield CT  06109 &nbsp; &nbsp;  </td>
                                                        <tr>
                                                            <td>Your Input</td>
                                                            <td>123 CHARTER RD WETHERSFIELD CT 06109</td>
                                                        </tr>
                                                    </table>
                                                    <br clear=all>
                                                        <div style='margin-left:40px;'>09/11/2018 &nbsp; &nbsp; &nbsp; 
                                                            <br>Exterior BPO - Light Photo Set (3 photos*)  &nbsp; &nbsp; &nbsp; 
                                                                <br>$9.00 &nbsp; &nbsp; &nbsp; We found a rep 6.2 miles from job.  &nbsp; &nbsp; &nbsp;  &nbsp; &nbsp; &nbsp; 
                                                                    <span style='color:silver'> 640x480  &nbsp; &nbsp; &nbsp; Add Datestamp, </span>
                                                                    <br clear=all>
                                                                        <div style=float:left;'>

I need the output to be:

  

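id = "610974" and Address = "123 CHARTER RD WETHERSFIELD CT 06109" [the first checkbox's value is the id, followed by its corresponding address]
id = "610975" and Address = "123 CHARTER RD WETHERSFIELD CT 06109" [the second checkbox's value is the id, followed by its corresponding address]
and so on...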

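from bs4 import BeautifulSoup

# bodystrip holds the page HTML shown above
soup = BeautifulSoup(bodystrip, "lxml")
for tr in soup.find_all('tr'):  # iterate over the parsed soup, not the raw response
    tds = tr.find_all('td')
    print(tds[0].text)            # text of the row's first cell (includes the address)
    jobid = tds[0].find('input')  # first <input> inside that cell, or None
    print(jobid)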

With this code the address is retrieved correctly, but the checkbox value (the id) is not.

2 Answers:

Answer 0 (score: 0)

Using Scrapy:

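for input_node in response.xpath('//input[@name="jobs[]"]'):
    # the checkbox's value attribute is the job id
    job_id = input_node.xpath('./@value').extract_first()
    # address: the cell next to "Your Input" in the table that follows the checkbox
    address = input_node.xpath('./following-sibling::table[1]//td[.="Your Input"]/following-sibling::td[1]/text()').extract_first()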

Answer 1 (score: 0)

This should work using BeautifulSoup:

for job in soup.find_all('input',attrs={"type":"checkbox"}):
    print(job['value'])
    print(job.parent.find_all('td',attrs={'style':True})[1].text)
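
For anyone who wants to try this outside a spider, here is a minimal, self-contained sketch of the same idea. It is not the original poster's code: `SAMPLE_HTML` is a trimmed, well-formed stand-in for the page in the question, `extract_jobs` is a hypothetical helper name, and it uses the built-in `html.parser` backend rather than `lxml`. It also assumes that each checkbox comes before its own nested address table, and it reads the address from the cell next to "Your Input", since that is the form shown in the requested output.

```python
from bs4 import BeautifulSoup

# Trimmed stand-in for the page in the question (assumption: in the real
# markup each checkbox is followed by its own nested address table).
SAMPLE_HTML = """
<table>
  <tr><td>
    <input checked type="checkbox" name="jobs[]" value="610974">
    <table>
      <tr><td><img src="../images/checkwhite.png"></td>
          <td> 123 Charter Rd Wethersfield CT  06109 </td></tr>
      <tr><td>Your Input</td>
          <td>123 CHARTER RD WETHERSFIELD CT 06109</td></tr>
    </table>
  </td></tr>
  <tr><td>
    <div>Warning... Duplicate Found!</div>
    <input checked type="checkbox" name="jobs[]" value="610975">
    <table>
      <tr><td><img src="../images/checkwhite.png"></td>
          <td> 123 Charter Rd Wethersfield CT  06109 </td></tr>
      <tr><td>Your Input</td>
          <td>123 CHARTER RD WETHERSFIELD CT 06109</td></tr>
    </table>
  </td></tr>
</table>
"""


def extract_jobs(html):
    """Return a list of (job_id, address) pairs, one per jobs[] checkbox."""
    soup = BeautifulSoup(html, "html.parser")
    jobs = []
    # "name" must go through attrs, because find_all's own name argument
    # is the tag name.
    for checkbox in soup.find_all("input", attrs={"name": "jobs[]"}):
        # First "Your Input" label cell that appears after this checkbox.
        label = checkbox.find_next("td", string="Your Input")
        # The address sits in the cell right after the label cell.
        address_cell = label.find_next_sibling("td") if label else None
        address = address_cell.get_text(strip=True) if address_cell else None
        jobs.append((checkbox["value"], address))
    return jobs


for job_id, address in extract_jobs(SAMPLE_HTML):
    print(f'id = "{job_id}" and Address = "{address}"')
```

Starting from the checkbox and walking forward keeps each id paired with its own address, whereas looping over every `tr` also visits the rows of the nested tables, which contain no `input` at all. On the real page the markup is not well formed, so the result may depend on which parser is used.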