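Unable to extract data using Python Scrapy

Time: 2018-09-26 09:21:55

Tags: python python-3.x web-scraping scrapy python-3.6

I am unable to scrape the data from the checkbox and the address field in the following markup: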

<table width=900 cellspacing=0 border=0 cellpadding=5 style='border-top:1px solid silver;border-left:1px solid silver;border-right:1px solid silver;'>
<tr id='row618534' >
   <td style='border-bottom:1px solid silver;background:#ffffff;' padding-bottom :10px;>
      <div id='r618534'>
      <div style='color:red; font-weight:bold; '>
         Warning... Duplicate Found!
      </div>
      <table width=100% border=0 cellpadding=2 cellspacing=0 style='margin-top:15px;border:4px #70797a; border-radius: 5px;'>
         <tr>
            <td style='background:lightgreen; width:55px;' valign=top> 
               <img src='../images/checkwhite.png' style='width:30px;'>
            </td>
            <td style='background:lightgreen;'>
               <input checked type=checkbox name=jobs[] value='618534'>
               <strong>2 Colonial Dr Newport Beach CA  92660</strong> &nbsp; &nbsp;   
            <td style='background:lightgreen;' align=right><input type='hidden' id='miles618534'><span style='margin-left:0px;' onclick="sub618534()"  class='button_input'> Process this order</span></span></td>
         <tr>
            <td>Your Input</td>
            <td  style='padding-left:28px;'>2 COLONIAL DR NEWPORT BEACH CA 92660</td>
            <td align=right><a href='customer_multi_jobs_review.php?del=1&djob=NjE4NTM0' style='color:blue;'><b><img title='Remove / Delete Order' src='../images/deletorder.png' style='width:30px;'></b></a></td>
         </tr>
      </table>
      <div style=' margin-left:40px;'>
      Exterior BPO - Light Photo Set (3 photos*)  &nbsp; &nbsp; &nbsp; <br>$9.00 &nbsp; &nbsp; &nbsp; We found a rep 4.6 miles from order.  &nbsp; &nbsp; &nbsp;  &nbsp; &nbsp; &nbsp; <span style='color:silver'> Resolution 640x480  &nbsp; &nbsp; &nbsp;  GPS REQUIRED:  Yes  <span style='margin-left:10px;'>Datestamped </span> </span><br clear=all>
      <div style=float:left;'>
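I need two values from this markup: the ID from <input checked type=checkbox name=jobs[] value='618534'>, and the address that follows the text "Your Input".

I tried several approaches, but I only get the ID and cannot capture the address. My code is below: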

for input_node in response.xpath('//input[@name="jobs[]"]'):
    id = input_node.xpath('./@value').extract_first()
    address = input_node.xpath('./following-sibling::table[1]//td[.="Your Input"]/following-sibling::td[1]/text()').extract_first()

1 Answer:

Answer 0: (score: 0)

Try the following. It extracts both of the fields you need.

from scrapy import Selector

htmldoc = """
<table width=900 cellspacing=0 border=0 cellpadding=5 style='border-top:1px solid silver;border-left:1px solid silver;border-right:1px solid silver;'><tr id='row618534' ><td style='border-bottom:1px solid silver;background:#ffffff;' padding-bottom :10px;><div id='r618534'><div style='color:red; font-weight:bold; '>Warning... Duplicate Found!</div> <table width=100% border=0 cellpadding=2 cellspacing=0 style='margin-top:15px;border:4px #70797a; border-radius: 5px;'><tr><td style='background:lightgreen; width:55px;' valign=top><img src='../images/checkwhite.png' style='width:30px;'></td><td style='background:lightgreen;'><input checked type=checkbox name=jobs[] value='618534'>  <strong>2 Colonial Dr Newport Beach CA  92660</strong> &nbsp; &nbsp;   <td style='background:lightgreen;' align=right><input type='hidden' id='miles618534'><span style='margin-left:0px;' onclick="sub618534()"  class='button_input'> Process this order</span></span></td><tr><td>Your Input</td><td  style='padding-left:28px;'>2 COLONIAL DR NEWPORT BEACH CA 92660</td><td align=right><a href='customer_multi_jobs_review.php?del=1&djob=NjE4NTM0' style='color:blue;'><b><img title='Remove / Delete Order' src='../images/deletorder.png' style='width:30px;'></b></a></td></tr></table><div style=' margin-left:40px;'> Exterior BPO - Light Photo Set (3 photos*)  &nbsp; &nbsp; &nbsp; <br>$9.00 &nbsp; &nbsp; &nbsp; We found a rep 4.6 miles from order.  &nbsp; &nbsp; &nbsp;  &nbsp; &nbsp; &nbsp; <span style='color:silver'> Resolution 640x480  &nbsp; &nbsp; &nbsp;  GPS REQUIRED:  Yes  <span style='margin-left:10px;'>Datestamped </span> </span><br clear=all><div style=float:left;'>
"""
sel = Selector(text=htmldoc)
for input_node in sel.xpath('//tr//input[@name="jobs[]"]'):
    id_num =  input_node.xpath('./@value').extract_first()
    address = input_node.xpath('.//following::td[contains(text(),"Your Input")]//following-sibling::td//text()').extract_first().strip()
    print(f'{id_num}\n{address}')

Output it produces:

618534
2 COLONIAL DR NEWPORT BEACH CA 92660