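Crawling unstructured pages with Scrapy

Time: 2017-08-08 14:33:52

Tags: scrapy web-crawler bigdata

I'm having trouble implementing a spider for a web application because its pages are not well structured at all. The pages contain fields, but some fields are sometimes missing, and they are hard to tell apart because they are identified only by classes, not by ids. Is there a better way to extract the data from pages like this?

Here is an example of a page to be scraped: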

    <div class = 'view-activity-field-wrapper even' style = 'display:none' >
       <div class="view-activity-label">Status Notes            <span><img src="/images/helpIcon.png" alt="" width="8" height="10" align="absmiddle" data-tooltip="stickyStatusNotes" /></span>
      </div>
                                    <div class="view-activity-field"></div>     
    </div>
    <div style = 'clear:both'></div>
    <div class = 'view-activity-field-wrapper odd' style = 'display:none'  >
       <div class="view-activity-label">Relevant Question            <span><img src="/images/helpIcon.png" alt="" width="8" height="10" align="absmiddle" data-tooltip="stickyRelevantQuestion" /></span>
      </div>
                                    <div class="view-activity-field"></div>     
    </div>
    <div style = 'clear:both'></div>

    <!-- KEEP VALUE PROVIDED HERE -->
    <div class = 'view-activity-field-wrapper odd' style = 'display:none' >
       <div class="view-activity-label">Value Provided           <span><img src="/images/helpIcon.png" alt="" width="8" height="10" align="absmiddle" data-tooltip="viewvalueprovided" /></span>
       </div>
                                    <div class="view-activity-field"></div>     
    </div>
    <div style = 'clear:both'></div>

0 answers:

No answers