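I'm having trouble implementing a spider for a web application because its pages are not well structured. The page contains fields, but some of them are not always present, and they are hard to tell apart because they are identified only by classes, not by ids. Is there a better way to extract the data from a page like this?
Here is a sample of the page to be scraped: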
<div class = 'view-activity-field-wrapper even' style = 'display:none' >
<div class="view-activity-label">Status Notes <span><img src="/images/helpIcon.png" alt="" width="8" height="10" align="absmiddle" data-tooltip="stickyStatusNotes" /></span>
</div>
<div class="view-activity-field"></div>
</div>
<div style = 'clear:both'></div>
<div class = 'view-activity-field-wrapper odd' style = 'display:none' >
<div class="view-activity-label">Relevant Question <span><img src="/images/helpIcon.png" alt="" width="8" height="10" align="absmiddle" data-tooltip="stickyRelevantQuestion" /></span>
</div>
<div class="view-activity-field"></div>
</div>
<div style = 'clear:both'></div>
<!-- KEEP VALUE PROVIDED HERE -->
<div class = 'view-activity-field-wrapper odd' style = 'display:none' >
<div class="view-activity-label">Value Provided <span><img src="/images/helpIcon.png" alt="" width="8" height="10" align="absmiddle" data-tooltip="viewvalueprovided" /></span>
</div>
<div class="view-activity-field"></div>
</div>
<div style = 'clear:both'></div>
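One possible direction, given that the markup above pairs each `view-activity-label` with a `view-activity-field` inside the same wrapper, is to key every value by the text of its label instead of relying on position or ids. The following is only a minimal sketch using Python's standard-library `html.parser` rather than a full spider. The names `LabelFieldParser`, `extract_fields`, and `SAMPLE`, and the value "Some value", are hypothetical and only for illustration. It assumes every label div is followed by its field div, as in the sample.

```python
from html.parser import HTMLParser


class LabelFieldParser(HTMLParser):
    """Collect {label text: field text} pairs from label/field div pairs."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._stack = []   # one entry per open <div>: "label", "field", or None
        self._label = None  # most recently closed label, waiting for its field
        self._buf = {"label": [], "field": []}

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        if "view-activity-label" in classes:
            role = "label"
        elif "view-activity-field" in classes:
            role = "field"
        else:
            role = None  # wrapper or spacer div
        if role:
            self._buf[role] = []
        self._stack.append(role)

    def handle_data(self, data):
        # Attach text to the innermost enclosing label or field div, if any.
        for role in reversed(self._stack):
            if role:
                self._buf[role].append(data)
                break

    def handle_endtag(self, tag):
        if tag != "div" or not self._stack:
            return
        role = self._stack.pop()
        if role == "label":
            self._label = " ".join("".join(self._buf["label"]).split())
        elif role == "field" and self._label is not None:
            value = " ".join("".join(self._buf["field"]).split())
            # Empty fields are kept with None so missing data stays visible.
            self.fields[self._label] = value or None
            self._label = None


def extract_fields(html):
    parser = LabelFieldParser()
    parser.feed(html)
    parser.close()
    return parser.fields


# Hypothetical sample modelled on the page above; "Some value" is made up.
SAMPLE = """
<div class = 'view-activity-field-wrapper odd' style = 'display:none' >
<div class="view-activity-label">Value Provided <span><img src="/images/helpIcon.png" alt="" /></span>
</div>
<div class="view-activity-field">Some value</div>
</div>
<div style = 'clear:both'></div>
<div class = 'view-activity-field-wrapper even' style = 'display:none' >
<div class="view-activity-label">Status Notes <span></span>
</div>
<div class="view-activity-field"></div>
</div>
"""

fields = extract_fields(SAMPLE)
print(fields.get("Value Provided"))      # value keyed by label text
print(fields.get("Relevant Question"))   # absent field: .get() returns None
```

Looking values up with `.get()` means a field that does not appear on a given page simply yields `None` instead of raising, which may suit pages where fields come and go.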