Question

我正在尝试解析此页面上具有相同层次结构的所有链接。我没有得到任何追溯，但也没有得到数据。

我正在尝试从突出显示的代码部分获取href标记：

我目前的代码是：

def link_parser(soup,itemsList):
for item in soup.findAll("div", { "class" : "tileInfo" }):
    for link in item.findAll("a", { "class" : "productClick productTitle" }):
        try:
            itemsList.put(removeNonAscii(html_parser.unescape(link.string)).replace(',',' ')+","+clean_a_url(link['href']))
        except Exception:
            print "Formatting error: "
            traceback.print_exc(file=sys.stdout)

return ""

Answer 1

看起来你正试图抓住Target的网站 - 也许是this page。

你遇到了网页抓取的一个基本困难 - 你看到的是而不是总是你得到的。在这种情况下，在加载页面后，它们在一堆内容中是AJAXing。注意第一次加载页面时的小风车动画 - 您尝试访问的内容在DOM中不存在，直到他们在该页面上运行的所有各种js脚本运行。（而且他们有全部）

我点击了一下，看起来负责生成该内容的代码就是jquery：

   <script id="productTitleTmpl" type="text/x-jquery-tmpl" >
        {{if $item.parent.parent.viewType != "details"}}
            {{tmpl($data.itemAttributes) "#productBrandTmpl"}}
        {{/if}}
        <a class="productClick productTitle" id="prodTitle-{{= $item.parent.parent.viewType}}-{{= $item.parent.parent.currentPageNumber}}-{{= $item.parent.productCounter}}" href="/{{= productDetailPageURL}}#prodSlot={{= $item.parent.parent.viewType}}_{{= $item.parent.parent.currentPageNumber}}_{{= $item.parent.productCounter}}" title="{{= title}}" name="prodTitle_{{= $item.catalogEntryId}}">
            {{= $item.parent.parent.fetchProductTitleForView($item.productTitle)}}
        </a>

所以，无论如何。如果您在抓取此页面时确实已经死定，则需要放弃urllib（或者您用于获取html的任何内容）。而是使用支持javascript的无头浏览器（如selenium）访问此页面，让javascript运行，然后抓取它。所有这些都超出了这个答案的范围，但你可以浏览各种无头浏览器解决方案，找到适合你的解决方案。

Python 2.7 Beautiful Soup-解析链接列表

1 个答案: