Question

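I'm looking for a way to retrieve the ad URLs from this site: http://www.quiltingboard.com/resources/

What I'd like to do is write a script that keeps refreshing the page and grabs the ad URLs.

Any suggestions?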

Answer 1

Beautiful Soup is exactly what you're looking for.

Answer 2

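I'm assuming you're talking about the text ads at the top of the screen.

You won't be able to do this with a Python parsing library alone, because those links are loaded with JavaScript after the page itself has loaded.

One option is to use a tool like Selenium, which lets you load the page in a real browser. Once it has loaded, you can use BeautifulSoup to scan for the links you're after: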

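from bs4 import BeautifulSoup

# `html` is the page source after JavaScript has run
# (for example, Selenium's driver.page_source).
soup = BeautifulSoup(html, 'html.parser')
ads_div = soup.find('div', attrs={'class': 'ads'})
if ads_div:
    # href=True skips anchors that have no href attribute.
    for link in ads_div.find_all('a', href=True):
        print(link['href'])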
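
The snippet above assumes `html` already holds the rendered page. Here is a minimal sketch of how it might be obtained with Selenium. The function name `fetch_rendered_html`, the 15-second wait, the choice of Firefox, and the `driver_factory` hook are all assumptions made for illustration; nothing here comes from the site itself.

```python
import time


def fetch_rendered_html(url, wait_seconds=15, driver_factory=None, sleep=time.sleep):
    """Load `url` in a browser and return the HTML after scripts have run.

    `driver_factory` is a hypothetical hook so another browser (or a fake
    driver in tests) can be swapped in; by default Firefox is started.
    """
    if driver_factory is None:
        # Imported lazily so this module loads even without Selenium installed.
        from selenium import webdriver
        driver_factory = webdriver.Firefox

    driver = driver_factory()
    try:
        driver.get(url)
        # Crude fixed wait so the ad scripts have time to inject their markup.
        sleep(wait_seconds)
        return driver.page_source
    finally:
        # Always close the browser, even if loading fails.
        driver.quit()
```

Calling `html = fetch_rendered_html("http://www.quiltingboard.com/resources/")` would then give a string that can be passed to BeautifulSoup. A fixed sleep is the simplest thing that could work; an explicit wait on a specific element would be more robust once you know what the ad container looks like.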

Answer 3

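BeautifulSoup alone won't cut it. The ads are injected via JavaScript (they're DoubleClick ads).

Your options are:

Script something like Selenium to look for the URLs 10-15 seconds after the page loads.
If you stay in pure Python, you need to:
1. Parse the initial request with Beautiful Soup
2. Figure out what Google is going to inject with JavaScript
3. Make a secondary request to the DoubleClick iframe or payload URL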
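
Step 1 of the pure-Python route can be sketched roughly as follows. To stay dependency-free this uses the standard library's `html.parser` rather than Beautiful Soup. The names `AdSourceCollector` and `find_ad_sources`, and the idea of filtering on the substring `doubleclick` in the host name, are assumptions for illustration. Step 2 (working out what the script injects) still has to be done by hand.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class AdSourceCollector(HTMLParser):
    """Collect src URLs of <script> and <iframe> tags served from an ad host."""

    def __init__(self, base_url, host_marker="doubleclick"):
        super().__init__()
        self.base_url = base_url
        self.host_marker = host_marker  # assumed marker for DoubleClick hosts
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("script", "iframe"):
            return
        src = dict(attrs).get("src")
        if not src:
            return
        # Resolve relative and protocol-relative URLs against the page URL.
        absolute = urljoin(self.base_url, src)
        if self.host_marker in urlparse(absolute).netloc:
            self.sources.append(absolute)


def find_ad_sources(html, base_url):
    """Return ad-host script/iframe URLs found in `html`, in document order."""
    collector = AdSourceCollector(base_url)
    collector.feed(html)
    return collector.sources
```

Each URL returned is a candidate for the secondary request in step 3.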

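Either approach only gives you the DoubleClick URLs that handle conversion tracking. If you want to know where they redirect to, you'll need to open those URLs and follow their redirects.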
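
A minimal sketch of following a tracking URL to its destination, using only the standard library. The function name `resolve_final_url` and the injectable `opener` parameter are assumptions for illustration. Note that this only follows HTTP-level redirects; a redirect done in JavaScript or with a meta refresh would not be picked up.

```python
import urllib.request


def resolve_final_url(tracking_url, opener=urllib.request.urlopen, timeout=10):
    """Follow HTTP redirects from `tracking_url` and return the URL reached."""
    # urlopen follows 3xx redirects by default; geturl() reports the final URL.
    with opener(tracking_url, timeout=timeout) as response:
        return response.geturl()
```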

Answer 4

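I would take a look at Scrapy.

It's a web scraping library. It lets you crawl websites and scrape information from them very easily. The link above is the official tutorial, which has plenty of example code, including examples that cover what you need.

A simple scrape, taken from the site: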

import scrapy


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        # Each <li> under a <ul> is one listed site.
        for site in response.xpath('//ul/li'):
            # Yield one plain dict per site; Scrapy accepts dicts as items.
            yield {
                'title': site.xpath('a/text()').getall(),
                'link': site.xpath('a/@href').getall(),
                'desc': site.xpath('text()').getall(),
            }

This is a small example of using the library to scrape links; it produces:

[dmoz] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
     {'desc': [u' - By David Mertz; Addison Wesley. Book in progress, full text, ASCII format. Asks for feedback. [author website, Gnosis Software, Inc.\n'],
      'link': [u'http://gnosis.cx/TPiP/'],
      'title': [u'Text Processing in Python']}
[dmoz] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
     {'desc': [u' - By Sean McGrath; Prentice Hall PTR, 2000, ISBN 0130211192, has CD-ROM. Methods to build XML applications fast, Python tutorial, DOM and SAX, new Pyxie open source XML processing library. [Prentice Hall PTR]\n'],
      'link': [u'http://www.informit.com/store/product.aspx?isbn=0130211192'],
      'title': [u'XML Processing with Python']}

Pretty cool.

Retrieving ad URLs
