Question

I'm trying to parse an HTML result, grab a few URLs from it, and then parse the output of visiting those URLs.

I am using Django 1.5 / Python 2.7:

views.py

    # mechanize/beautifulsoup config options here.
    beautifulSoupObj = BeautifulSoup(mechanizeBrowser.response().read())  # read the raw response
    getFirstPageLinks = beautifulSoupObj.find_all('cite')  # get first page of urls

    url_data = UrlData(NumberOfUrlsFound, getDomainLinksFromGoogle)
    #url_data = UrlData(5, 'myapp.com')
    #return HttpResponse(MaxUrlsToGather)

    print url_data.url_list()

    return render(request, 'myapp/scan/process_scan.html', {
        'url_data':url_data,'EnteredDomain':EnteredDomain,'getDomainLinksFromGoogle':getDomainLinksFromGoogle,
        'NumberOfUrlsFound':NumberOfUrlsFound,
        'getFirstPageLinks' : getFirstPageLinks,
    })

urldata.py

    class UrlData(object):

        def __init__(self, num_of_urls, url_pattern):
            self.num_of_urls = num_of_urls
            self.url_pattern = url_pattern

        def url_list(self):
            # Returns a list of strings that represent the urls you want based on num_of_urls
            # e.g. asite.com/?search?start=10
            urls = []
            for i in xrange(self.num_of_urls):
                urls.append(self.url_pattern + '&start=' + str((i + 1) * 10) + ',')
            return urls
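
One thing to note about url_list: each string ends with a ',' which is convenient for display but would be sent as part of the address if the URL were fetched. A minimal sketch of a variant that keeps the URLs clean and joins them only for printing follows. It reuses the UrlData name for illustration, and the search URL is taken from the output shown in this question.

```python
class UrlData(object):
    """Illustrative variant of UrlData that returns clean, fetchable URLs."""

    def __init__(self, num_of_urls, url_pattern):
        self.num_of_urls = num_of_urls
        self.url_pattern = url_pattern

    def url_list(self):
        # One URL per result page (start=10, 20, ...), without a trailing comma.
        return ['%s&start=%d' % (self.url_pattern, (i + 1) * 10)
                for i in range(self.num_of_urls)]


urls = UrlData(2, 'https://www.google.com/search?q=site%3Adomain.com').url_list()
# The separator is added only when displaying the list.
print(', '.join(urls))
```

In a template the same effect could be had by emitting the separator inside the for loop instead of storing it in the data.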

Template:

    {{ getFirstPageLinks }}
    {% if url_data.num_of_urls > 0 %}
        {% for url in url_data.url_list %}
            {{ url }}
        {% endfor %}
    {% endif %}

Output:

[<cite>www.google.com/webmasters/</cite>, <cite>www.domain.com</cite>, <cite>www.domain.comblog/</cite>, <cite>www.domain.comblog/projects/</cite>, <cite>www.domain.comblog/category/internet/</cite>, <cite>www.domain.comblog/category/goals/</cite>, <cite>www.domain.comblog/category/uncategorized/</cite>, <cite>www.domain.comblog/twit/2013/01/</cite>, <cite>www.domain.comblog/category/dog-2/</cite>, <cite>www.domain.comblog/category/goals/personal/</cite>, <cite>www.domain.comblog/category/internet/tech/</cite>]

is generated by getFirstPageLinks, and

https://www.google.com/search?q=site%3Adomain.com&start=10, https://www.google.com/search?q=site%3Adomain.com&start=20,

is generated by the url_data template variable.

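The problem now is: I need to loop over each url in url_data and, for each one, get the same kind of output that getFirstPageLinks currently produces for the first page.

How can I do this?

Thanks.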
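
To make the goal concrete, here is a rough sketch of the loop I have in mind, not a working part of my view. It uses the standard library HTMLParser in place of BeautifulSoup so that it runs without third-party packages. The names CiteCollector, collect_cite_links, fetch_html and fake_fetch are hypothetical, and the canned HTML stands in for real search result pages.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7


class CiteCollector(HTMLParser):
    """Collects the text inside every <cite> element of a page."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []
        self._in_cite = False

    def handle_starttag(self, tag, attrs):
        if tag == 'cite':
            self._in_cite = True
            self.links.append('')

    def handle_endtag(self, tag):
        if tag == 'cite':
            self._in_cite = False

    def handle_data(self, data):
        # Append text to the most recently opened <cite> entry.
        if self._in_cite:
            self.links[-1] += data


def collect_cite_links(urls, fetch_html):
    """Fetch each URL via fetch_html and return all <cite> texts, in order."""
    all_links = []
    for url in urls:
        parser = CiteCollector()
        parser.feed(fetch_html(url))
        parser.close()
        all_links.extend(parser.links)
    return all_links


def fake_fetch(url):
    # Stand-in for a real HTTP request so the sketch runs offline.
    if url.endswith('start=10'):
        return '<p><cite>www.domain.com/a/</cite> <cite>www.domain.com/b/</cite></p>'
    return '<p><cite>www.domain.com/c/</cite></p>'


urls = ['https://www.google.com/search?q=site%3Adomain.com&start=10',
        'https://www.google.com/search?q=site%3Adomain.com&start=20']
links = collect_cite_links(urls, fake_fetch)
print(links)
```

In the real view, fetch_html would presumably be a small function that opens the URL with mechanizeBrowser and returns the response body, and the URLs passed in would need to be free of the trailing ',' that url_list currently appends. The combined list could then be added to the render context and looped over in the template.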

Looping over results to gather URLs

0 answers: