I'm trying to parse HTML results, grab a few URLs, and then parse the output of visiting those URLs.
I'm using Django 1.5 / Python 2.7:
views.py
#mechanize/beautifulsoup config options here.
beautifulSoupObj = BeautifulSoup(mechanizeBrowser.response().read()) #read the raw response
getFirstPageLinks = beautifulSoupObj.find_all('cite') #get first page of urls
url_data = UrlData(NumberOfUrlsFound, getDomainLinksFromGoogle)
#url_data = UrlData(5, 'myapp.com')
#return HttpResponse(MaxUrlsToGather)
print url_data.url_list()
return render(request, 'myapp/scan/process_scan.html', {
    'url_data': url_data,
    'EnteredDomain': EnteredDomain,
    'getDomainLinksFromGoogle': getDomainLinksFromGoogle,
    'NumberOfUrlsFound': NumberOfUrlsFound,
    'getFirstPageLinks': getFirstPageLinks,
})
urldata.py
class UrlData(object):
    def __init__(self, num_of_urls, url_pattern):
        self.num_of_urls = num_of_urls
        self.url_pattern = url_pattern

    def url_list(self):
        # Returns a list of strings that represent the urls you want based on num_of_urls
        # e.g. asite.com/?search?start=10
        urls = []
        for i in xrange(self.num_of_urls):
            urls.append(self.url_pattern + '&start=' + str((i + 1) * 10) + ',')
        return urls
Template:
{{ getFirstPageLinks }}
{% if url_data.num_of_urls > 0 %}
{% for url in url_data.url_list %}
{{ url }}
{% endfor %}
{% endif %}
Output:
[<cite>www.google.com/webmasters/</cite>, <cite>www.domain.com</cite>, <cite>www.domain.comblog/</cite>, <cite>www.domain.comblog/projects/</cite>, <cite>www.domain.comblog/category/internet/</cite>, <cite>www.domain.comblog/category/goals/</cite>, <cite>www.domain.comblog/category/uncategorized/</cite>, <cite>www.domain.comblog/twit/2013/01/</cite>, <cite>www.domain.comblog/category/dog-2/</cite>, <cite>www.domain.comblog/category/goals/personal/</cite>, <cite>www.domain.comblog/category/internet/tech/</cite>]
from getFirstPageLinks, and
https://www.google.com/search?q=site%3Adomain.com&start=10, https://www.google.com/search?q=site%3Adomain.com&start=20,
from the url_data template variable.
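The problem now: I need to loop over each URL in url_data and, for each one, get the same kind of output that getFirstPageLinks currently produces for the first page.
How can I do this?
Thanks.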
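
To make the goal concrete, here is a rough sketch of the loop I think I need, assuming the fetching is done in the view rather than in the template. The names collect_links, extract_cites, CiteCollector and fetch_html are made up for illustration. The standard-library HTML parser is used here only so the snippet runs on its own; in the real view, BeautifulSoup's find_all('cite') would play the role of extract_cites.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7


class CiteCollector(HTMLParser):
    """Collects the text inside every <cite> element of a page."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.cites = []
        self._depth = 0  # > 0 while the parser is inside a <cite> element

    def handle_starttag(self, tag, attrs):
        if tag == 'cite':
            self._depth += 1
            self.cites.append('')

    def handle_endtag(self, tag):
        if tag == 'cite' and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.cites[-1] += data


def extract_cites(html):
    """Stand-in for BeautifulSoup(html).find_all('cite'); returns text only."""
    collector = CiteCollector()
    collector.feed(html)
    collector.close()
    return collector.cites


def collect_links(urls, fetch_html):
    """Fetch every URL and gather the <cite> texts from all pages in order.

    fetch_html is a hypothetical callable taking a URL and returning the
    page's HTML as a string (in the view it would wrap the mechanize browser).
    """
    links = []
    for url in urls:
        # url_list() appends a trailing comma to each URL, so strip it first.
        links.extend(extract_cites(fetch_html(url.rstrip(','))))
    return links
```

In the view this might be called as collect_links(url_data.url_list(), lambda u: mechanizeBrowser.open(u).read()), with the resulting list passed to the template as one more context variable and rendered with a plain {% for %} loop. I have not verified this against Google's actual result pages.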