So I made a web scraper with Python and a few of its libraries... It goes to a given site and collects all the links and link texts from that site. I have already filtered the results so that only the external links found on that site get printed.
The code looks like this:
import urllib
import re
import mechanize
from bs4 import BeautifulSoup
import urlparse
import cookielib
from urlparse import urlsplit
from publicsuffix import PublicSuffixList
link = "http://www.ananda-pur.de/23.html"
newesturlDict = {}
baseAdrInsArray = []
br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
br.set_handle_robots(False)
br.set_handle_equiv(False)
br.set_handle_redirect(True)
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
page = br.open(link, timeout=10)
for linkins in br.links():
    newesturl = urlparse.urljoin(linkins.base_url, linkins.url)
    linkTxt = linkins.text
    baseAdrIns = linkins.base_url
    if baseAdrIns not in baseAdrInsArray:
        baseAdrInsArray.append(baseAdrIns)
    netLocation = urlsplit(baseAdrIns)
    psl = PublicSuffixList()
    publicAddress = psl.get_public_suffix(netLocation.netloc)
    if publicAddress not in newesturl:
        if newesturl not in newesturlDict:
            newesturlDict[newesturl, linkTxt] = 1
        if newesturl in newesturlDict:
            newesturlDict[newesturl, linkTxt] += 1

newesturlCount = sorted(newesturlDict.items(), key=lambda (k, v): (v, k), reverse=True)
for newesturlC in newesturlCount:
    print baseAdrInsArray[0], " - ", newesturlC[0], "- count: ", newesturlC[1]
and it prints out results like this:
http://www.ananda-pur.de/23.html - ('http://www.yogibhajan.com/', 'http://www.yogibhajan.com') - count: 1
http://www.ananda-pur.de/23.html - ('http://www.kundalini-yoga-zentrum-berlin.de/', 'http://www.kundalini-yoga-zentrum-berlin.de') - count: 1
http://www.ananda-pur.de/23.html - ('http://www.kriteachings.org/', 'http://www.sat-nam-rasayan.de') - count: 1
http://www.ananda-pur.de/23.html - ('http://www.kriteachings.org/', 'http://www.kriteachings.org') - count: 1
http://www.ananda-pur.de/23.html - ('http://www.kriteachings.org/', 'http://www.gurudevsnr.com') - count: 1
http://www.ananda-pur.de/23.html - ('http://www.kriteachings.org/', 'http://www.3ho.de') - count: 1
My problem is the links that have different texts. According to the printed example, the given site has 4 links to http://www.kriteachings.org/, but as you can see each of those 4 links has a different text: the first is http://www.sat-nam-rasayan.de, the second is http://www.kriteachings.org, the third is http://www.gurudevsnr.com and the fourth is http://www.3ho.de.
I would like to get a printout where I can see how many times a link appears on the given page, and if there are different link texts, they are simply appended to that same link. For this example, I would like to get a printout like this:
http://www.ananda-pur.de/23.html - http://www.yogibhajan.com/ - http://www.yogibhajan.com - count: 1
http://www.ananda-pur.de/23.html - http://www.kundalini-yoga-zentrum-berlin.de - http://www.kundalini-yoga-zentrum-berlin.de - count: 1
http://www.ananda-pur.de/23.html - http://www.kriteachings.org/ - http://www.sat-nam-rasayan.de, http://www.kriteachings.org, http://www.gurudevsnr.com, http://www.3ho.de - count: 4
Explanation:
(the first link is the given page, the second is the found link, the third item is the actual text(s) of that found link, and the fourth item is how many times that link appears on the given site)
My main problem is that I don't know how to compare, sort, or tell the program that it is the same link and that it should append the different texts.
Is something like this possible without too much code? I'm a Python noob, so I'm a bit lost..
Any help or advice is welcome.
Answer 0 (score: 1)
Collect the links into a dictionary, gathering the link texts and keeping a count as you go:
import cookielib
import mechanize
base_url = "http://www.ananda-pur.de/23.html"
br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
br.set_handle_robots(False)
br.set_handle_equiv(False)
br.set_handle_redirect(True)
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
br.addheaders = [('User-agent',
                  'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]

page = br.open(base_url, timeout=10)

links = {}
for link in br.links():
    if link.url not in links:
        links[link.url] = {'count': 1, 'texts': [link.text]}
    else:
        links[link.url]['count'] += 1
        links[link.url]['texts'].append(link.text)

# printing
for link, data in links.iteritems():
    print "%s - %s - %s - %d" % (base_url, link, ",".join(data['texts']), data['count'])
which prints:
http://www.ananda-pur.de/23.html - index.html - Zadekstr 11,12351 Berlin, - 2
http://www.ananda-pur.de/23.html - 28.html - Das Team - 1
http://www.ananda-pur.de/23.html - http://www.yogibhajan.com/ - http://www.yogibhajan.com - 1
http://www.ananda-pur.de/23.html - 24.html - Kontakt - 1
http://www.ananda-pur.de/23.html - 25.html - Impressum - 1
http://www.ananda-pur.de/23.html - http://www.kriteachings.org/ - http://www.kriteachings.org,http://www.gurudevsnr.com,http://www.sat-nam-rasayan.de,http://www.3ho.de - 4
http://www.ananda-pur.de/23.html - http://www.kundalini-yoga-zentrum-berlin.de/ - http://www.kundalini-yoga-zentrum-berlin.de - 1
http://www.ananda-pur.de/23.html - 3.html - Ergo Oranien 155 - 1
http://www.ananda-pur.de/23.html - 2.html - Physio Bänsch 36 - 1
http://www.ananda-pur.de/23.html - 13.html - Stellenangebote - 1
http://www.ananda-pur.de/23.html - 23.html - Links - 1
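The question also asked for external links only and for the exact "count:" output format. Below is a minimal sketch (Python 2, untested against the live site) of one way to get that: it puts the PublicSuffixList substring test from the question's own code on top of the dictionary approach above. The own_domain name and the use of setdefault are my additions, not part of either snippet above.

import cookielib
import mechanize
import urlparse
from urlparse import urlsplit
from publicsuffix import PublicSuffixList

base_url = "http://www.ananda-pur.de/23.html"

br = mechanize.Browser()
br.set_cookiejar(cookielib.LWPCookieJar())
br.set_handle_robots(False)
page = br.open(base_url, timeout=10)

psl = PublicSuffixList()
links = {}
for link in br.links():
    # resolve relative links ("index.html", "28.html", ...) to absolute URLs
    absolute = urlparse.urljoin(link.base_url, link.url)
    # public suffix of the scraped page itself, e.g. "ananda-pur.de"
    own_domain = psl.get_public_suffix(urlsplit(link.base_url).netloc)
    if own_domain in absolute:
        continue  # internal link; same substring test as in the question
    entry = links.setdefault(absolute, {'count': 0, 'texts': []})
    entry['count'] += 1
    if link.text not in entry['texts']:
        entry['texts'].append(link.text)

# most frequent links first
for url, data in sorted(links.items(), key=lambda (k, v): v['count'], reverse=True):
    print "%s - %s - %s - count: %d" % (base_url, url, ", ".join(data['texts']), data['count'])

The key design point in both versions is the same: the dictionary is keyed by the URL alone, instead of the (newesturl, linkTxt) tuple from the question, so repeated links collapse into a single entry while their different texts accumulate in the list.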