I'm building a web scraper that crawls several websites so I don't have to visit them directly.
Right now I'm running into a problem with duplicate URLs: the script does what I want, but the links are being printed repeatedly, which I don't want.
Here is my code:
```python
def HackerNews():
    hackerNews = ['https://www.darkreading.com/attacks-breaches.asp',
                  'https://www.darkreading.com/application-security.asp',
                  'https://www.darkreading.com/vulnerabilities-threats.asp',
                  'https://www.darkreading.com/endpoint-security.asp',
                  'https://www.darkreading.com/IoT.asp',
                  'https://www.darkreading.com/vulnerabilities-threats.asp'
                  ]
    keywords = ["bitcoin", "bit", "BTC", "Bit", "Security", "Attack", "Breach", "Cyber",
                "Ransomware", "Botnet", "Worm", "Hacked", "Hack", "Hackers", "Flaw", "Risk", "Danger"]
    for link in hackerNews:
        request = urllib2.Request(link)
        request.add_header('User-Agent', 'Mozilla 5.0')
        websitecontent = urllib2.urlopen(request).read()
        soup = BeautifulSoup(websitecontent, 'html.parser')
        headers = soup.findAll('header', {'class': 'strong medium'})
        for h in headers:
            a = h.find("a")
            for keyword in keywords:
                if keyword in a["title"]:
                    print("Title: " + a["title"] + " \nLink: " + "https://darkreading.com" + a["href"])

HackerNews()
```
Here is a sample of the output:
```
Title: Android Ransomware Kits on the Rise in the Dark Web
Link: https://darkreading.com/mobile/android-ransomware-kits-on-the-rise-in-the-dark-web-/d/d-id/1330591
Title: Bitcoin Miner NiceHash Hacked, Possibly Losing $62 Million in Bitcoin
Link: https://darkreading.com/cloud/bitcoin-miner-nicehash-hacked-possibly-losing-$62-million-in-bitcoin/d/d-id/1330585
Title: Bitcoin Miner NiceHash Hacked, Possibly Losing $62 Million in Bitcoin
Link: https://darkreading.com/cloud/bitcoin-miner-nicehash-hacked-possibly-losing-$62-million-in-bitcoin/d/d-id/1330585
Title: Bitcoin Miner NiceHash Hacked, Possibly Losing $62 Million in Bitcoin
Link: https://darkreading.com/cloud/bitcoin-miner-nicehash-hacked-possibly-losing-$62-million-in-bitcoin/d/d-id/1330585
Title: Uber Used $100K Bug Bounty to Pay, Silence Florida Hacker: Report
Link: https://darkreading.com/attacks-breaches/uber-used-$100k-bug-bounty-to-pay-silence-florida-hacker-report/d/d-id/1330584
```
Answer (score: 1)
Well, instead of printing directly you could build a dictionary containing all the links, or keep them in a list if you prefer. Before appending, you can check whether the entry is already in the list.
```python
def HackerNews():
    hackerNews = ['https://www.darkreading.com/attacks-breaches.asp',
                  'https://www.darkreading.com/application-security.asp',
                  'https://www.darkreading.com/vulnerabilities-threats.asp',
                  'https://www.darkreading.com/endpoint-security.asp',
                  'https://www.darkreading.com/IoT.asp',
                  'https://www.darkreading.com/vulnerabilities-threats.asp'
                  ]
    keywords = ["bitcoin", "bit", "BTC", "Bit", "Security", "Attack", "Breach", "Cyber",
                "Ransomware", "Botnet", "Worm", "Hacked", "Hack", "Hackers", "Flaw", "Risk", "Danger"]
    output = []
    for link in hackerNews:
        request = urllib2.Request(link)
        request.add_header('User-Agent', 'Mozilla 5.0')
        websitecontent = urllib2.urlopen(request).read()
        soup = BeautifulSoup(websitecontent, 'html.parser')
        headers = soup.findAll('header', {'class': 'strong medium'})
        for h in headers:
            a = h.find("a")
            for keyword in keywords:
                if keyword in a["title"]:
                    if (a["title"], a["href"]) not in output:
                        output.append((a["title"], a["href"]))
    for link in output:
        print("Title: " + link[0] + " \nLink: " + "https://darkreading.com" + link[1])

HackerNews()
```
I haven't fixed your indentation problems or tested it, but it should get my point across :)
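It's also worth noting why the duplicates appear in the first place: the inner keyword loop prints once per matching keyword, so a title like the NiceHash one (which contains "Bit", "Hacked", and "Hack") prints three times. A minimal sketch of collapsing that loop with `any()`, using illustrative sample titles and a shortened keyword list:

```python
keywords = ["bitcoin", "Bit", "Hacked", "Hack", "Ransomware"]
titles = [
    "Bitcoin Miner NiceHash Hacked, Possibly Losing $62 Million in Bitcoin",
    "Android Ransomware Kits on the Rise in the Dark Web",
]

matches = []
for title in titles:
    # any() is True as soon as one keyword matches, so each title is
    # appended at most once, no matter how many keywords it contains.
    # (The first title above matches "Bit", "Hacked", and "Hack", but
    # still produces a single entry.)
    if any(keyword in title for keyword in keywords):
        matches.append(title)

print(matches)
```

With the original nested loop, the first title would have been emitted three times; with `any()` each title appears once.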
Edit: working version for Python 3:
```python
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

def HackerNews():
    hackerNews = ['https://www.darkreading.com/attacks-breaches.asp',
                  'https://www.darkreading.com/application-security.asp',
                  'https://www.darkreading.com/vulnerabilities-threats.asp',
                  'https://www.darkreading.com/endpoint-security.asp',
                  'https://www.darkreading.com/IoT.asp',
                  'https://www.darkreading.com/vulnerabilities-threats.asp'
                  ]
    keywords = ["bitcoin", "bit", "BTC", "Bit", "Security", "Attack", "Breach", "Cyber",
                "Ransomware", "Botnet", "Worm", "Hacked", "Hack", "Hackers", "Flaw", "Risk", "Danger"]
    output = []
    for link in hackerNews:
        request = Request(link)
        request.add_header('User-Agent', 'Mozilla 5.0')
        websitecontent = urlopen(request).read()
        soup = BeautifulSoup(websitecontent, 'html.parser')
        headers = soup.findAll('header', {'class': 'strong medium'})
        for h in headers:
            a = h.find("a")
            for keyword in keywords:
                if keyword in a["title"]:
                    if (a["title"], a["href"]) not in output:
                        output.append((a["title"], a["href"]))
    for link in output:
        print("Title: " + link[0] + " \nLink: " + "https://darkreading.com" + link[1])

HackerNews()
```
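One side note on the `not in output` check: membership tests on a list are O(n), so with many scraped links a set is the usual choice for the lookup while a list preserves insertion order. A minimal sketch with hypothetical sample data:

```python
seen = set()
links = [
    ("NiceHash Hacked", "/cloud/nicehash/d/d-id/1330585"),
    ("NiceHash Hacked", "/cloud/nicehash/d/d-id/1330585"),  # duplicate
    ("Android Ransomware Kits", "/mobile/android/d/d-id/1330591"),
]

unique = []
for item in links:
    if item not in seen:      # O(1) average-case lookup in a set
        seen.add(item)
        unique.append(item)   # list preserves first-seen order

print(len(unique))  # 2
```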