I'm trying to collect the links and link text from a Google search (first 10 results only). Here is my code:
import requests
from lxml import html
import time
import re
headers={'User-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/18.17763'}
sentence = "hello world"
url = 'https://google.com/search?q={}'.format(sentence)
res= requests.get(url, headers=headers)
tree= html.fromstring(res.text)
li = tree.xpath("//a[@href]")
y = [link for link in li if link.get('href').startswith(("https://", "http://")) if "google" not in link.get('href')][:10]
for i in y:
    print("{}:\t{}".format(i.text_content(), i.get('href')))
This is the output:
10
1:56hello world: https://www.youtube.com/watch?v=Yw6u6YkTgQ4
4:23BUMP OF CHICKEN「Hello,world!」: https://www.youtube.com/watch?v=rOU4YiuaxAM
5:24Lady Antebellum - Hello World: https://www.youtube.com/watch?v=al2DFQEZl4M
"Hello, World!" program - Wikipediahttps://en.wikipedia.org/wiki/%22Hello,_World!%22_program: https://en.wikipedia.org/wiki/%22Hello,_World!%22_program
Hello World (disambiguation): https://en.wikipedia.org/wiki/Hello_World_(disambiguation)
Sanity check: https://en.wikipedia.org/wiki/Sanity_check
Just another Perl hacker: https://en.wikipedia.org/wiki/Just_another_Perl_hacker
Hello, World! - Learn Python - Free Interactive Python Tutorialhttps://www.learnpython.org/en/Hello,_World!: https://www.learnpython.org/en/Hello,_World!
Hello World Kids: HWKhelloworldkids.org/: http://helloworldkids.org/
About Us: http://helloworldkids.org/about-us/
The list is correct, but sometimes I get duplicate links when I print it. How can I remove the duplicate links from the output?
Answer 0 (score: 0)
You can use the code below. I made a few changes to your code so that each link is printed only once:
import requests
from lxml import html
import time
import re
headers={'User-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/18.17763'}
sentence = "hello world"
url = 'https://google.com/search?q={}'.format(sentence)
res= requests.get(url, headers=headers)
tree= html.fromstring(res.text)
li = tree.xpath("//a[@href]")
# Keep absolute http(s) links that do not point back to Google, first 10 only
y = [link for link in li if link.get('href').startswith(("https://", "http://")) if "google" not in link.get('href')][:10]
links=[]
for i in y:
    #print("{}:\t{}".format(i.text_content(), i.get('href')))
    # Append each href only the first time it is seen
    if i.get('href') not in links:
        links.append(i.get('href'))
for l in links:
    print(l)
The list "links" will then contain only distinct links.
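Two things about the code above may be worth adjusting: it slices to 10 before removing duplicates, so it can return fewer than 10 links, and it drops the link text the question asked for. The following is a minimal sketch of one alternative, not the only way to do it. It assumes the results are available as (text, href) pairs; the helper name unique_links and the sample data are hypothetical and only stand in for (i.text_content(), i.get('href')).

```python
def unique_links(pairs, limit=10):
    """Return up to `limit` (text, href) pairs, keeping the first occurrence of each href."""
    seen = {}
    for text, href in pairs:
        # setdefault stores the pair only the first time this href appears
        seen.setdefault(href, (text, href))
    # dicts preserve insertion order (Python 3.7+), so the original ranking is kept
    return list(seen.values())[:limit]


# Hypothetical sample data standing in for (i.text_content(), i.get('href'))
sample = [
    ("Hello World (disambiguation)", "https://en.wikipedia.org/wiki/Hello_World_(disambiguation)"),
    ("Sanity check", "https://en.wikipedia.org/wiki/Sanity_check"),
    ("Hello World again", "https://en.wikipedia.org/wiki/Hello_World_(disambiguation)"),
    ("About Us", "http://helloworldkids.org/about-us/"),
]

for text, href in unique_links(sample):
    print("{}:\t{}".format(text, href))
```

With the scraping code from the question, the pairs could be built from the full filtered list (without the [:10] slice) and passed to this helper, so the limit is applied after duplicates are removed. Looking up hrefs in a dict also avoids the repeated list scan of "not in links", although for 10 results the difference is negligible.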