Here is part of my HTML source:
<div class="s">
<div class="th N3nEGc" style="height:48px;width:61px">
<a href="/imgres?imgurl=https://linuxhint.com/wpcontent/uploads/2018/12/11.jpg&imgrefurl=https://linuxhint.com/setup_screensaver_manjaro_linux/&h=912&w=1140&tbnid=10DzCgmImE0jM&tbnh=201&tbnw=251&usg=K_YJsquLr4rorhW2ks8UdceQ8uKjg=&docid=0vImrzSjsr5zQM"
data-ved="2ahUKEwj3062g3pDjAhWZQN4KHS-_BL8Q8g0wC3oECAUQBQ"
ping="/urlsa=t&source=web&rct=j&url=/imgres%3Fimgurl%3Dhttps://linuxhint.com/wpcontent/uploads/2018/12/11.jpg%26imgrefurl%3Dhttps://linuxhint.com/setup_screensaver_manjaro_linux/%26h%3D912%26w%3D1140%26tbnid%3D10DzCgmImE0jM%26tbnh%3D201%26tbnw%3D251%26usg%3DK_YJsquLr4rorhW2ks8UdceQ8uKjg%3D%26docid%3D0vImrzSjsr5zQM&ved=2ahUKEwj3062g3pDjAhWZQN4KHS-_BL8Q8g0wC3oECAUQBQ">
</a>
</div>
</div>
What I want to extract is the link:
<a href="/imgres?imgurl=https://linuxhint.com/wpcontent/uploads/2018/12/11.jpg&
so the output would be:
https://linuxhint.com/wpcontent/uploads/2018/12/11.jpg
What I have tried in Python:
sourceCode = opener.open(googlePath).read().decode('utf-8')
links = re.findall('href="/imgres?imgurl=(.*?)jpg&imgrefurl="',sourceCode)
for i in links:
    print(i)
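Answer 0 (score: 2)
A better approach than parsing the query string with a regex is to use the parse_qs function (it is safer, and you get what you want without fiddling with regex) (doc):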
data = '''<div class="s"><div class="th N3nEGc" style="height:48px;width:61px"><a href="/imgres?imgurl=https://linuxhint.com/wpcontent/uploads/2018/12/11.jpg&imgrefurl=https://linuxhint.com/setup_screensaver_manjaro_linux/&h=912&w=1140&tbnid=10DzCgmImE0jM&tbnh=201&tbnw=251&usg=K_YJsquLr4rorhW2ks8UdceQ8uKjg=&docid=0vImrzSjsr5zQM" data-ved="2ahUKEwj3062g3pDjAhWZQN4KHS-_BL8Q8g0wC3oECAUQBQ" ping="/urlsa=t&source=web&rct=j&url=/imgres%3Fimgurl%3Dhttps://linuxhint.com/wpcontent/uploads/2018/12/11.jpg%26imgrefurl%3Dhttps://linuxhint.com/setup_screensaver_manjaro_linux/%26h%3D912%26w%3D1140%26tbnid%3D10DzCgmImE0jM%26tbnh%3D201%26tbnw%3D251%26usg%3DK_YJsquLr4rorhW2ks8UdceQ8uKjg%3D%26docid%3D0vImrzSjsr5zQM&ved=2ahUKEwj3062g3pDjAhWZQN4KHS-_BL8Q8g0wC3oECAUQBQ">'''
from bs4 import BeautifulSoup
from urllib.parse import urlparse, parse_qs
soup = BeautifulSoup(data, 'lxml')
d = urlparse(soup.select_one('a[href*="imgurl"]')['href'])
q = parse_qs(d.query)
print(q['imgurl'])
Prints:
['https://linuxhint.com/wpcontent/uploads/2018/12/11.jpg']
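The same idea can be applied to every link on a page without third-party packages. The following is a minimal sketch, assuming only the standard library is available; the class name ImgUrlCollector, the helper extract_img_urls, and the shortened sample string are hypothetical and not part of the original answer.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse, parse_qs


class ImgUrlCollector(HTMLParser):
    """Collects the imgurl query parameter from every <a href> it sees."""

    def __init__(self):
        super().__init__()
        self.img_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # parse_qs maps each query key to a list of values
        query = parse_qs(urlparse(href).query)
        self.img_urls.extend(query.get("imgurl", []))


def extract_img_urls(html_text):
    """Return every imgurl value found in the anchors of html_text."""
    parser = ImgUrlCollector()
    parser.feed(html_text)
    return parser.img_urls


# Shortened copy of the href from the question (assumed sample input)
sample = (
    '<a href="/imgres?imgurl=https://linuxhint.com/wpcontent/uploads/2018/12/11.jpg'
    '&imgrefurl=https://linuxhint.com/setup_screensaver_manjaro_linux/&h=912&w=1140"></a>'
)
print(extract_img_urls(sample))
```

Anchors without an imgurl parameter contribute nothing to the result, so the function can be fed the whole page source.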
Answer 1 (score: 0)
If the problem is with your regular expression, then I think you could try something like this:
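import re
# Without re.MULTILINE, '^' matches only at the very start of sourceCode
link = re.search(r'^https?://.*[\r\n]*[^.,:;]', sourceCode)
if link:  # re.search returns None when nothing matches
    print(link.group())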
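Answer 2 (score: 0)
Maybe you should escape the '?'. Unescaped, it makes the preceding 's' optional instead of matching a literal question mark. The pattern also must not end in '="', because in the source 'imgrefurl=' is followed directly by the URL, and ending the capture group at '&imgrefurl=' keeps the '.jpg' in the result. Try this:
links = re.findall(r'href="/imgres\?imgurl=(.*?)&imgrefurl=', sourceCode)
for i in links:
    print(i)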
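To see what escaping the '?' changes, here is a small check against a shortened copy of the href from the question. It is only a sketch: the sample string and the variable names are assumptions, and both patterns end at '&imgrefurl=' so that the captured text is the full image URL.

```python
import re

# Shortened copy of the href from the question (assumed sample input)
sample = (
    '<a href="/imgres?imgurl=https://linuxhint.com/wpcontent/uploads/2018/12/11.jpg'
    '&imgrefurl=https://linuxhint.com/setup_screensaver_manjaro_linux/&h=912">'
)

# Unescaped '?' makes the preceding 's' optional, so the literal '?'
# in the href is never matched and nothing is found.
unescaped = re.findall(r'href="/imgres?imgurl=(.*?)&imgrefurl=', sample)

# Escaped '\?' matches the literal question mark.
escaped = re.findall(r'href="/imgres\?imgurl=(.*?)&imgrefurl=', sample)

print(unescaped)
print(escaped)
```

The first list comes back empty and the second holds the image URL, which is consistent with the original attempt printing nothing.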