I am trying to extract Google search results using Beautiful Soup, but the extracted URLs come back in this form:
/url?q=https://www.facebook.com/PMOIndia/&sa=U&ved=2ahUKEwiU89Xr_MjwAhUAHLkGHfl3AFI4KBAWMAF6BAgIEAE&usg=AOvVaw3WXSVzoiXCQOliyGZxjkSd
I only want the "https://www.facebook.com/PMOIndia/" part of the URL.
The code I am using is:
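import requests
from bs4 import BeautifulSoup
from random import randint
from time import sleep

# query (search terms), page (result offset) and urls (a list) are defined earlier
search_url = "https://www.google.com/search?q="+str(query)+"&sxsrf=ALeKk01EudGSzSmaU8dDy9kgRgdOqE_UMQ:1620987283855&ei=k02eYLW6M7ud4-EPvNyM0Ag&start="+str(page)+"&sa=N&ved=2ahUKEwj1z_SZ-MjwAhW7zjgGHTwuA4oQ8tMDegQIARA3&biw=1536&bih=722"
response = requests.get(search_url)
sleep(randint(2, 10))  # random pause between requests
soup = BeautifulSoup(response.text, 'html.parser')
for path in soup.find_all('div', attrs={'class': 'kCrYT'}):
    x = path.find('a')
    try:
        urls.append(x.get('href'))
    except AttributeError:  # this div contains no <a> tag
        pass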
Answer 0 (score: 0)
Try:
url = "/url?q=https://www.facebook.com/PMOIndia/&sa=U&ved=2ahUKEwiU89Xr_MjwAhUAHLkGHfl3AFI4KBAWMAF6BAgIEAE&usg=AOvVaw3WXSVzoiXCQOliyGZxjkSd"
# Drop the 7-character "/url?q=" prefix, then keep the scheme, host and first path segment
new_url = "/".join(url[7:].split("/",4)[:4])+"/"
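Note that the slicing above relies on the target URL having exactly one path segment. If the links vary in depth, reading the q query parameter with the standard library's urllib.parse may be more robust. This is only a minimal sketch: the helper name extract_target is made up for illustration, and it assumes every redirect link has the path /url and carries the target in q, as in the example href.

```python
from urllib.parse import urlparse, parse_qs

def extract_target(href):
    """Return the q parameter of a Google redirect href, or None if absent.

    extract_target is a hypothetical helper name, not part of any library.
    """
    parsed = urlparse(href)
    # Assumption: redirect links look like /url?q=<target>&sa=...
    if parsed.path != "/url":
        return None
    values = parse_qs(parsed.query).get("q")
    return values[0] if values else None

href = "/url?q=https://www.facebook.com/PMOIndia/&sa=U&ved=2ahUKEwiU89Xr_MjwAhUAHLkGHfl3AFI4KBAWMAF6BAgIEAE&usg=AOvVaw3WXSVzoiXCQOliyGZxjkSd"
print(extract_target(href))
```

In the loop from the question, this could be applied to each x.get('href') before appending, skipping any result that comes back as None.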