Question

I am trying to extract Google search results using Beautiful Soup, but the extracted URLs come back in this form:

/url?q=https://www.facebook.com/PMOIndia/&sa=U&ved=2ahUKEwiU89Xr_MjwAhUAHLkGHfl3AFI4KBAWMAF6BAgIEAE&usg=AOvVaw3WXSVzoiXCQOliyGZxjkSd

I only want the "https://www.facebook.com/PMOIndia/" part of the URL.

The code I am using is:

    from random import randint
    from time import sleep

    import requests
    from bs4 import BeautifulSoup

    # query, page (the result offset) and urls (a list) are defined earlier
    page="https://www.google.com/search?q="+str(query)+"&sxsrf=ALeKk01EudGSzSmaU8dDy9kgRgdOqE_UMQ:1620987283855&ei=k02eYLW6M7ud4-EPvNyM0Ag&start="+str(page)+"&sa=N&ved=2ahUKEwj1z_SZ-MjwAhW7zjgGHTwuA4oQ8tMDegQIARA3&biw=1536&bih=722"
    driver = requests.get(page)
    sleep(randint(2,10))
    soup = BeautifulSoup(driver.text, 'html.parser')
    for path in soup.findAll('div', attrs={'class':'kCrYT'}):
        x = path.find('a')
        try:
            urls.append(x.get('href'))
        except AttributeError:
            # no <a> inside this div, so x is None
            pass

Answer 1

Try:

    url = "/url?q=https://www.facebook.com/PMOIndia/&sa=U&ved=2ahUKEwiU89Xr_MjwAhUAHLkGHfl3AFI4KBAWMAF6BAgIEAE&usg=AOvVaw3WXSVzoiXCQOliyGZxjkSd"
    # url[7:] drops the leading "/url?q="; the split keeps scheme, host and the first path segment
    new_url = "/".join(url[7:].split("/",4)[:4])+"/"
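
Note that the slicing above relies on the target URL having exactly one path segment. As a possible alternative, here is a minimal sketch that reads the `q` query parameter with the standard library's `urllib.parse` instead. The helper name `extract_target` is made up for illustration, and it assumes the links keep the `/url?q=...` shape shown in the question:

```python
from urllib.parse import urlparse, parse_qs

def extract_target(href):
    """Return the value of the q parameter of a /url?q=... link, or None."""
    parsed = urlparse(href)
    if parsed.path != "/url":
        # Not a redirect-style link, so there is nothing to unwrap.
        return None
    values = parse_qs(parsed.query).get("q")
    return values[0] if values else None

href = "/url?q=https://www.facebook.com/PMOIndia/&sa=U&ved=2ahUKEwiU89Xr_MjwAhUAHLkGHfl3AFI4KBAWMAF6BAgIEAE&usg=AOvVaw3WXSVzoiXCQOliyGZxjkSd"
print(extract_target(href))
```

In the question's loop this would replace `urls.append(x.get('href'))` with something like `urls.append(extract_target(x.get('href')))`, skipping any `None` results.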
