Parsing URLs with BeautifulSoup

Time: 2017-08-02 18:03:53

Tags: python url beautifulsoup

import csv
import re

import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.google.com/search?q=cars")
soup = BeautifulSoup(page.content, "lxml")

# Python 2.7: the csv module expects the file to be opened in binary mode
with open('aaa.csv', 'wb') as myfile:
    wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
    # Keep only result links of the form /url?q=http...
    for link in soup.find_all("a", href=re.compile(r"(?<=/url\?q=)(htt.*://.*)")):
        a = re.split(":(?=http)", link["href"].replace("/url?q=", ""))
        wr.writerow(a)

This code produces a CSV file containing 28 URLs, but the URLs are not correct. For example, this is one of the incorrect URLs:

http://www.imdb.com/title/tt0317219/&sa=U&ved=0ahUKEwjg5fahi7nVAhWdHsAKHSQaCekQFgg9MAk&usg=AFQjCNFu_Vg9v1oVhEtR-vKqCJsR2YGd2A

Instead, it should be:

http://www.imdb.com/title/tt0317219/

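How can I remove the second part of each URL when it contains "&sa="? Everything from "&sa=" onward should be removed, so that every URL is saved in the same form as the second URL above.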

I am using Python 2.7 on Ubuntu 16.04.

2 answers:

Answer 0: (score: 4)

If the unwanted part of the URL always begins with &, you can apply split() to each URL:

url = 'http://www.imdb.com/title/tt0317219/&sa=U&ved=0ahUKEwjg5fahi7nVAhWdHsAKHSQaCekQFgg9MAk&usg=AFQjCNFu_Vg9v1oVhEtR-vKqCJsR2YGd2A'
url = url.split('&')[0]
print(url)

Output:

http://www.imdb.com/title/tt0317219/
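
The split('&') approach depends on the extra parameters always starting at the first &. As an alternative, here is a minimal sketch that treats the href as a URL and reads its q query parameter with the standard library. It assumes Python 3 (on Python 2.7 the same functions live in the urlparse module), and the helper name extract_target is hypothetical, not something from the question. Unlike split(), parse_qs also percent-decodes the value.

```python
from urllib.parse import urlparse, parse_qs


def extract_target(href):
    """Return the target URL from a '/url?q=...' href, or None if absent.

    extract_target is a hypothetical helper name used for illustration.
    """
    parts = urlparse(href)
    if parts.path != '/url':
        return None
    # parse_qs splits the query on '&' and percent-decodes each value
    values = parse_qs(parts.query).get('q')
    return values[0] if values else None


href = '/url?q=http://www.imdb.com/title/tt0317219/&sa=U&ved=0ahUKEwjg5fahi7nVAhWdHsAKHSQaCekQFgg9MAk&usg=AFQjCNFu_Vg9v1oVhEtR-vKqCJsR2YGd2A'
print(extract_target(href))
```

With the href from the question this prints http://www.imdb.com/title/tt0317219/, and an href that is not a /url redirect yields None so it can be skipped.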

Answer 1: (score: 1)

Not the best approach, but you can split once more by adding one line after the assignment to a:

a=[a[0].split("&")[0]]
print(a)

Result:

['https://de.wikipedia.org/wiki/Cars_(Film)']
['http://webcache.googleusercontent.com/search%3Fq%3Dcache:I2SHYtLktRcJ']
['https://de.wikipedia.org/wiki/Cars_(Film)%23Handlung']
['https://de.wikipedia.org/wiki/Cars_(Film)%23Synchronisation']
['https://de.wikipedia.org/wiki/Cars_(Film)%23Soundtrack']
['https://de.wikipedia.org/wiki/Cars_(Film)%23Kritik']
['https://www.mytoys.de/disney-cars/']
['http://webcache.googleusercontent.com/search%3Fq%3Dcache:9Ohx4TRS8KAJ']
['https://www.youtube.com/watch%3Fv%3DtNmo09Q3F8s']
['https://www.youtube.com/watch%3Fv%3DtNmo09Q3F8s']
['https://www.youtube.com/watch%3Fv%3DkLAnVd5y7M4']
['https://www.youtube.com/watch%3Fv%3DkLAnVd5y7M4']
['http://cars.disney.com/']
['http://webcache.googleusercontent.com/search%3Fq%3Dcache:1BoR6M9fXwcJ']
['http://cars.disney.com/']
['http://cars.disney.com/']
['https://www.whichcar.com.au/car-style/12-cartoon-cars']
['https://www.youtube.com/watch%3Fv%3D6JSMAbeUS-4']
['http://filme.disney.de/cars-3-evolution']
['http://webcache.googleusercontent.com/search%3Fq%3Dcache:fO7ypFFDGk0J']
['http://www.4players.de/4players.php/spielinfonews/Allgemein/36859/2169193/Project_CARS_2-Zehn_Ferraris_erweitern_den_virtuellen_Fuhrpark.html']
['http://www.4players.de/4players.php/spielinfonews/Allgemein/36859/2169193/Project_CARS_2-Zehn_Ferraris_erweitern_den_virtuellen_Fuhrpark.html']
['http://www.play3.de/2017/08/02/project-cars-2-6/']
['http://www.imdb.com/title/tt0317219/']
['http://webcache.googleusercontent.com/search%3Fq%3Dcache:-xdXy-yX2fMJ']
['http://www.carmagazine.co.uk/']
['http://webcache.googleusercontent.com/search%3Fq%3Dcache:PRPbHf_kD9AJ']
['http://google.com/search%3Ftbm%3Disch%26q%3DCars']
['http://www.imdb.com/title/tt0317219/']
['https://de.wikipedia.org/wiki/Cars_(Film)']
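
The result above still contains repeated entries and percent-encoded characters such as %3F and %3D. If that matters, one possible follow-up step is to decode and de-duplicate the URLs before writing them. The sketch below assumes Python 3; the function names unique_decoded and write_rows and the short sample list are illustrative only, and an in-memory buffer stands in for the CSV file.

```python
import csv
import io
from urllib.parse import unquote


def unique_decoded(urls):
    """Percent-decode each URL and drop repeats, keeping first-seen order."""
    seen = set()
    result = []
    for url in urls:
        decoded = unquote(url)
        if decoded not in seen:
            seen.add(decoded)
            result.append(decoded)
    return result


def write_rows(urls, fileobj):
    """Write one quoted URL per CSV row to an already-open text file object."""
    writer = csv.writer(fileobj, quoting=csv.QUOTE_ALL)
    for url in urls:
        writer.writerow([url])


# A few entries taken from the result list above
sample = [
    'http://cars.disney.com/',
    'https://www.youtube.com/watch%3Fv%3DtNmo09Q3F8s',
    'https://www.youtube.com/watch%3Fv%3DtNmo09Q3F8s',
    'http://cars.disney.com/',
]
buffer = io.StringIO()
write_rows(unique_decoded(sample), buffer)
print(buffer.getvalue())
```

To write to disk in Python 3, the buffer could be replaced with open('aaa.csv', 'w', newline=''), which is the text-mode counterpart of the 'wb' mode used under Python 2.7.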