I'm an absolute beginner at web scraping with Python and know almost nothing about Python programming. I'm trying to extract information about attorneys in Tennessee. The page contains several links, which lead to further links, which in turn lead to the individual attorneys.
If you can, could you tell me the steps I should follow?
I've gotten as far as extracting the links on the first page, but I only need the city links, whereas I'm getting every link that has an `href`
attribute. How do I iterate over them and go further?
```python
import requests  # this import was missing from the original snippet
from bs4 import BeautifulSoup as bs
import pandas as pd

res = requests.get('https://attorneys.superlawyers.com/tennessee/', headers={'User-agent': 'Super Bot 9000'})
soup = bs(res.content, 'lxml')
links = [item['href'] for item in soup.select('a')]
print(links)
```
It prints:

```
C:\Users\laptop\AppData\Local\Programs\Python\Python36-32\python.exe C:/Users/laptop/.PyCharmCE2017.1/config/scratches/scratch_1.py
['https://www.superlawyers.com', 'https://attorneys.superlawyers.com', 'https://ask.superlawyers.com', 'https://video.superlawyers.com', ....
```
All the links are extracted, whereas I only need the city links. Kindly help.
Answer 0 (score: 2)
Without a regular expression:

```python
# `soup` is the BeautifulSoup object from the question's snippet
cities = soup.find('div', class_="three_browse_columns")
for city in cities.find_all('a'):
    print(city['href'])
```
Answer 1 (score: 1)
Use the regular-expression module `re` and match on the `href` value of the city links.
```python
import requests  # this import was missing from the original snippet
import re
from bs4 import BeautifulSoup as bs

res = requests.get('https://attorneys.superlawyers.com/tennessee/', headers={'User-agent': 'Super Bot 9000'})
soup = bs(res.content, 'lxml')
links = [item['href'] for item in soup.find_all('a', href=re.compile('https://attorneys.superlawyers.com/tennessee/'))]
print(links)
```
Output:

```
['https://attorneys.superlawyers.com/tennessee/alamo/', 'https://attorneys.superlawyers.com/tennessee/bartlett/', 'https://attorneys.superlawyers.com/tennessee/brentwood/', 'https://attorneys.superlawyers.com/tennessee/bristol/', 'https://attorneys.superlawyers.com/tennessee/chattanooga/', 'https://attorneys.superlawyers.com/tennessee/clarksville/', 'https://attorneys.superlawyers.com/tennessee/cleveland/', 'https://attorneys.superlawyers.com/tennessee/clinton/', 'https://attorneys.superlawyers.com/tennessee/columbia/', 'https://attorneys.superlawyers.com/tennessee/cookeville/', 'https://attorneys.superlawyers.com/tennessee/cordova/', 'https://attorneys.superlawyers.com/tennessee/covington/', 'https://attorneys.superlawyers.com/tennessee/dayton/', 'https://attorneys.superlawyers.com/tennessee/dickson/', 'https://attorneys.superlawyers.com/tennessee/dyersburg/', 'https://attorneys.superlawyers.com/tennessee/elizabethton/', 'https://attorneys.superlawyers.com/tennessee/franklin/', 'https://attorneys.superlawyers.com/tennessee/gallatin/', 'https://attorneys.superlawyers.com/tennessee/germantown/', 'https://attorneys.superlawyers.com/tennessee/goodlettsville/', 'https://attorneys.superlawyers.com/tennessee/greeneville/', 'https://attorneys.superlawyers.com/tennessee/henderson/', 'https://attorneys.superlawyers.com/tennessee/hendersonville/', 'https://attorneys.superlawyers.com/tennessee/hixson/', 'https://attorneys.superlawyers.com/tennessee/huntingdon/', 'https://attorneys.superlawyers.com/tennessee/huntsville/', 'https://attorneys.superlawyers.com/tennessee/jacksboro/', 'https://attorneys.superlawyers.com/tennessee/jackson/', 'https://attorneys.superlawyers.com/tennessee/jasper/', 'https://attorneys.superlawyers.com/tennessee/johnson-city/', 'https://attorneys.superlawyers.com/tennessee/kingsport/', 'https://attorneys.superlawyers.com/tennessee/knoxville/', 'https://attorneys.superlawyers.com/tennessee/la-follette/', 'https://attorneys.superlawyers.com/tennessee/lafayette/', 'https://attorneys.superlawyers.com/tennessee/lafollette/', 'https://attorneys.superlawyers.com/tennessee/lawrenceburg/', 'https://attorneys.superlawyers.com/tennessee/lebanon/', 'https://attorneys.superlawyers.com/tennessee/lenoir-city/', 'https://attorneys.superlawyers.com/tennessee/lewisburg/', 'https://attorneys.superlawyers.com/tennessee/lexington/', 'https://attorneys.superlawyers.com/tennessee/madisonville/', 'https://attorneys.superlawyers.com/tennessee/manchester/', 'https://attorneys.superlawyers.com/tennessee/maryville/', 'https://attorneys.superlawyers.com/tennessee/memphis/', 'https://attorneys.superlawyers.com/tennessee/millington/', 'https://attorneys.superlawyers.com/tennessee/morristown/', 'https://attorneys.superlawyers.com/tennessee/murfreesboro/', 'https://attorneys.superlawyers.com/tennessee/nashville/', 'https://attorneys.superlawyers.com/tennessee/paris/', 'https://attorneys.superlawyers.com/tennessee/pleasant-view/', 'https://attorneys.superlawyers.com/tennessee/pulaski/', 'https://attorneys.superlawyers.com/tennessee/rogersville/', 'https://attorneys.superlawyers.com/tennessee/sevierville/', 'https://attorneys.superlawyers.com/tennessee/sewanee/', 'https://attorneys.superlawyers.com/tennessee/shelbyville/', 'https://attorneys.superlawyers.com/tennessee/somerville/', 'https://attorneys.superlawyers.com/tennessee/spring-hill/', 'https://attorneys.superlawyers.com/tennessee/springfield/', 'https://attorneys.superlawyers.com/tennessee/tullahoma/', 'https://attorneys.superlawyers.com/tennessee/white-house/', 'https://attorneys.superlawyers.com/tennessee/winchester/', 'https://attorneys.superlawyers.com/tennessee/woodlawn/']
```
If you'd rather use a CSS selector, use the following code.
```python
import requests
from bs4 import BeautifulSoup as bs

res = requests.get('https://attorneys.superlawyers.com/tennessee/', headers={'User-agent': 'Super Bot 9000'})
soup = bs(res.content, 'lxml')
links = [item['href'] for item in soup.select('a[href^="https://attorneys.superlawyers.com/tennessee"]')]
print(links)
```
This prints the same list of city links as the regex version above.
Answer 2 (score: 1)
A faster approach is to use the parent element's id and then select the `a`
tags inside it:
```python
import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://attorneys.superlawyers.com/tennessee/')
soup = bs(r.content, 'lxml')
cities = [item['href'] for item in soup.select('#browse_view a')]
print(cities)
```
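To address the "how do I iterate them and go further" part of the question: once you have the city links, loop over them, fetch each city page, and extract the attorney links from it with the same BeautifulSoup pattern. A minimal sketch follows; the sample HTML, the `#results` container, and the profile URLs are made up for illustration (inspect a real city page in the browser to find the right selector), and `html.parser` is used here so the example has no extra parser dependency.

```python
from bs4 import BeautifulSoup as bs

# A tiny stand-in for one city page; the real structure must be inspected in the browser.
sample_city_html = """
<div id="results">
  <a href="https://profiles.superlawyers.com/tennessee/nashville/lawyer/jane-doe/1.html">Jane Doe</a>
  <a href="https://profiles.superlawyers.com/tennessee/nashville/lawyer/john-roe/2.html">John Roe</a>
</div>
"""

def extract_profile_links(html, selector='#results a'):
    """Return the href of every anchor matching the CSS selector."""
    soup = bs(html, 'html.parser')
    return [a['href'] for a in soup.select(selector)]

profiles = extract_profile_links(sample_city_html)
print(profiles)
```

In practice you would replace the sample HTML with a real download, e.g. `requests.get(city_url, headers={'User-agent': 'Super Bot 9000'}).content` inside a `for city_url in cities:` loop, collecting the results into one list; adding a short `time.sleep()` between requests keeps the scraper polite.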