Python BeautifulSoup: extracting HTML table cells that contain images and text

Date: 2017-07-25 17:04:46

Tags: python web-scraping beautifulsoup

I want to extract a table from a URL, but I'm lost... Here is what I have done so far:

import requests
from bs4 import BeautifulSoup

url = "https://www.marinetraffic.com/en/ais/index/ports/all/per_page:50"

headers = {'User-agent': 'Mozilla/5.0'}
raw_html = requests.get(url, headers=headers)

raw_data = raw_html.text
soup_data = BeautifulSoup(raw_data, "lxml")

td = soup_data.findAll('tr')[1:]

country = []

for data in td:
    col = data.find_all('td')
    country.append(col)

How can I get the text and URLs of certain columns (country, port name, UN/LOCODE, type, and port map)?

1 answer:

Answer 0: (score: 1)

I did some of the scraping for you. You can use a dictionary whose keys are the table headers, as shown below. Iterate over the individual td elements to pick out the columns you need, then use .text to get a cell's text and attributes such as src or href to get its URLs.

import requests
from bs4 import BeautifulSoup

url = "https://www.marinetraffic.com/en/ais/index/ports/all/per_page:50"

headers = {'User-agent': 'Mozilla/5.0'}
raw_html = requests.get(url, headers=headers)

raw_data = raw_html.text
soup_data = BeautifulSoup(raw_data, "lxml")

# Skip the header row
td = soup_data.findAll('tr')[1:]

country = []

for data in td:
    cols = data.find_all('td')
    details = {}
    for i, col in enumerate(cols):
        if i == 0:
            # Flag image: prefix the relative src with the site root
            details['Img-src'] = "https://www.marinetraffic.com" + col.find('img')['src']
        if i == 1:
            details["Port_name"] = col.text.replace('\n', '')
        if i == 2:
            details['UN/LOCODE'] = col.text.replace('\r\n', '').replace(" ", "")
        if i == 4:
            details['type'] = col.text.replace('\r\n', '').replace(" ", "")
        if i == 5:
            # Map link: prefix the relative href with the site root
            details['map_url'] = "https://www.marinetraffic.com" + col.find('a')['href']
    country.append(details)

Hope this helps.
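The per-cell extraction described above can be tried offline on a small inline table, without fetching the live page. This is only a minimal sketch. The `SAMPLE_HTML` markup, its column order (including a "Photos" column at index 3), and the `parse_ports` helper are assumptions made for illustration, and the real page may be laid out differently. It uses `get_text(strip=True)` to trim whitespace and `urljoin` to build absolute URLs, as an alternative to chained `replace` calls and string concatenation.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://www.marinetraffic.com"

# Hypothetical sample markup; the real page's columns may differ.
SAMPLE_HTML = """
<table>
  <tr><th>Country</th><th>Port Name</th><th>UN/LOCODE</th>
      <th>Photos</th><th>Type</th><th>Map</th></tr>
  <tr>
    <td><img src="/img/flags/png40/CN.png" alt="China"></td>
    <td>
      SHANGHAI
    </td>
    <td> CNSHA </td>
    <td></td>
    <td> Port </td>
    <td><a href="/en/ais/home/portid:1253">map</a></td>
  </tr>
</table>
"""


def parse_ports(html, base_url=BASE_URL):
    """Return one dict per data row (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    ports = []
    for row in soup.find_all("tr"):
        cells = row.find_all("td")
        if len(cells) < 6:
            continue  # header row (only <th> cells) or a short row
        img = cells[0].find("img")
        link = cells[5].find("a")
        ports.append({
            # Guard against cells that lack the expected tag
            "Img-src": urljoin(base_url, img["src"]) if img else None,
            "Port_name": cells[1].get_text(strip=True),
            "UN/LOCODE": cells[2].get_text(strip=True),
            "type": cells[4].get_text(strip=True),
            "map_url": urljoin(base_url, link["href"]) if link else None,
        })
    return ports


ports = parse_ports(SAMPLE_HTML)
print(ports)
```

Checking for a missing `img` or `a` before indexing avoids a `TypeError` on rows where a cell is empty, which fixed-index code would otherwise raise.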

Output:

[{'Img-src': 'https://www.marinetraffic.com/img/flags/png40/CN.png',
  'Port_name': 'SHANGHAI',
  'UN/LOCODE': 'CNSHA',
  'map_url': 'https://www.marinetraffic.com/en/ais/home/zoom:9/centerx:121.614746/centery:31.3663635/showports:true/portid:1253',
  'type': 'Port'},
 {'Img-src': 'https://www.marinetraffic.com/img/flags/png40/CN.png',
  'Port_name': 'MAANSHAN',
  'UN/LOCODE': 'CNMAA',
  'map_url': 'https://www.marinetraffic.com/en/ais/home/zoom:14/centerx:118.459503/centery:31.7180004/showports:true/portid:2746',
  'type': 'Port'},
 {'Img-src': 'https://www.marinetraffic.com/img/flags/png40/HK.png',
  'Port_name': 'HONG KONG',
  'UN/LOCODE': 'HKHKG',
  'map_url': 'https://www.marinetraffic.com/en/ais/home/zoom:14/centerx:114.181366/centery:22.2879486/showports:true/portid:2429',
  'type': 'Port'}, 
  ...
  ]