How to extract data from a list of URLs for web scraping

Time: 2020-08-07 14:56:41

Tags: web-scraping beautifulsoup python-requests

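I am new to web scraping. I want to extract the coordinates from a <div> tag on pages that I reach by URL. I have a list of URLs, and for each one I want to pull out the coordinates and save them to a CSV file.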

 <div class="single-view-data-row">
 <div class="single-view-data-title">Coordinates</div>
 <div class="single-view-data-get">
                                 17.009164 N, -90.309259 E<br/><a href="http://geographiclib.sourceforge.net/cgi-bin/GeoConvert?input=17.009164+-90.309259" target="_blank">»» UTM / MGRS</a></div></div></div>

Thanks for the help!

1 answer:

Answer 0: (score: 0)

To extract the link and the coordinates from this HTML text, you can use the following script:

from bs4 import BeautifulSoup

txt = ''' <div class="single-view-data-row">
 <div class="single-view-data-title">Coordinates</div>
 <div class="single-view-data-get">
                                 17.009164 N, -90.309259 E<br/><a href="http://geographiclib.sourceforge.net/cgi-bin/GeoConvert?input=17.009164+-90.309259" target="_blank">»» UTM / MGRS</a></div></div></div>
'''

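# Parse the snippet with Python's built-in HTML parser
soup = BeautifulSoup(txt, 'html.parser')

# The <a> inside the coordinates cell carries the GeoConvert link
link = soup.select_one('.single-view-data-get a')['href']
# The first text node after the opening <div> holds "latitude, longitude"
# (string=True is the current name for the older text=True argument)
coords = soup.select_one('.single-view-data-get').find_next(string=True).split(',')

print(link)
print(coords[0].strip())
print(coords[1].strip())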

Prints:

http://geographiclib.sourceforge.net/cgi-bin/GeoConvert?input=17.009164+-90.309259
17.009164 N
-90.309259 E
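
The question also asks about a list of URLs and a CSV file, which the script above does not cover. Below is a minimal sketch of that loop, assuming every page contains the same `single-view-data-get` block and that the coordinates are the first text inside it. The function names `extract_coordinates`, `fetch_html` and `scrape_to_csv`, the 10-second timeout, and the CSV column names are hypothetical choices for illustration and do not come from the question.

```python
import csv

import requests
from bs4 import BeautifulSoup


def extract_coordinates(html):
    """Return (latitude, longitude) as strings, or None if the block is missing."""
    soup = BeautifulSoup(html, "html.parser")
    cell = soup.select_one(".single-view-data-get")
    if cell is None:
        return None
    # The first text node inside the div should hold "17.009164 N, -90.309259 E".
    text = cell.find(string=True)
    if text is None or "," not in text:
        return None
    lat, lon = (part.strip() for part in text.split(",", 1))
    return lat, lon


def fetch_html(url):
    """Download one page; raises on HTTP errors (timeout value is an assumption)."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def scrape_to_csv(urls, out_path, fetch=fetch_html):
    """Write one CSV row per URL; pages without coordinates get empty cells."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["url", "latitude", "longitude"])
        for url in urls:
            coords = extract_coordinates(fetch(url))
            if coords is None:
                writer.writerow([url, "", ""])
            else:
                writer.writerow([url, *coords])
```

With your own list in a variable such as `urls`, a call like `scrape_to_csv(urls, "coordinates.csv")` would produce the file. The `fetch` parameter is there only so the download step can be swapped out, for example to add retries or to test against saved HTML without touching the network.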