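I'm trying to scrape golf course data from a given website and write it to a CSV containing each course's name and address. The site I'm pulling the data from splits the address with a <br> tag. Is it possible to parse the two parts of the address into two separate columns?
In the HTML it looks like this:
<div class="location">10799 E 550 S<br>Zionsville, Indiana, United States</div>
I'd like it to be broken up as
Column1: 10799 E 550 S
Column2: Zionsville, Indiana, United States
Here is my code: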
import csv
import requests
from bs4 import BeautifulSoup
courses_list = []
with open('Garmin_GC.csv', 'w') as file:
    writer = csv.writer(file)
    for i in range(3): #893
        url = "http://sites.garmin.com/clsearch/courses/search?course=&location=&country=US&state=&holes=&radius=&lang=en&search_submitted=1&per_page={}".format(
            i * 20)
        r = requests.get(url)
        soup = BeautifulSoup(r.text)
        g_data2 = soup.find_all("div", {"class": "result"})
        for item in g_data2:
            try:
                name = item.find_all("div", {"class": "name"})[0].text
            except IndexError:
                name = ''
                print "No Name found!"
            try:
                address = item.find_all("div", {"class": "location"})[0].get_text(separator=' ')
                print address
            except IndexError:
                address = ''
                print "No Address found!"
            writer.writerow([name.encode("utf-8"), address.encode("utf-8")])
Answer 0 (score: 1)
Use the .stripped_strings generator:
address = list(item.find('div', class_='location').stripped_strings)
This produces a list of two strings:
>>> from bs4 import BeautifulSoup
>>> markup = '''<div class="location">10799 E 550 S<br>Zionsville, Indiana, United States</div>'''
>>> soup = BeautifulSoup(markup)
>>> list(soup.find('div', class_='location').stripped_strings)
[u'10799 E 550 S', u'Zionsville, Indiana, United States']
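To make the difference from the question's get_text(separator=' ') call concrete, here is a minimal sketch that runs both on the same markup. It assumes Python 3 and passes "html.parser" explicitly (the session above is Python 2 and relies on the default parser); the variable names joined and parts are only illustrative.

```python
from bs4 import BeautifulSoup

markup = '<div class="location">10799 E 550 S<br>Zionsville, Indiana, United States</div>'

# Assumption: "html.parser" (the standard-library parser) is named explicitly
# so the result does not depend on which optional parsers are installed.
soup = BeautifulSoup(markup, "html.parser")
div = soup.find("div", class_="location")

# What the question's code does: every text node joined into ONE string.
joined = div.get_text(separator=" ")

# What this answer suggests: one stripped string per text node,
# so the text on either side of <br> stays separate.
parts = list(div.stripped_strings)

print(joined)
print(parts)
```

The first call collapses the address into a single value, which is why it ends up in one CSV column; the second keeps the pieces apart so each can go into its own column.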
Putting that in the context of your code:
try:
    name = item.find('div', class_='name').text
except AttributeError:
    name = u''
try:
    address = list(item.find('div', class_='location').stripped_strings)
except AttributeError:
    address = [u'', u'']
writer.writerow([v.encode("utf-8") for v in [name] + address])
This writes the two address values to two separate columns.
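One case the snippet above does not handle: a location div with no <br>, or with more than one, would yield one string or three or more, and the CSV columns would shift. The following is a rough sketch of one way to guard against that, assuming Python 3 (where csv writes text, so the .encode("utf-8") calls are not needed). The helper to_two_columns and the sample rows are hypothetical and not part of the original answer; an in-memory buffer stands in for the real file.

```python
import csv
import io


def to_two_columns(parts):
    """Hypothetical helper: coerce a list of address strings into exactly two fields."""
    parts = list(parts)
    if not parts:
        return ["", ""]
    if len(parts) == 1:
        return [parts[0], ""]
    # Keep the first piece as the street line and merge the rest into the second field.
    return [parts[0], ", ".join(parts[1:])]


# Placeholder rows standing in for scraped results (names are made up;
# the address is the one from the question).
rows = [
    ("Course A", ["10799 E 550 S", "Zionsville, Indiana, United States"]),
    ("Course B", ["Zionsville, Indiana, United States"]),
    ("Course C", []),
]

# In-memory stand-in for open('Garmin_GC.csv', 'w', newline='').
buffer = io.StringIO()
writer = csv.writer(buffer)
for name, parts in rows:
    writer.writerow([name] + to_two_columns(parts))

print(buffer.getvalue())
```

In the scraping loop, the same helper could wrap the stripped_strings result before the row is written, so every row has exactly three fields. Merging extra pieces into the second column is just one possible policy; placing a lone string in the first column is likewise a guess about what the site returns.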