Parsing markup with Python and BeautifulSoup

Date: 2015-07-09 20:48:19

Tags: python beautifulsoup

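I'm trying to extract golf course data from a given website and write it to a CSV containing each course's name and address. The problem is the address: the website I'm pulling the data from splits it into two parts with a <br> tag. Is it possible to parse the two parts of the address into two separate columns?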

In the HTML it looks like this:
<div class="location">10799 E 550 S<br>Zionsville, Indiana, United States</div>

I would like it to be split into:

Column1:10799 E 550 S
Column2:Zionsville, Indiana, United States

Here is my code:

import csv
import requests
from bs4 import BeautifulSoup

courses_list = []

with open('Garmin_GC.csv', 'w') as file:
    writer = csv.writer(file)
    for i in range(3):  #893
        url = "http://sites.garmin.com/clsearch/courses/search?course=&location=&country=US&state=&holes=&radius=&lang=en&search_submitted=1&per_page={}".format(
            i * 20)
        r = requests.get(url)
        soup = BeautifulSoup(r.text)
        g_data2 = soup.find_all("div", {"class": "result"})
        for item in g_data2:
            try:
                name = item.find_all("div", {"class": "name"})[0].text
            except IndexError:
                name = ''
                print "No Name found!"
            try:    
                address = item.find_all("div", {"class": "location"})[0].get_text(separator=' ')
                print address
            except IndexError:
                address = ''
                print "No Address found!"
            writer.writerow([name.encode("utf-8"), address.encode("utf-8")])

1 answer:

Answer 0 (score: 1)

Use the .stripped_strings generator:

address = list(item.find('div', class_='location').stripped_strings)

This produces a list of two strings:

>>> from bs4 import BeautifulSoup
>>> markup = '''<div class="location">10799 E 550 S<br>Zionsville, Indiana, United States</div>'''
>>> soup = BeautifulSoup(markup)
>>> list(soup.find('div', class_='location').stripped_strings)
[u'10799 E 550 S', u'Zionsville, Indiana, United States']
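
To make the difference from the question's get_text(separator=' ') call concrete, here is a minimal sketch that runs both approaches on the same div. The markup is a hypothetical variant of the snippet above with extra whitespace added, and the explicit "html.parser" argument is an assumption (the original code lets BeautifulSoup pick a parser):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: same address, but with extra whitespace and newlines.
markup = """<div class="location">
    10799 E 550 S <br/>
    Zionsville, Indiana, United States
</div>"""

soup = BeautifulSoup(markup, "html.parser")
div = soup.find("div", class_="location")

# One joined string, similar to what the question's code produces.
joined = div.get_text(separator=" ", strip=True)

# One string per text node, with surrounding whitespace stripped.
parts = list(div.stripped_strings)

print(joined)
print(parts)
```

The joined string loses the boundary that the <br> marked, while stripped_strings keeps each text node as its own item, which is what makes the two-column split possible.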

Putting this in the context of your code:

try:
    name = item.find('div', class_='name').text
except AttributeError:
    name = u''
try:
    address = list(item.find('div', class_='location').stripped_strings)
except AttributeError:
    address = [u'', u'']
writer.writerow([v.encode("utf-8") for v in [name] + address])

Here the two address values are written to two separate columns.
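
The [u'', u''] fallback only covers a missing div. If a location div ever held one text node, or more than two, the rows would end up with uneven column counts. Whether the site actually has such entries is an assumption, not something the question shows. A rough sketch of one way to guard against it, written for Python 3 (where csv works with text and no .encode() is needed); the helper name to_two_columns and the sample rows are hypothetical:

```python
import csv
import io


def to_two_columns(parts):
    """Force a list of address strings into exactly two columns."""
    parts = list(parts)
    if len(parts) > 2:
        # Keep the first line as the street; merge the rest into one field.
        parts = [parts[0], ", ".join(parts[1:])]
    # Pad with empty strings when fewer than two parts were found.
    return parts + [""] * (2 - len(parts))


# Hypothetical inputs standing in for list(tag.stripped_strings).
samples = [
    ("Course A", ["10799 E 550 S", "Zionsville, Indiana, United States"]),
    ("Course B", ["Zionsville, Indiana, United States"]),
    ("Course C", []),
]

buffer = io.StringIO()  # in-memory stand-in for the CSV file
writer = csv.writer(buffer)
for name, parts in samples:
    writer.writerow([name] + to_two_columns(parts))

print(buffer.getvalue())
```

Note that a single-part address lands in the first column regardless of whether it is a street or a city line; deciding which column it belongs in would need site-specific rules.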