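I have a large HTML file and I need to parse some data out of it using regular expressions. The first item is the restaurant name. The restaurant names appear in the following format:
Update: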
<html><head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8"></head><body><div class="businessresult clearfix">
<div class="leftcol">
<div id="bizTitle0" class="itemheading">
<a href="https://courses.ischool.berkeley.edu/biz/capannina-san-francisco" id="bizTitleLink0">1. Capannina
</a>
</div>
<div class="itemcategories">
Categories: <a href="https://courses.ischool.berkeley.edu/search?mapsize=small&main_places=CA%3ASan_Francisco%3A%3ASOMA%2CCA%3ASan_Francisco%3A%3APacific_Heights%2CCA%3ASan_Francisco%3A%3AMission%2CCA%3ASan_Francisco%3A%3AHaight-Ashbury&places=CA%3ASan_Francisco%3A%3A%5BSOMA%2CMission%2CMarina%2FCow_Hollow%5D&rpp=40&bbox=-122.471809387%2C37.7384127869%2C-122.368125916%2C37.8203616433&attrs=&sortby=category&show_more_search_options=true&cflt=italian&find_loc=san+francisco%2C+ca" rel="italian" class="category" id="cat_result_0_italian">Italian</a>, <a href="https://courses.ischool.berkeley.edu/search?mapsize=small&main_places=CA%3ASan_Francisco%3A%3ASOMA%2CCA%3ASan_Francisco%3A%3APacific_Heights%2CCA%3ASan_Francisco%3A%3AMission%2CCA%3ASan_Francisco%3A%3AHaight-Ashbury&places=CA%3ASan_Francisco%3A%3A%5BSOMA%2CMission%2CMarina%2FCow_Hollow%5D&rpp=40&bbox=-122.471809387%2C37.7384127869%2C-122.368125916%2C37.8203616433&attrs=&sortby=category&show_more_search_options=true&cflt=seafood&find_loc=san+francisco%2C+ca" rel="seafood" class="category" id="cat_result_0_seafood">Seafood</a>
</div>
<div class="itemneighborhoods">
Neighborhood: <a href="https://courses.ischool.berkeley.edu/search?find_desc=&mapsize=small&main_places=CA%3ASan_Francisco%3A%3ASOMA%2CCA%3ASan_Francisco%3A%3APacific_Heights%2CCA%3ASan_Francisco%3A%3AMission%2CCA%3ASan_Francisco%3A%3AHaight-Ashbury&places=CA%3ASan_Francisco%3A%3A%5BSOMA%2CMission%2CMarina%2FCow_Hollow%5D&attrs=&sortby=category&cflt=italian&show_more_search_options=true&parent_request_id=9536eaa25db61373&find_loc=Marina%2FCow+Hollow%2C+San+Francisco%2C+CA" title="Marina/Cow Hollow, San Francisco, CA" class="location" id="hood_result_0_0">Marina/Cow Hollow</a>
</div>
</div>
<div class="rightcol">
<div class="rating"><img src="yelp_listings_files/stars_map.html" alt="4 star rating" title="4 star rating" class="stars_4 " height="325" width="83"></div> <a class="reviews" href="https://courses.ischool.berkeley.edu/biz/capannina-san-francisco">270 reviews</a>
<address>
1809 Union St<br>San Francisco, CA 94123<br>
</address><div class="phone">
(415) 409-8001
</div>
</div>
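There are 40 restaurants in total. I think there are two spaces after the "." that follows each number. I need to list all the restaurants from 1 to 40. I tried using: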
re.findall("[./0-9]", string_Name)
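This outputs only the digits. I want to get the number together with all the restaurant names. How can I do this?
Blender's answer gives the ratings and the list of restaurants. That works well, but I want the ratings and the restaurant names in separate variables.
Answer 0 (score: 5)
Parse the HTML: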
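import re
from bs4 import BeautifulSoup

html = '''
<a href="https://courses.ischool.berkeley.edu/biz/capannina-san-francisco" id="bizTitleLink0">1. Capannina
</a>
<a href="https://courses.ischool.berkeley.edu/biz/ristorante-parma-san-francisco" id="bizTitleLink4">5. Ristorante Parma
</a>
'''

# Name the parser explicitly so bs4 does not have to guess one.
soup = BeautifulSoup(html, 'html.parser')

# Select links whose text starts with a digit, e.g. "1. Capannina".
for link in soup.find_all('a', string=re.compile(r'^\d')):
    # strip=True drops the trailing newline inside each <a> element.
    print(link.get_text(strip=True))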
Output:
1. Capannina
5. Ristorante Parma
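The printed strings still combine the number and the name. Below is a minimal sketch of splitting them into separate variables, assuming the link texts have first been collected into a list. The list name `titles` and its contents are hypothetical stand-ins for that collected text, not part of the original answer.

```python
import re

# Hypothetical input: the link texts gathered from the loop above,
# including the trailing newline that sits inside each <a> element.
titles = ["1. Capannina\n", "5. Ristorante Parma\n"]

numbers = []
names = []
for title in titles:
    # Digits, a literal dot, one or more whitespace characters
    # (covers one or two spaces), then the rest of the line as the name.
    match = re.match(r'\s*(\d+)\.\s+(.+)', title)
    if match:
        numbers.append(int(match.group(1)))
        names.append(match.group(2).strip())

print(numbers)  # [1, 5]
print(names)    # ['Capannina', 'Ristorante Parma']
```

Keeping two parallel lists preserves the pairing by index, so `names[i]` belongs to `numbers[i]`.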
Answer 1 (score: 0)
You shouldn't run regular expressions directly on HTML (it is better to use an HTML parser first), but try this regex:
(\d+)\.\s+([^<]+)
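one or more digits (\d+)
a literal dot (\.)
one or more whitespace characters (\s+)
one or more characters other than < ([^<]+)
Each pair of parentheses () creates a capture group. Capture group 1 will contain the number, and capture group 2 will contain the name.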
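As a rough illustration of how the two capture groups could be used from Python, here is a sketch with `re.findall` run on the two sample links from the question. The variable names (`pattern`, `pairs`, `numbers`, `names`) are my own. On the full page the pattern may also match unrelated text that happens to look like "digits, dot, space", so treat this as a starting point rather than a robust parser.

```python
import re

html = '''
<a href="https://courses.ischool.berkeley.edu/biz/capannina-san-francisco" id="bizTitleLink0">1. Capannina
</a>
<a href="https://courses.ischool.berkeley.edu/biz/ristorante-parma-san-francisco" id="bizTitleLink4">5. Ristorante Parma
</a>
'''

pattern = re.compile(r'(\d+)\.\s+([^<]+)')

# With two capture groups, findall returns one (group 1, group 2) tuple per match.
pairs = pattern.findall(html)

# Group 2 runs up to the next "<", so it includes the trailing newline; strip it.
numbers = [int(num) for num, _ in pairs]
names = [name.strip() for _, name in pairs]

print(numbers)  # [1, 5]
print(names)    # ['Capannina', 'Ristorante Parma']
```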