How do I parse nested items with BeautifulSoup?

Time: 2016-04-07 22:40:01

Tags: python html beautifulsoup

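I am trying to parse this site: https://www.scutify.com/stocks.html. I think that because of frames (I am new to HTML), the stocks (e.g. 1-800-Flowers) do not show up when I use BeautifulSoup to parse the HTML at that link. So I saved the page as an .htm file, and now I can see the stocks.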

The .htm file looks like this:

<title>Stocks/ETFs Listing - US, Canadian, UK, Australian and Indian Stocks on Scutify</title> 
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="description" content="Listing of US, Canadian, UK, Australian and Indian Stocks and ETFs available on Scutify"> 
<meta name="keywords"  
<ul class="company-list list-group" id="us-list">
 <li class="list-group-item">
   <a href="https://www.scutify.com/company.aspx?ticker=FLWS">1-800-Flowers.Com Inc - (FLWS)</a></li>
 <li class="list-group-item">
   <a href="https://www.scutify.com/company.aspx?ticker=FOX">21st Century Fox Inc - (FOX)</a></li>
....

I tried the script below:

from bs4 import BeautifulSoup

downloadedfile = "C:/Users/vwxyz/Downloads/Stocks_ETFs.htm"
htm = open(downloadedfile, 'r')
soup = BeautifulSoup(htm)
stocklist = soup.find("ul", class_="company-list list-group")
print(stocklist)

However, it prints out a whole lot of text. I just want a list of the stock tickers, i.e.

FLWS
FOX
...

Can anyone help?

1 Answer:

Answer 0: (score: 0)

Iterate over the items in the stock list and extract the part in parentheses:

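import re

from bs4 import BeautifulSoup

# Path taken from the question.
downloadedfile = "C:/Users/vwxyz/Downloads/Stocks_ETFs.htm"

with open(downloadedfile, "r") as htm:
    soup = BeautifulSoup(htm, "html.parser")

# "(", one or more capital letters (captured), then ")" at the end of the text.
pattern = re.compile(r"\(([A-Z]+)\)$")
for item in soup.select(".company-list .list-group-item"):
    match = pattern.search(item.get_text())
    if match:
        print(match.group(1))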

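Here \(([A-Z]+)\)$ matches an opening parenthesis, followed by one or more uppercase letters (captured in a group), followed by a closing parenthesis at the end of the string.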
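
To try this without the downloaded file, here is a minimal sketch that runs on the two list items shown in the question. The helper names tickers_from_text and tickers_from_href are hypothetical, introduced only for illustration. The second helper is an alternative that is not part of the answer above: it reads the ticker query parameter from each link's href, which may be more robust if a symbol contains characters other than A-Z (the pattern [A-Z]+ would skip such entries). It assumes every link carries a ticker parameter, as the two sample links do.

```python
import re
from urllib.parse import parse_qs, urlparse

from bs4 import BeautifulSoup

# Sample markup copied from the question's snippet (two list items only).
SAMPLE_HTML = """
<ul class="company-list list-group" id="us-list">
 <li class="list-group-item">
   <a href="https://www.scutify.com/company.aspx?ticker=FLWS">1-800-Flowers.Com Inc - (FLWS)</a></li>
 <li class="list-group-item">
   <a href="https://www.scutify.com/company.aspx?ticker=FOX">21st Century Fox Inc - (FOX)</a></li>
</ul>
"""

# Same pattern as in the answer: "(", capital letters (captured), ")" at the end.
TICKER_PATTERN = re.compile(r"\(([A-Z]+)\)$")


def tickers_from_text(html):
    """Hypothetical helper: take the ticker from the trailing parentheses."""
    soup = BeautifulSoup(html, "html.parser")
    tickers = []
    for item in soup.select(".company-list .list-group-item"):
        # strip=True strips each text fragment, so no whitespace can sit
        # between the closing ")" and the end of the string.
        match = TICKER_PATTERN.search(item.get_text(strip=True))
        if match:
            tickers.append(match.group(1))
    return tickers


def tickers_from_href(html):
    """Hypothetical helper: read the `ticker` query parameter of each link."""
    soup = BeautifulSoup(html, "html.parser")
    tickers = []
    for link in soup.select(".company-list .list-group-item a[href]"):
        query = parse_qs(urlparse(link["href"]).query)
        # parse_qs maps each parameter name to a list of values.
        tickers.extend(query.get("ticker", []))
    return tickers


print(tickers_from_text(SAMPLE_HTML))
print(tickers_from_href(SAMPLE_HTML))
```

On this sample both calls print ['FLWS', 'FOX']. For the real page, pass the contents of the saved .htm file instead of SAMPLE_HTML.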