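I'm trying to parse this website: https://www.scutify.com/stocks.html. I think that because of frames (I'm new to HTML), the stocks (e.g. 1-800-Flowers) don't show up when I parse the HTML link with BeautifulSoup. So I saved the page as an .htm file instead, and now I can see the stocks.
The .htm file looks like this: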
<title>Stocks/ETFs Listing - US, Canadian, UK, Australian and Indian Stocks on Scutify</title>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="description" content="Listing of US, Canadian, UK, Australian and Indian Stocks and ETFs available on Scutify">
<meta name="keywords"
<ul class="company-list list-group" id="us-list">
<li class="list-group-item">
<a href="https://www.scutify.com/company.aspx?ticker=FLWS">1-800-Flowers.Com Inc - (FLWS)</a></li>
<li class="list-group-item">
<a href="https://www.scutify.com/company.aspx?ticker=FOX">21st Century Fox Inc - (FOX)</a></li>
....
I tried the script below:
from bs4 import BeautifulSoup

downloadedfile = "C:/Users/vwxyz/Downloads/Stocks_ETFs.htm"
htm = open(downloadedfile, 'r')
soup = BeautifulSoup(htm, "html.parser")
stocklist = soup.find("ul", class_="company-list list-group")
print(stocklist)
However, this prints out a whole bunch of text. I only want a list of the stock tickers, i.e.
FLWS
FOX
...
Can anyone help?
Answer 0 (score: 0)
Iterate over the items in the stock list and extract the part in parentheses:
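import re
from bs4 import BeautifulSoup

# htm is the file object opened in the question
soup = BeautifulSoup(htm, "html.parser")
# "(" + one or more uppercase letters (captured) + ")" at the end of the string
pattern = re.compile(r"\(([A-Z]+)\)$")
for item in soup.select(".company-list .list-group-item"):
    # strip=True drops surrounding whitespace so "$" lines up with the ")"
    match = pattern.search(item.get_text(strip=True))
    if match:
        print(match.group(1))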
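where \(([A-Z]+)\)$
matches an opening parenthesis, followed by one or more uppercase letters (captured in a group), followed by a closing parenthesis at the end of the string.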
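To see what the pattern does on its own, here is a minimal sketch that uses only the standard-library re module on the two link texts quoted in the question. The helper name extract_ticker and the third sample string are hypothetical, added purely for illustration.

```python
import re

# Same pattern as in the answer: "(", one or more capital letters
# (captured as group 1), then ")" at the very end of the string.
TICKER_PATTERN = re.compile(r"\(([A-Z]+)\)$")


def extract_ticker(text):
    """Return the ticker in trailing parentheses, or None if there is none.

    Hypothetical helper, not part of BeautifulSoup or the original answer.
    """
    match = TICKER_PATTERN.search(text.strip())
    return match.group(1) if match else None


samples = [
    "1-800-Flowers.Com Inc - (FLWS)",  # link text from the question
    "21st Century Fox Inc - (FOX)",    # link text from the question
    "No ticker here",                  # made-up string with no parentheses
]
tickers = [extract_ticker(s) for s in samples]
print(tickers)  # ['FLWS', 'FOX', None]
```

Note that [A-Z]+ only accepts capital letters, so a ticker containing a digit or a dot would be skipped; if the saved page has such entries, the character class would need to be widened.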