Using the code below, I want to get a Python DataFrame from the HTML output. Can this be done with a Python package? See the linked web page for the table format.
from bs4 import BeautifulSoup
import urllib.request
r = urllib.request.urlopen("https://www.zacks.com/zrank/sector-industry-classification.php").read()
soup = BeautifulSoup(r, "html.parser")
soup.find_all("script")[16]
The script that is returned:
<script>window.app_data =
{
columns : [
{ "mDataProp" : "Sector Group"
, "sTitle" : "Sector Group"
, "sClass" : "alpha"
, "bSortable" : true
}
,
{
"mDataProp" : "Sector Code"
, "sTitle" : "Sector Code"
, "sClass" : ""
, "bSortable" : false
}
,
{
"mDataProp" : "Medium(M) Industry Group"
, "sTitle" : "Medium(M) Industry Group"
, "sClass" : "alpha"
, "bSortable" : false
}
The data portion contains:
data" : [ { "Sector Group" : "<span title=\"Index\" >Index</span>", "Sector Code" : "0", "Medium(M) Industry Group" : "<span title=\"Indices\" >Indices</span>", "Medium(M) Industry Code" : "0", "Expanded(X) Industry Group" : "<span title=\"Indicies\" >Indicies</span>", "Expanded(X) Industry Code" : "400" } , { "Sector Group" : "<span title=\"Consumer Staples\" >Consumer Staple...</span>", "Sector Code" : "1", "Medium(M) Industry Group" : "<span title=\"Food\" >Food</span>", "Medium(M) Industry Code" : "3", "Expanded(X) Industry Group" : "<span title=\"Food - Meat Products\" >Food - Meat Pro...</span>", "Expanded(X) Industry Code" : "75" } , { "Sector Group" : "<span title=\"Consumer Staples\" >Consumer Staple...</span>", "Sector Code" : "1", "Medium(M) Industry Group" : "<span title=\"Cons Prod-misc Staples\" >Cons Prod-misc...</span>", "Medium(M) Industry Code" : "7", "Expanded(X) Industry Group" : "<span title=\"Funeral Services\" >Funeral Service...</span>", "Expanded(X) Industry Code" : "78" } , { "Sector Group" : "<span title=\"Consumer Staples\" >Consumer Staple...</span>", "Sector Code" : "1", "Medium(M) Industry Group" : "<span title=\"Food\" >Food</span>", "Medium(M) Industry Code" : "3", "Expanded(X) Industry Group" : "<span title=\"Food - Confectionery\" >Food - Confecti...</span>", "Expanded(X) Industry Code" : "72" } , { "Sector Group"
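Note: there is too much data to paste in full. I also tried the following, since other answers suggested a similar approach, except that I applied it to the whole document: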
import re
pattern = re.compile("'.*': '.*'")
fields = dict(re.findall(pattern, soup))
print(fields)
The output is {}
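An empty result is what that pattern should give on this input: it looks for single-quoted pairs, while the script text uses double quotes with spaces around the colon, and without capture groups `re.findall` would not return (key, value) tuples for `dict()` anyway. The sketch below shows this on a short sample string modelled on the output above; the name `quoted_pair` and its pattern are illustrative assumptions, not part of the original attempt.

```python
import re

# Short sample in the same style as the script output above.
sample = '{ "mDataProp" : "Sector Code" , "sTitle" : "Sector Code" , "sClass" : "" }'

# The original pattern expects single quotes and no space before the colon.
original = re.compile("'.*': '.*'")
no_matches = original.findall(sample)
print(no_matches)

# Assumed replacement: double quotes, optional whitespace around the colon,
# and two capture groups so that findall yields (key, value) tuples.
quoted_pair = re.compile(r'"([^"]+)"\s*:\s*"([^"]*)"')
fields = dict(quoted_pair.findall(sample))
print(fields)
```

Note that `re.findall` needs a string, so the parsed page would have to be passed as `str(soup)`, and that unquoted values such as `true` are not matched by this pattern.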
Answer 0 (score: 2)
I'm sure there is a better way to achieve this, but it gives you what you want. Also, Selenium + PhantomJS is a better fit for this kind of task.
from bs4 import BeautifulSoup
import requests
import json
import pandas as pd
request = requests.get('https://www.zacks.com/zrank/sector-industry-classification.php')
soup = BeautifulSoup(request.text, 'lxml')
# Slice the script text down to the "data" array by hand. It is an ugly solution;
# I could not get it working with regular expressions, and there is surely a cleaner way.
data = soup.find_all("script")[16].text.split('data"')[1].strip()[3:].rstrip()[:-7]
json_data = json.loads('[' + data)
COLUMNS = ('Sector Group', 'Sector Code', 'Medium(M) Industry Group', 'Medium(M) Industry Code', 'Expanded(X) Industry Group', 'Expanded(X) Industry Code')

def get_title(row, key):
    # Each group cell is a <span>; its title attribute holds the untruncated name.
    return BeautifulSoup(row[key], 'lxml').find('span').attrs['title']

d = []
for row in json_data:
    d.append((
        get_title(row, 'Sector Group'),
        row['Sector Code'],
        get_title(row, 'Medium(M) Industry Group'),
        row['Medium(M) Industry Code'],
        get_title(row, 'Expanded(X) Industry Group'),
        row['Expanded(X) Industry Code'],
    ))
print(pd.DataFrame(d, columns=COLUMNS))
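The hand-tuned slicing above depends on exact character offsets, so it may break if the page changes slightly. A possibly less brittle sketch is to find the start of the `"data"` array with a regular expression and let `json.JSONDecoder().raw_decode` parse just that array, ignoring the rest of the script. Everything here is an assumption for illustration: `SCRIPT_TEXT` is a shortened stand-in for the real script, the key is assumed to appear as a quoted `"data"`, and `extract_rows` and `clean_cell` are hypothetical helper names. It uses only the standard library.

```python
import html
import json
import re

# Shortened stand-in for the <script> text shown in the question (assumed shape).
SCRIPT_TEXT = r'''window.app_data = {
  columns : [ { "mDataProp" : "Sector Group", "sTitle" : "Sector Group" } ],
  "data" : [ { "Sector Group" : "<span title=\"Consumer Staples\" >Consumer Staple...</span>", "Sector Code" : "1" } ,
             { "Sector Group" : "<span title=\"Index\" >Index</span>", "Sector Code" : "0" } ]
};'''

TITLE_RE = re.compile(r'title="([^"]*)"')


def clean_cell(value):
    """Return the span's title attribute if there is one, else the raw value."""
    match = TITLE_RE.search(value)
    return html.unescape(match.group(1)) if match else value


def extract_rows(script_text):
    """Parse the "data" array out of the script text into a list of dicts."""
    match = re.search(r'"data"\s*:\s*\[', script_text)
    if match is None:
        raise ValueError('no "data" array found in script text')
    # raw_decode parses one JSON value starting at the given index (the "[")
    # and ignores whatever follows it, so no manual trimming is needed.
    records, _end = json.JSONDecoder().raw_decode(script_text, match.end() - 1)
    return [{key: clean_cell(value) for key, value in record.items()}
            for record in records]


rows = extract_rows(SCRIPT_TEXT)
print(rows)
```

The resulting list of dicts can be passed straight to `pd.DataFrame(rows)`. With the real page, `script_text` would be the text of the script tag that contains `window.app_data`, which is safer to locate by its content than by a fixed index such as 16.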
Answer 1 (score: 0)
pandas can do this out of the box:
pd.read_html('https://www.zacks.com/zrank/sector-industry-classification.php')