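I have the following HTML excerpt, stored as a single string inside a Python list, that I want to convert into a Python dictionary. It is a timetable of opening hours for each day of the week.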
[u'
<table class="hours table">\n
<tbody>\n
<tr>\n
<th scope="row">Mon</th>\n
<td>\n <span class="nowrap">2:00 pm</span> - <span class="nowrap">3:00 pm</span>
<br><span class="nowrap">5:00 pm</span> - <span class="nowrap">10:00 pm</span>\n </td>\n
<td class="extra">\n </td>\n </tr>\n\n
<tr>\n
<th scope="row">Tue</th>\n
<td>\n <span class="nowrap">2:00 pm</span> - <span class="nowrap">3:00 pm</span>
<br><span class="nowrap">5:00 pm</span> - <span class="nowrap">10:00 pm</span>\n </td>\n
<td class="extra">\n </td>\n </tr>\n\n
<tr>\n
<th scope="row">Wed</th>\n
<td>\n <span class="nowrap">2:00 pm</span> - <span class="nowrap">3:00 pm</span>
<br><span class="nowrap">5:00 pm</span> - <span class="nowrap">10:00 pm</span>\n </td>\n
<td class="extra">\n <span class="nowrap open">Open now</span>\n </td>\n </tr>\n\n
<tr>\n
<th scope="row">Thu</th>\n
<td>\n <span class="nowrap">2:00 pm</span> - <span class="nowrap">3:00 pm</span>
<br><span class="nowrap">5:00 pm</span> - <span class="nowrap">10:00 pm</span>\n </td>\n
<td class="extra">\n </td>\n </tr>\n\n
<tr>\n
<th scope="row">Fri</th>\n
<td>\n <span class="nowrap">2:00 pm</span> - <span class="nowrap">3:00 pm</span>
<br><span class="nowrap">5:00 pm</span> - <span class="nowrap">10:00 pm</span>\n </td>\n
<td class="extra">\n </td>\n </tr>\n\n
<tr>\n
<th scope="row">Sat</th>\n
<td>\n <span class="nowrap">5:00 pm</span> - <span class="nowrap">10:00 pm</span>\n </td>\n
<td class="extra">\n </td>\n </tr>\n\n
<tr>\n
<th scope="row">Sun</th>\n
<td>\n Closed\n </td>\n
<td class="extra">\n </td>\n </tr>\n\n </tbody>\n </table>']
The desired output is:
{
'Mon': ['2:00pm - 3:00pm', '5:00pm - 10:00pm'],
'Tue': ['2:00pm - 3:00pm', '5:00pm - 10:00pm'],
'Wed': ['2:00pm - 3:00pm', '5:00pm - 10:00pm'],
'Thu': ['2:00pm - 3:00pm', '5:00pm - 10:00pm'],
'Fri': ['2:00pm - 3:00pm', '5:00pm - 10:00pm'],
'Sat': '5:00pm - 10:00pm',
'Sun': 'Closed'
}
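How would you achieve this in Python 3.x? I don't mind if the values for the 'Sat' and 'Sun' keys are lists as well, if that helps. Thanks in advance for your ideas.
Answer 0 (score: 3)
Here is a solution that first reads the table into a Pandas DataFrame and then converts it to a dictionary in the desired output format: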
import pandas as pd
dfs = pd.read_html(html_string)
df = dfs[0] # pd.read_html reads in all tables and returns a list of DataFrames
This gives:
     0                                      1         2
0  Mon  2:00 pm - 3:00 pm  5:00 pm - 10:00 pm       NaN
1  Tue  2:00 pm - 3:00 pm  5:00 pm - 10:00 pm       NaN
2  Wed  2:00 pm - 3:00 pm  5:00 pm - 10:00 pm  Open now
3  Thu  2:00 pm - 3:00 pm  5:00 pm - 10:00 pm       NaN
4  Fri  2:00 pm - 3:00 pm  5:00 pm - 10:00 pm       NaN
5  Sat                     5:00 pm - 10:00 pm       NaN
6  Sun                                 Closed       NaN
Then use groupby and a dictionary comprehension, splitting each hours cell on two consecutive spaces:
summary = {k: v.iloc[0, 1].split('  ') for k, v in df.groupby(0)}
This gives:
{'Fri': ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm'],
'Mon': ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm'],
'Sat': ['5:00 pm - 10:00 pm'],
'Sun': ['Closed'],
'Thu': ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm'],
'Tue': ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm'],
'Wed': ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm']}
If splitting on two spaces does not always work for the format of your opening-hours data, some adjustment may be needed.
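As one possible adjustment, the cell text could be split by matching the time ranges themselves rather than relying on the separator. This is only a sketch: the helper name `split_hours` and the pattern `TIME_RANGE` are made up for illustration, and the pattern assumes every range looks like `h:mm am/pm - h:mm am/pm`, as in the question's data. Cells with no such range (for example `Closed`) are returned as a one-item list.

```python
import re

# Hypothetical pattern: two clock times joined by a hyphen,
# e.g. "2:00 pm - 3:00 pm". Adjust it if your data uses another format.
TIME_RANGE = re.compile(
    r"\d{1,2}:\d{2}\s*[ap]m\s*-\s*\d{1,2}:\d{2}\s*[ap]m",
    re.IGNORECASE,
)


def split_hours(cell):
    """Return the time ranges found in a cell, or the stripped cell text if none match."""
    ranges = TIME_RANGE.findall(cell)
    return ranges if ranges else [cell.strip()]


print(split_hours("2:00 pm - 3:00 pm 5:00 pm - 10:00 pm"))
print(split_hours("Closed"))
```

With a helper like this, the comprehension could call `split_hours(v.iloc[0, 1])` instead of `.split(...)`, so the result no longer depends on how many spaces separate the ranges.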
Answer 1 (score: 2)
Use a library to parse the HTML, for example:
import pandas as panda
url = r'https://en.wikipedia.org/wiki/List_of_sovereign_states'
tables = panda.read_html(url)
sp500_table = tables[0] #Selecting the first table (for example)
Answer 2 (score: 2)
from bs4 import BeautifulSoup
from collections import OrderedDict
from pprint import pprint
soup = BeautifulSoup(data, 'lxml')
d = OrderedDict()
# Each row has two <td> cells; [::2] keeps only the first (the hours cell)
for th, td in zip(soup.select('th'), soup.select('td')[::2]):
    d[th.text.strip()] = td.text.strip().splitlines()
pprint(d)
Prints:
OrderedDict([('Mon', ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm']),
('Tue', ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm']),
('Wed', ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm']),
('Thu', ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm']),
('Fri', ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm']),
('Sat', ['5:00 pm - 10:00 pm']),
('Sun', ['Closed'])])
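The question's desired output writes the times as `2:00pm` rather than `2:00 pm`. If that exact form matters, a small post-processing step could remove the space before the am/pm marker. This is a sketch using only the standard library; the function name `compact_meridiem` and the sample `hours` dictionary are illustrative, not part of the answer above.

```python
import re


def compact_meridiem(text):
    """Remove whitespace directly before 'am' or 'pm' (e.g. '2:00 pm' -> '2:00pm')."""
    return re.sub(r"\s+(?=[ap]m\b)", "", text, flags=re.IGNORECASE)


# Illustrative input shaped like the dictionary printed above
hours = {
    "Mon": ["2:00 pm - 3:00 pm", "5:00 pm - 10:00 pm"],
    "Sun": ["Closed"],
}
compacted = {day: [compact_meridiem(s) for s in slots] for day, slots in hours.items()}
print(compacted)
```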
Answer 3 (score: 1)
from bs4 import BeautifulSoup
def tables(file):
    data = {}
    with open(file, "r") as f:
        soup = BeautifulSoup(f.read(), "html.parser")
    tables = soup.find_all('table')
    for key, value in enumerate(tables):
        # key is an int, so format it into the name instead of concatenating
        data[f"table_{key}"] = value
    return data
Answer 4 (score: 0)
Try the following approach:
from bs4 import BeautifulSoup as b
yourdict={e.strip("\n").split("\n\n")[0]:e.strip().strip("\n").split("\n\n")[1].split("\n") for e in b(a,"lxml").text.split("\n\n\n\n")}
Output:
{'Fri': ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm'],
'Mon': ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm'],
'Sat': ['5:00 pm - 10:00 pm'],
'Sun': [' Closed'],
'Thu': ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm'],
'Tue': ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm'],
'Wed': ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm']}
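The one-liner above depends on the exact runs of newlines in the extracted text, which is also why `' Closed'` keeps a leading space. As a rough alternative, the rows could be walked tag by tag with the standard library's `html.parser`. This is only a sketch, assuming the table layout from the question: one `<th>` per row, with the hours in the first `<td>` and `<br>` separating the ranges. The names `HoursParser`, `parse_hours`, and `sample_html` are made up for illustration.

```python
from html.parser import HTMLParser


class HoursParser(HTMLParser):
    """Collect {day: [hours lines]} from rows shaped like <tr><th>day</th><td>hours</td>...</tr>."""

    def __init__(self):
        super().__init__()
        self.hours = {}
        self._day = None       # text of the current row's <th>
        self._cell = None      # "th" or "td" while inside a cell, else None
        self._td_index = 0     # position of the current <td> within its row
        self._lines = [[]]     # text fragments of the current cell, one list per line

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._day, self._td_index = None, 0
        elif tag in ("th", "td"):
            self._cell, self._lines = tag, [[]]
        elif tag == "br" and self._cell == "td":
            self._lines.append([])  # <br> starts a new line within the cell

    def handle_data(self, data):
        if self._cell is not None:
            self._lines[-1].append(data)

    def handle_endtag(self, tag):
        if tag != self._cell:
            return
        # Join the fragments of each line and collapse runs of whitespace
        lines = [" ".join("".join(parts).split()) for parts in self._lines]
        lines = [line for line in lines if line]
        if tag == "th":
            self._day = lines[0] if lines else None
        else:
            # Only the first <td> of a row holds the hours; later cells are ignored
            if self._td_index == 0 and self._day is not None:
                self.hours[self._day] = lines
            self._td_index += 1
        self._cell = None


def parse_hours(html):
    parser = HoursParser()
    parser.feed(html)
    parser.close()
    return parser.hours


# Shortened sample in the same shape as the question's table
sample_html = """
<table class="hours table"><tbody>
<tr><th scope="row">Mon</th>
<td>
  <span class="nowrap">2:00 pm</span> - <span class="nowrap">3:00 pm</span>
  <br><span class="nowrap">5:00 pm</span> - <span class="nowrap">10:00 pm</span>
</td>
<td class="extra"> <span class="nowrap open">Open now</span> </td></tr>
<tr><th scope="row">Sun</th>
<td>
  Closed
</td>
<td class="extra"> </td></tr>
</tbody></table>
"""

print(parse_hours(sample_html))
```

Because the lines are built from the tags rather than from newline counts, every value comes out as a list of stripped strings, and the "Open now" cell is skipped.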