<a href="javascript:popUp('http://www.abcd.com/calendar/event.php?calendar=1&category=&event=43221&date=2016-02-22','520','520');" onmouseout="javascript:hideEventDetailsBox();" onmouseover="javascript:eventDetailsBox('<b>Time:</b> 9:00\xa0AM-4:30\xa0PM<br /><b>Title:</b> Hello!<br /><b>Location:</b> Cultural World N Avenue <br /><b>Description:</b> abcdefghi');" style="font-family:Tahoma;font-size:small;color:#000000;">
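I want to use Beautiful Soup 4 to extract the fields (Time/Title/Description/Location) from the HTML above. I can't get at these values because they sit inside the "onmouseover" attribute. I tried the following: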
print(g_dataItem.contents[5].find_all(onmouseover=True))
for tag in g_dataItem.contents[5].find_all(onmouseover=True):
    print(tag['onmouseover'])
This gets me part of the way:
javascript:eventDetailsBox('Time: 9:00 AM-4:30 PM
Title: Hello!
Location: Cultural World N Avenue
Description: abcdefghi');
But once I have the string above, which is unicode, I can't extract the individual fields from it. Can anyone help?
Answer 0 (score: 0)
Try this:
from bs4 import BeautifulSoup
data = """<a href="javascript:popUp('http://www.abcd.com/calendar/event.php?calendar=1&category=&event=43221&date=2016-02-22','520','520');" onmouseout="javascript:hideEventDetailsBox();" onmouseover="javascript:eventDetailsBox('<b>Time:</b> 9:00\xa0AM-4:30\xa0PM<br /><b>Title:</b> Hello!<br /><b>Location:</b> Cultural World N Avenue <br /><b>Description:</b> abcdefghi');" style="font-family:Tahoma;font-size:small;color:#000000;">"""
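# Name the parser explicitly so bs4 does not have to guess one.
b = BeautifulSoup(data, 'html.parser')
# The handler looks like eventDetailsBox('<html>'); the text between the
# single quotes is the HTML fragment we want.
onmouseover = b.find_all('a')[0].get('onmouseover').split("'")[1]
b = BeautifulSoup(onmouseover, 'html.parser')
# Each <b> label is followed by a text node holding its value.
results = [{b_tag.text: b_tag.next_sibling.strip()} for b_tag in b.find_all('b')]
print(results)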
Result:
[
{u'Time:': u'9:00\xa0AM-4:30\xa0PM'},
{u'Title:': u'Hello!'},
{u'Location:': u'Cultural World N Avenue'},
{u'Description:': u'abcdefghi'}
]
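One weakness of splitting on the single quote is that a field value containing an apostrophe would cut the fragment short. A possible variation, sketched below for Python 3, takes everything between the first `('` and the last `')` with a regular expression and collects the fields into one dict. The names `SAMPLE` and `parse_event_details` are made up for this illustration, the href is shortened, and the code assumes the handler always has the `eventDetailsBox('...')` shape shown in the question.

```python
import re
from bs4 import BeautifulSoup

# Sample markup modeled on the question's anchor tag (href shortened; hypothetical).
SAMPLE = """<a href="#" onmouseover="javascript:eventDetailsBox('<b>Time:</b> 9:00\xa0AM-4:30\xa0PM<br /><b>Title:</b> Hello!<br /><b>Location:</b> Cultural World N Avenue <br /><b>Description:</b> abcdefghi');">link</a>"""


def parse_event_details(handler):
    """Turn an eventDetailsBox('...') handler string into a {label: value} dict."""
    # Greedy match: capture everything between the first (' and the last '),
    # so an apostrophe inside a field value does not truncate the fragment.
    match = re.search(r"\('(.*)'\)", handler, re.DOTALL)
    if match is None:
        return {}
    fragment = BeautifulSoup(match.group(1), "html.parser")
    details = {}
    for b_tag in fragment.find_all("b"):
        label = b_tag.get_text().strip().rstrip(":")
        value = b_tag.next_sibling
        # The value is the text node right after </b>; skip labels without one.
        if isinstance(value, str):
            # Replace non-breaking spaces with plain spaces before trimming.
            details[label] = value.replace("\xa0", " ").strip()
    return details


soup = BeautifulSoup(SAMPLE, "html.parser")
link = soup.find("a", onmouseover=True)
event = parse_event_details(link["onmouseover"])
print(event)
```

A single dict keyed by label (for example `event["Title"]`) is usually easier to work with than a list of one-item dicts, but that is a design preference rather than something the original answer requires.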