Extracting values from an HTML attribute

Date: 2016-06-23 01:04:02

Tags: javascript python web-scraping beautifulsoup

    <a href="javascript:popUp('http://www.abcd.com/calendar/event.php?calendar=1&amp;category=&amp;event=43221&amp;date=2016-02-22','520','520');" onmouseout="javascript:hideEventDetailsBox();" onmouseover="javascript:eventDetailsBox('&lt;b&gt;Time:&lt;/b&gt; 9:00\xa0AM-4:30\xa0PM&lt;br /&gt;&lt;b&gt;Title:&lt;/b&gt; Hello!&lt;br /&gt;&lt;b&gt;Location:&lt;/b&gt; Cultural World N Avenue &lt;br /&gt;&lt;b&gt;Description:&lt;/b&gt; abcdefghi');" style="font-family:Tahoma;font-size:small;color:#000000;">

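I want to use Beautiful Soup 4 to extract the fields (Time / Title / Description / Location) from the HTML above. I can't get at these fields because they are embedded in the `onmouseover` attribute. I tried the following: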

print(g_dataItem.contents[5].find_all(onmouseover=True))
for tag in g_dataItem.contents[5].find_all(onmouseover=True):
    print(tag['onmouseover'])

This gets me the relevant section:

javascript:eventDetailsBox('Time: 9:00 AM-4:30 PM
Title: Hello!
Location: Cultural World N Avenue
Description: abcdefghi');

But once I have the string above, which is unicode, I can't extract the individual fields from it. Can anyone help?

1 Answer:

Answer 0: (score: 0)

Try this:

from bs4 import BeautifulSoup

data = """<a href="javascript:popUp('http://www.abcd.com/calendar/event.php?calendar=1&amp;category=&amp;event=43221&amp;date=2016-02-22','520','520');" onmouseout="javascript:hideEventDetailsBox();" onmouseover="javascript:eventDetailsBox('&lt;b&gt;Time:&lt;/b&gt; 9:00\xa0AM-4:30\xa0PM&lt;br /&gt;&lt;b&gt;Title:&lt;/b&gt; Hello!&lt;br /&gt;&lt;b&gt;Location:&lt;/b&gt; Cultural World N Avenue &lt;br /&gt;&lt;b&gt;Description:&lt;/b&gt; abcdefghi');" style="font-family:Tahoma;font-size:small;color:#000000;">"""

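# Name the parser explicitly so Beautiful Soup does not have to guess one
b = BeautifulSoup(data, "html.parser")
# The attribute value is already entity-decoded; keep the text between the first pair of single quotes
onmouseover = b.find_all('a')[0].get('onmouseover').split("'")[1]

# Parse the embedded HTML fragment and pair each <b> label with the text that follows it
b = BeautifulSoup(onmouseover, "html.parser")
results = [{b_tag.text: b_tag.next_sibling.strip()} for b_tag in b.find_all('b')]
print(results)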

Results:

[
    {u'Time:': u'9:00\xa0AM-4:30\xa0PM'},
    {u'Title:': u'Hello!'},
    {u'Location:': u'Cultural World N Avenue'},
    {u'Description:': u'abcdefghi'}
]
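
If a single dictionary is more convenient than a list of one-item dictionaries, the same idea can be sketched with only the standard library, working directly on the string that `tag['onmouseover']` returns. This is a rough Python 3 sketch, not part of the answer above: the helper name `parse_event_details` is made up for illustration, and it assumes every field has the shape `<b>Label:</b> value` with fields separated by `<br />`, as in the sample markup.

```python
import re

def parse_event_details(attr_value):
    """Turn an eventDetailsBox('...') attribute value into a dict of fields.

    Hypothetical helper; assumes each field looks like <b>Label:</b> value.
    """
    # Keep everything between the first and the last single quote, so an
    # apostrophe inside the description does not cut the payload short.
    start = attr_value.index("'") + 1
    end = attr_value.rindex("'")
    payload = attr_value[start:end]

    fields = {}
    # Capture the label inside <b>...</b> and the text up to the next <br /> (or the end).
    pattern = r"<b>(.*?):</b>(.*?)(?=<br\s*/?>|$)"
    for label, value in re.findall(pattern, payload, flags=re.S):
        # Replace non-breaking spaces with plain spaces and trim the edges.
        fields[label.strip()] = value.replace("\xa0", " ").strip()
    return fields

# The attribute value as Beautiful Soup would hand it back (entities already decoded).
attr = ("javascript:eventDetailsBox('<b>Time:</b> 9:00\xa0AM-4:30\xa0PM<br />"
        "<b>Title:</b> Hello!<br /><b>Location:</b> Cultural World N Avenue <br />"
        "<b>Description:</b> abcdefghi');")

details = parse_event_details(attr)
print(details["Time"])
print(details["Location"])
```

Taking the text between the first and last quote, rather than `split("'")[1]`, is a deliberate choice here: `split` would truncate the payload at the first apostrophe that appears inside a title or description.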