Extracting text from a sub-section of a site with Python

Date: 2014-08-26 06:59:25

Tags: python html tags beautifulsoup web-crawler

I have been trying to extract the text between the li tags. Here is part of the HTML page source:

<div class="item_desc_text">
    <ul class="fk-key-features">
      <li>1.2 GHz Qualcomm Snapdragon 400 Quad Core Processor and 1 GB RAM</li><li>Android v4.4 (KitKat) OS</li>
      <li>Wi-Fi Enabled</li><li>8 GB Internal Memory</li><li>Dual SIM (GSM + GSM)</li><li>HD Recording</li>
      <li>5 MP Primary Camera and 1.3 MP Secondary Camera</li><li>4.5-inch HD Display</li>
    </ul>
</div>

I am using the following code to extract the output:

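import re
import urllib2  # Python 2 standard library; this import was missing

import bs4

suburl = "http://www.flipkart.com/moto-g/p/itmdsmbxcrm9wy8r?pid=MOBDSGU2ZMDYENQA&icmpid=reco_pp_hSame_mobile_1"
subhtml = urllib2.urlopen(suburl).read()
# Remove every run of two or more whitespace characters
subhtml = re.sub(r'\s\s+', '', subhtml)
subsoup = bs4.BeautifulSoup(subhtml)
# Name is not defined in this excerpt; presumably it is set earlier in the script
print "Key features of " + Name.get_text()
# Prints each matching div together with all of its markup
for res2 in subsoup.findAll('div', attrs={'class': 'item_desc_text'}):
    print res2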

What should I change to get only the text of each li item?

1 answer:

Answer 0: (score: 0)

Here is one way to do it:

>>> from bs4 import BeautifulSoup as bs
>>> data = '''
... <div class="item_desc_text">
...     <ul class="fk-key-features">
...       <li>1.2 GHz Qualcomm Snapdragon 400 Quad Core Processor and 1 GB RAM</li><li>Android v4.4 (KitKat) OS</li>
...       <li>Wi-Fi Enabled</li><li>8 GB Internal Memory</li><li>Dual SIM (GSM + GSM)</li><li>HD Recording</li>
...       <li>5 MP Primary Camera and 1.3 MP Secondary Camera</li><li>4.5-inch HD Display</li>
...     </ul>
... </div>
... '''
>>> soup = bs(data, 'html.parser')  # name the parser explicitly
>>> ul = soup.find('ul', attrs={'class':'fk-key-features'})
>>> for item in ul.find_all('li'):
...     print item.get_text().strip()
...
1.2 GHz Qualcomm Snapdragon 400 Quad Core Processor and 1 GB RAM
Android v4.4 (KitKat) OS
Wi-Fi Enabled
8 GB Internal Memory
Dual SIM (GSM + GSM)
HD Recording
5 MP Primary Camera and 1.3 MP Secondary Camera
4.5-inch HD Display
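
For readers on Python 3, the same idea can be sketched as a small function. This is only an illustration under a few assumptions: the function name extract_key_features and the SAMPLE_HTML constant are made up for this example, the sample is a shortened copy of the markup from the question, and the CSS selector passed to select() is swapped in for the find()/find_all() pair used above.

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup from the question (illustrative only).
SAMPLE_HTML = """
<div class="item_desc_text">
    <ul class="fk-key-features">
      <li>1.2 GHz Qualcomm Snapdragon 400 Quad Core Processor and 1 GB RAM</li><li>Android v4.4 (KitKat) OS</li>
      <li>Wi-Fi Enabled</li><li>8 GB Internal Memory</li>
    </ul>
</div>
"""


def extract_key_features(html):
    """Return the stripped text of each <li> inside ul.fk-key-features.

    Hypothetical helper for this example; not part of Beautiful Soup.
    """
    soup = BeautifulSoup(html, "html.parser")
    # Select every <li> that is a descendant of <ul class="fk-key-features">.
    items = soup.select("ul.fk-key-features li")
    # get_text() drops the tags; strip() trims surrounding whitespace.
    return [li.get_text().strip() for li in items]


if __name__ == "__main__":
    for feature in extract_key_features(SAMPLE_HTML):
        print(feature)
```

One design note: select() returns an empty list when nothing matches, so this sketch returns [] for a page without the list, whereas soup.find() returns None in that case and the following ul.find_all() call would raise an AttributeError. To use it on a live page, the HTML string would first have to be downloaded, for example with urllib.request.urlopen in Python 3.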