I'm parsing the following HTML with BeautifulSoup:
<div id="cpv_codes">
<span>
79000000 - Business services: law, marketing, consulting, recruitment, printing and security
<br/>
79632000 - Personnel-training services
<br/>
80000000 - Education and training services
<br/>
80511000 - Staff training services
<br/>
80530000 - Vocational training services
</span>
</div>
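I'm trying to convert the contents into a list so that I can put it into a CSV for normalizing later.
At the moment I'm using a very ugly process to hammer the data into shape, and I'd much rather write something more elegant. I'm sure that with better use of BS I could extract the data into a list in a single line. Can anyone help me clean up this code?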
categories = tender_soup.find('div',{"id":"cpv_codes"}).findNext('span')
categories = unicode(categories) # converts tag output to a string
categories = categories.split('<br/>') # converts string to an array
categories = [category.replace('<span>', '') for category in categories] # removes '<span>' from items
categories = [category.replace('</span>', '') for category in categories] # removes '</span>' from items
categories = filter(None, categories) # filters out any empty items in the array
Answer 0 (score: 2)
The NavigableString class helps here: keep only the children of the span that are text nodes, which skips the <br/> tags.
from bs4 import NavigableString
span = tender_soup.find('div',{"id":"cpv_codes"}).findNext('span')
categories = [c.strip() for c in span.contents if isinstance(c, NavigableString)]
Now you have the list:
[u'79000000 - Business services: law, marketing, consulting, recruitment, printing and security',
u'79632000 - Personnel-training services',
u'80000000 - Education and training services',
u'80511000 - Staff training services',
u'80530000 - Vocational training services']
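The same idea can be written even more compactly. This is a minimal sketch for Python 3 with bs4 that uses the `stripped_strings` generator instead of the `isinstance` check. The `html` string and the choice of `html.parser` are assumptions made so the example runs on its own; the `tender_soup` in the question may have been built differently.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question (assumed to be representative).
html = """
<div id="cpv_codes">
<span>
79000000 - Business services: law, marketing, consulting, recruitment, printing and security
<br/>
79632000 - Personnel-training services
<br/>
80000000 - Education and training services
<br/>
80511000 - Staff training services
<br/>
80530000 - Vocational training services
</span>
</div>
"""

tender_soup = BeautifulSoup(html, "html.parser")
span = tender_soup.find("div", id="cpv_codes").find("span")

# stripped_strings yields every text node with surrounding whitespace removed
# and skips nodes that are whitespace only, so the <br/> tags drop out.
categories = list(span.stripped_strings)

for category in categories:
    print(category)
```

Unlike iterating over `span.contents`, `stripped_strings` also descends into nested tags, so it would pick up text inside child elements if the span ever contained any.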
Answer 1 (score: 0)
You may find a regular expression useful here: split the text of the span on runs of two or more whitespace characters.
import re
categories = tender_soup.find('div',{"id":"cpv_codes"}).findNext('span')
categories = [itm for itm in re.split(r'\s{2,}', categories.text.strip()) if itm]  # strip first so the first and last items carry no stray newline
With your data, categories will look like this:
[u'79000000 - Business services: law, marketing, consulting, recruitment, printing and security',
u'79632000 - Personnel-training services',
u'80000000 - Education and training services',
u'80511000 - Staff training services',
u'80530000 - Vocational training services']
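Since the question mentions putting the result into a CSV, here is a rough sketch of that last step using only the standard library. The `split_category` helper, the column names `cpv_code` and `description`, and the in-memory buffer are all illustrative assumptions, not part of the original code; in practice you would probably write to a real file instead.

```python
import csv
import io

# A shortened copy of the extracted list, hard-coded so the example is standalone.
categories = [
    "79000000 - Business services: law, marketing, consulting, recruitment, printing and security",
    "79632000 - Personnel-training services",
    "80000000 - Education and training services",
]


def split_category(entry):
    """Split 'CODE - description' on the first ' - ' only (hypothetical helper)."""
    code, description = entry.split(" - ", 1)
    return code.strip(), description.strip()


rows = [split_category(entry) for entry in categories]

# Write to an in-memory buffer; swap in open("cpv.csv", "w", newline="") for a file.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["cpv_code", "description"])
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

Splitting with a maximum of one split keeps any later " - " inside the description intact, and `csv.writer` quotes the first description automatically because it contains commas.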