I have a website with the following markup that I want to scrape:
<span class="blk">Society/Project: <b>Sai Sparsh</b></span>
<i class="blk">
Built-up Area: <b>1005 Sq.Ft.</b>
@ <i class="WebRupeesmall b mr_5 f14">Rs.</i>6109/sq.ft</i>
I have already scraped some data with the following code:
properties = soup.findAll('a', title=re.compile('Bedroom'))
for eachproperty in properties:
    print today,","+"http:/"+ eachproperty['href']+",", eachproperty.string+"," +",".join(re.findall("'([a-zA-Z0-9,\s]*)'", eachproperty['onclick']))
My output is:
2013-09-05 ,http://Residential-Apartment-Flat-in-Velachery-Chennai South-3-Bedroom-bhk-for-Sale-spid-E10766779, 3 Bedroom, Residential Apartment in Velachery,E10766779,9952946340,,Dealer,Bala
So, for the HTML structure shown above, I am trying to strip it down to output like this:
Sai Sparsh, 1005 Sq.Ft, 6109/sq.ft
and append it to the output already generated (shown above). I have been struggling with navigating down the tree and using regex.
Update
Here is the code I tried:
cname = soup.findAll('span', {'class':'blk'})
pmoney = soup.findAll('i',{'class':'blk'})
for eachproperty in cname:
    for each in pmoney:
        tey = re.sub('(\s{2,})', ' ', eachproperty.text)[17:]
        ting = re.sub('([0-9,\s]*)', ' ', each.text)
    print tey + ting
My output is:
Rams Jai Vignesh Built-up Area: 1050 Sq.Ft. @ Rs.5524/sq.ft
Shrudhi Homes Built-up Area: 1050 Sq.Ft. @ Rs.5524/sq.ft
Ashtalakshmi Homes Built-up Area: 1050 Sq.Ft. @ Rs.5524/sq.ft
Raj Flats Built-up Area: 1050 Sq.Ft. @ Rs.5524/sq.ft
But I want my output without 'Built-up Area:', '@', and 'Rs'. So it should just be:
Rams Jai Vignesh ,1050 ,5524
Shrudhi Homes ,1050 , 5524
Answer 0 (score: 2)
Why not use the text attribute:
import re
from bs4 import BeautifulSoup as Soup
soup = Soup("""<span class="blk">Society/Project: <b>Sai Sparsh</b></span>
<i class="blk">
Built-up Area: <b>1005 Sq.Ft.</b>
@ <i class="WebRupeesmall b mr_5 f14">Rs.</i>6109/sq.ft</i>""")
print re.sub('(\s{2,})', ' ', soup.text)
This prints:
Society/Project: Sai Sparsh Built-up Area: 1005 Sq.Ft. @ Rs.6109/sq.ft
FYI, re.sub is used here to tidy up the string, since it contains runs of multiple whitespace characters.
UPD: here is your scraper script:
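To see what that re.sub call does in isolation, here is a minimal sketch that needs only the standard library. The sample string is an assumption standing in for what soup.text might return, and collapse_whitespace is a hypothetical helper name. Note that it uses the pattern \s+ rather than \s{2,}, so a single newline is also turned into a space.

```python
import re

# Assumed sample text, mimicking soup.text for the markup above (not fetched live).
raw = "Society/Project: Sai Sparsh\n\nBuilt-up Area: 1005 Sq.Ft.\n@ Rs.6109/sq.ft"

def collapse_whitespace(text):
    """Replace every run of whitespace with one space and trim both ends."""
    return re.sub(r'\s+', ' ', text).strip()

cleaned = collapse_whitespace(raw)
print(cleaned)
```

With \s{2,} only runs of two or more whitespace characters are replaced, so which pattern fits depends on how the page's whitespace is laid out.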
import re
import urllib2
from bs4 import BeautifulSoup as Soup

html = urllib2.urlopen("http://www.99acres.com/property-in-velachery-chennai-south-ffid").read()
soup = Soup(html)
re_digit = re.compile('(\d+)')

# Each listing lives in its own div, so the name and the area stay paired.
for div in soup.find_all('div', {'class': 'sT_disc grey'}):
    try:
        project = div.find('span').find('b').text.strip()
    except AttributeError:
        # find() returned None: this listing has no project name.
        project = 'No project'
    # Pull every run of digits (area, then price) out of the details text.
    area = re.findall(re_digit, div.find('i', {'class': 'blk'}).text.strip())
    print ", ".join([project] + area)
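The repeated area and price in the question's output come from nesting one loop inside the other, which combines every name with every details block. A minimal sketch of pairing them one-to-one follows. It runs on plain strings, so it needs no network access or bs4. The two lists are hypothetical sample data standing in for the .text of each span and i element, and to_row is an illustrative helper name.

```python
import re

# Hypothetical sample data standing in for the .text of each
# <span class="blk"> and <i class="blk"> element, in page order.
names = ["Society/Project: Sai Sparsh", "Society/Project: Raj Flats"]
details = ["Built-up Area: 1005 Sq.Ft. @ Rs.6109/sq.ft",
           "Built-up Area: 1050 Sq.Ft. @ Rs.5524/sq.ft"]

def to_row(name_text, detail_text):
    """Build 'project, area, price' from one name and its details text."""
    # Keep only what follows the 'Society/Project:' label.
    project = name_text.split(":", 1)[1].strip()
    # Every run of digits, in order: the area first, then the price.
    numbers = re.findall(r'\d+', detail_text)
    return ", ".join([project] + numbers)

# zip pairs the i-th name with the i-th details block, unlike nested loops.
rows = [to_row(n, d) for n, d in zip(names, details)]
for row in rows:
    print(row)
```

zip only lines up correctly when every listing has both elements. If some listings lack a project name, iterating over the enclosing div, as the script above does, is the safer choice.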