I am scraping Google search result pages that are stored on my own server. I use the following code to scrape the page.
from string import punctuation, whitespace
import urllib2
import datetime
import re
from bs4 import BeautifulSoup as Soup
import csv
today = datetime.date.today()
html = urllib2.urlopen("http://192.168.1.200/coimbatore/3BHK_flats_inCoimbatore.html_%94201308110608%94.html").read()
soup = Soup(html)
p = re.compile(r'<.*?>')
aslink = soup.findAll('span',attrs={'class':'ac'})
for li in soup.findAll('li', attrs={'class':'g'}):
    sLink = li.find('a')
    sSpan = li.find('span', attrs={'class':'st'})
    print sLink['href'][7:] , "," + p.sub('', str(sSpan)).replace('.','')
print p.sub('', str(aslink)).replace('.','\n')
The problem is that I get these square brackets in the output:
[No Pre EMI & Booking Amount Buy Now , Get Best Deals On 1/2/3 BHK Flats! Over 50000+ New Flats for Sale, Starts @ 3000 per Sqft
Enquire Us
, Your Dream Villa For SaleIn Coimbatore
Book a Visit!, Luxurious Properties In CoimbatoreBy Renowned Builder
Booking Open!, Finest 2BHK Flats at its Best PriceAvailable @ Rs
2500/sqft Visit Now!, Properties for every budgetAnd location
Explore Now!, Looking a 3BHK Flat In Alagapuram?Best Deal, Area 1598SqFt Book Now, Find 3 BHK Flats/Apts in Chennai
Over 200000 Properties
Search Now!, Buy Flats With Finest Amenities InCoimbatore
Elegant Club House
, 100% free classifieds
Apartmentsfor sale/rent on OLX
Find it now!]
This output is generated by the line print p.sub('', str(aslink)).replace('.','\n'). I want to know why these square brackets appear, and I would also like to remove them.
Update
Here is my page: http://jigar.zapto.org/coimbatore/3BHK_flats_inCoimbatore.html_%94201308110608%94.html
Answer 0 (score: 2)
findAll() returns a list. If you only want one element, use .find() instead, which returns the first result:
aslink = soup.find('span',attrs={'class':'ac'})
The square brackets are the result of calling str() on the list object. Alternatively, use an index to get one element:
print p.sub('', str(aslink[0])).replace('.','\n')
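
As a minimal illustration of where the brackets and commas come from (plain Python 2, not specific to BeautifulSoup; the items list here is only an example):

items = ['first snippet', 'second snippet']
print str(items)       # prints ['first snippet', 'second snippet'] - list syntax included
print str(items[0])    # prints first snippet - a single element, no brackets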
Or loop over the aslink elements.
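
For example, a sketch of such a loop, reusing the p pattern and the aslink list already defined in the question's code:

for span in aslink:
    # strip the markup from each matched span individually
    print p.sub('', str(span)).replace('.', '\n')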
However, it looks like you want to extract all the text from the span elements. Instead of using a regular expression, just ask BeautifulSoup for the text content:
for l in aslink:
    print ' '.join(l.stripped_strings)
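
If you prefer, get_text() gives a similar result; a sketch using the same aslink list:

for l in aslink:
    # join all nested text nodes with a space and trim surrounding whitespace
    print l.get_text(' ', strip=True)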