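Collecting text from the web pages of a given website

Date: 2016-09-17 07:30:27

Tags: python beautifulsoup google-search-api

There is a website that I visit often to read the "Best Advice". This is how I can easily extract the text I want...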

import urllib2
from bs4 import BeautifulSoup  

mylist=list()

myurl='http://www.apartmenttherapy.com/carols-east-side-cottage-house-tour-194787'
s=urllib2.urlopen(myurl)
soup =  BeautifulSoup(s)

hello = soup.find(text='Best Advice: ')
mylist.append(hello.next)

But how do I collect the text snippets from all the pages?

I can search all the pages with this simple Google query...

site:http://www.apartmenttherapy.com

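Does Google Search have an API that can be used from Python? I am looking for a simple, one-time solution to this problem, so I do not want to install too many packages for this task.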

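2 Answers:

Answer 0: (score: 1)

You can start by reading the BeautifulSoup manual and by learning to inspect network traffic with your browser's web developer tools.

Once that is done, you will see that the list of house tours can be fetched with a GET request to http://www.apartmenttherapy.com/search?page=1&q=House+Tour&type=all

Assume we can iterate from page 1 to page X to fetch every house index page.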
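The page iteration can be sketched as below. This is a minimal Python 3 illustration, not part of the original answer: the names `SEARCH_URL` and `index_page_urls` are hypothetical, and the query parameters are simply copied from the search URL above.

```python
from urllib.parse import urlencode

# Search endpoint taken from the URL quoted in the answer.
SEARCH_URL = "http://www.apartmenttherapy.com/search"


def index_page_urls(last_page, query="House Tour"):
    """Build the search-result URLs for pages 1..last_page (inclusive)."""
    urls = []
    for page in range(1, last_page + 1):
        # urlencode() escapes the space in the query as '+'.
        params = {"page": page, "q": query, "type": "all"}
        urls.append(SEARCH_URL + "?" + urlencode(params))
    return urls


for url in index_page_urls(3):
    print(url)
```

Note that `range()` excludes its upper bound, which is why the helper adds 1 to `last_page`.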

Each index page gives you just 15 URLs to add to a list.
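Collecting the links from one index page might look like the following Python 3 sketch. It assumes, as the answer's code does, that each house tour link is an `<a>` element with the class `SimpleTeaser`; the function name `extract_teaser_links` and the sample HTML are made up for illustration, and the real markup may differ.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "http://www.apartmenttherapy.com"


def extract_teaser_links(index_html, base_url=BASE_URL):
    """Return absolute URLs of all <a class="SimpleTeaser"> links in a page."""
    soup = BeautifulSoup(index_html, "html.parser")
    links = []
    for anchor in soup.find_all("a", class_="SimpleTeaser"):
        href = anchor.get("href")
        if href:
            # urljoin() leaves absolute URLs alone and resolves relative ones.
            links.append(urljoin(base_url, href))
    return links


# Hypothetical index-page fragment, used only to exercise the function.
sample = """
<div>
  <a class="SimpleTeaser" href="/house-tour-a-rustic-refined-ranch-house-227896">Ranch</a>
  <a class="Other" href="/not-a-teaser">Skip me</a>
  <a class="SimpleTeaser" href="http://www.apartmenttherapy.com/kates-house-tour-house-tour-197080">Kate</a>
</div>
"""
print(extract_teaser_links(sample))
```

Using `anchor.get("href")` instead of `anchor["href"]` avoids a `KeyError` if a teaser happens to have no `href` attribute.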

Once you have the complete list of URLs, you can scrape each one to get the "Best Advice" text on that page.
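Pulling the advice out of a single page could be sketched as follows in Python 3. The assumption, taken from the question's code, is that the page contains a text node reading "Best Advice: " and that the advice is the element immediately after it; `extract_best_advice` and the sample HTML are hypothetical.

```python
import re

from bs4 import BeautifulSoup, Tag

# Match the label loosely so trailing spaces or colons do not matter.
LABEL = re.compile(r"Best Advice")


def extract_best_advice(page_html):
    """Return the text following the 'Best Advice' label, or None if absent."""
    soup = BeautifulSoup(page_html, "html.parser")
    label = soup.find(string=LABEL)
    if label is None:
        return None  # this page has no 'Best Advice' label
    following = label.next_element
    if following is None:
        return None
    # The next element may be a plain string or a tag wrapping the text.
    text = following.get_text() if isinstance(following, Tag) else str(following)
    return text.strip() or None


# Hypothetical page fragment, used only to exercise the function.
sample_page = (
    "<p><strong>Best Advice: </strong>"
    "Contemporary architecture does not require contemporary furnishings.\n</p>"
)
print(extract_best_advice(sample_page))
print(extract_best_advice("<p>No advice on this page.</p>"))
```

Returning `None` for pages without the label makes the missing case explicit, so the caller can skip those pages without a `try`/`except`.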

The following code does the job:

import time
import requests
import random
from bs4 import BeautifulSoup  

# build the list of all URLs to scrape
url_list=[]
max_index=2  # range() excludes its upper bound, so this covers page 1 only

for page_index in range(1,max_index):

    # get the index page
    html=requests.get("http://www.apartmenttherapy.com/search?page="+str(page_index)+"&q=House+Tour&type=all").content

    # iterate over the teaser links
    for teaser in BeautifulSoup(html).findAll('a',{'class':'SimpleTeaser'}):

        # add the link to the URL list
        url_list.append(teaser['href'])

    # sleep a little to avoid overloading the server
    time.sleep(random.random()/2.)

    # break here because this is only an example (no need to scrape every index page)
    break # remove this break in production


#here we show list  
print url_list


# iterate over the URLs to get the advice
mylist=[]
for url in url_list:

    # get the house tour page
    html=requests.get(url).content

    # find the 'Best Advice: ' label
    hello = BeautifulSoup(html).find(text='Best Advice: ')

    # print a progress line for this URL
    print "advice for",url,"\n","=>",

    # try to add the text that follows the label to mylist
    try:
        mylist.append(hello.next)
    except AttributeError:
        # find() returned None: this page has no 'Best Advice: ' label
        pass

    # sleep a little to avoid overloading the server
    time.sleep(random.random()/2.)

# show the list of advice
print mylist

The output is:

['http://www.apartmenttherapy.com/house-tour-a-charming-comfy-california-cottage-228229', 'http://www.apartmenttherapy.com/christinas-olmay-oh-my-house-tour-house-tour-191725', 'http://www.apartmenttherapy.com/house-tour-a-rustic-refined-ranch-house-227896', 'http://www.apartmenttherapy.com/caseys-grown-up-playhouse-house-tour-215962', 'http://www.apartmenttherapy.com/allison-and-lukes-comfortable-and-eclectic-apartment-house-tour-193440', 'http://www.apartmenttherapy.com/melissas-eclectic-austin-bungalow-house-tour-206846', 'http://www.apartmenttherapy.com/kates-house-tour-house-tour-197080', 'http://www.apartmenttherapy.com/house-tour-a-1940s-art-deco-apartment-in-australia-230294', 'http://www.apartmenttherapy.com/house-tour-an-art-filled-mid-city-new-orleans-house-227667', 'http://www.apartmenttherapy.com/jeremys-light-and-heavy-home-house-tour-201203', 'http://www.apartmenttherapy.com/mikes-cabinet-of-curiosities-house-tour-201878', 'http://www.apartmenttherapy.com/house-tour-a-family-dream-home-in-illinois-227791', 'http://www.apartmenttherapy.com/stephanies-greenwhich-gemhouse-96295', 'http://www.apartmenttherapy.com/masha-and-colins-worldly-abode-house-tour-203518', 'http://www.apartmenttherapy.com/tims-desert-light-box-house-tour-196764']
advice for http://www.apartmenttherapy.com/house-tour-a-charming-comfy-california-cottage-228229 
=> advice for http://www.apartmenttherapy.com/christinas-olmay-oh-my-house-tour-house-tour-191725 
=> advice for http://www.apartmenttherapy.com/house-tour-a-rustic-refined-ranch-house-227896 
=> advice for http://www.apartmenttherapy.com/caseys-grown-up-playhouse-house-tour-215962 
=> advice for http://www.apartmenttherapy.com/allison-and-lukes-comfortable-and-eclectic-apartment-house-tour-193440 
=> advice for http://www.apartmenttherapy.com/melissas-eclectic-austin-bungalow-house-tour-206846 
=> advice for http://www.apartmenttherapy.com/kates-house-tour-house-tour-197080 
=> advice for http://www.apartmenttherapy.com/house-tour-a-1940s-art-deco-apartment-in-australia-230294 
=> advice for http://www.apartmenttherapy.com/house-tour-an-art-filled-mid-city-new-orleans-house-227667 
=> advice for http://www.apartmenttherapy.com/jeremys-light-and-heavy-home-house-tour-201203 
=> advice for http://www.apartmenttherapy.com/mikes-cabinet-of-curiosities-house-tour-201878 
=> advice for http://www.apartmenttherapy.com/house-tour-a-family-dream-home-in-illinois-227791 
=> advice for http://www.apartmenttherapy.com/stephanies-greenwhich-gemhouse-96295 
=> advice for http://www.apartmenttherapy.com/masha-and-colins-worldly-abode-house-tour-203518 
=> advice for http://www.apartmenttherapy.com/tims-desert-light-box-house-tour-196764 
=> [u"If you make a bad design choice or purchase, don't be afraid to change it. Try and try again until you love it.\n\t", u" Sisal rugs. They clean up easily and they're very understated. Start with very light colors and add colors later.\n", u"Bring in what you love, add dimension and texture to your walls. Decorate as an individual and not to please your neighbor or the masses. Trends are fun but I love elements of timeless interiors. Include things from any/every decade as well as mixing styles. I'm convinced it's the hardest way to decorate without looking like you are living in a flea market stall. Scale, color, texture, and contrast are what I focus on. For me it takes some toying around, and I always consider how one item affects the next. Consider space and let things stand out by limiting what surrounds them.", u'You don\u2019t need to invest in \u201cdecor\u201d and nothing needs to match. Just decorate with the special things (books, cards, trinkets, jars, etc.) that you\u2019ve collected over the years, and be organized. I honestly think half the battle of having good home design is keeping a neat house. The other half is just displaying stuff that is special to you. Stuff that has a story and/or reminds you of people, ideas, and places that you love. One more piece of advice - the best place to buy picture frames is Goodwill. Pick a frame in decent condition, and just paint it to complement your palette. One last piece of advice\u2014 decor need not be pricey. I ALWAYS shop consignment and thrift, and then I repaint and customize as I see fit.\n', u'From my sister \u2014 to use the second bedroom as my room, as it is dark and quiet, both of which I need in order to sleep.\n', u'Collect things that you love in your travels throughout life. 
I tend to purchase ceramics when travelling, sometimes a collection of bowls\u2026 not so easy transporting in the suitcase, but no breakages yet!\n\t', u'Keep things authentic to the character of your home and to the character of your family. Then, you can never go wrong!\n\t', u'Contemporary architecture does not require contemporary furnishings.\n']
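Both loops in the code above share the same pattern: fetch a page, then sleep for a random interval of up to half a second. A hedged Python 3 sketch of that pattern as a reusable helper is shown below. The name `fetch_all` and its parameters are hypothetical, and the fetching function is passed in (for example `lambda url: requests.get(url).content`) so the helper can be tried without network access.

```python
import random
import time


def fetch_all(urls, fetch, max_delay=0.5, sleep=time.sleep):
    """Fetch each URL with `fetch`, pausing a random interval after each one.

    Returns (pages, failures): a dict mapping url -> body, and a list of
    (url, exception) pairs for the URLs that could not be fetched.
    """
    pages = {}
    failures = []
    for url in urls:
        try:
            pages[url] = fetch(url)
        except Exception as exc:
            # Record the failure and carry on with the remaining URLs.
            failures.append((url, exc))
        # Random pause in [0, max_delay] to limit load on the server.
        sleep(random.uniform(0, max_delay))
    return pages, failures


# Stand-in fetcher for demonstration; it fails for one URL on purpose.
def fake_fetch(url):
    if url.endswith("/broken"):
        raise IOError("could not fetch " + url)
    return "<html>" + url + "</html>"


pages, failures = fetch_all(
    ["http://example.com/a", "http://example.com/broken"],
    fake_fetch,
    sleep=lambda seconds: None,  # skip the real pause in this demo
)
print(sorted(pages))
print([url for url, _ in failures])
```

Recording failures instead of silently passing makes it easier to see which pages were skipped.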

Answer 1: (score: 0)

You have to use JS-enabled scraping, as explained here: http://koaning.io/dynamic-scraping-with-python.html