How to replace a specific part of a string in Python

Time: 2012-08-03 16:29:46

Tags: python python-2.7 web-scraping

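I'm currently trying to scrape Good.is. The code so far gives me the regular image (if the if statement is switched to True), but I want the higher-resolution image. I'd like to know how to replace a certain piece of text so that I can download the high-resolution picture. I want to change the URL http://awesome.good.is/transparency/web/1207/invasion-of-the-drones/flash.html to http://awesome.good.is/transparency/web/1207/invasion-of-the-drones/flat.html (only the ending differs). My code is: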

import os, urllib, urllib2
from BeautifulSoup import BeautifulSoup
import HTMLParser

parser = HTMLParser.HTMLParser()

# make folder.
folderName = 'Good.is'
if not os.path.exists(folderName):
    os.makedirs(folderName)


list = [] 
# Python ranges start from the first argument and iterate up to one
# less than the second argument, so we need 36 + 1 = 37
for i in range(1, 37):
    list.append("http://www.good.is/infographics/page:" + str(i) + "/sort:recent/range:all")


# list holds 36 URLs, so the valid indices are 0..35
listIterator1 = range(len(list))
counter = 0


for x in listIterator1:


    soup = BeautifulSoup(urllib2.urlopen(list[x]).read())

    body = soup.findAll("ul", attrs = {'id': 'gallery_list_elements'})

    number = len(body[0].findAll("p"))
    listIterator = []
    listIterator[:] = range(0,number)        

    for i in listIterator:
        paragraphs = body[0].findAll("p")
        nextArticle = body[0].findAll("a")[2]
        text = body[0].findAll("p")[i]

        if len(paragraphs) > 0:
            #print image['src']
            counter += 1
            print counter
            print parser.unescape(text.getText())
            print "http://www.good.is" + nextArticle['href']
            originalArticle = "http://www.good.is" + nextArticle['href']
            article = BeautifulSoup(urllib2.urlopen(originalArticle).read())
            title = article.findAll("div", attrs = {'class': 'title_and_image'})
            getTitle = title[0].findAll("h1") 
            article1 = article.findAll("div", attrs = {'class': 'body'})
            articleImage = article1[0].find("p")
            betterImage = articleImage.find("a")
            articleImage1 = articleImage.find("img")
            paragraphsWithinSection = article1[0].findAll("p")
            print betterImage['href']
            if len(paragraphsWithinSection) > 1:
                articleText = article1[0].findAll("p")[1]
            else:
                articleText = article1[0].findAll("p")[0]
            print articleImage1['src']
            # getTitle is a list of <h1> tags, so unescape the text of the first one
            if getTitle:
                print parser.unescape(getTitle[0].getText())
            if articleText is not None:
                print parser.unescape(articleText.getText())
            print '\n'
            link = articleImage1['src']


            actually_download = False
            if actually_download:
                filename = link.split('/')[-1]
                urllib.urlretrieve(link, filename)

4 Answers:

Answer 0: (score: 3)

Take a look at str.replace. If that isn't enough for the job, you'll need regular expressions (the re module, probably re.sub).

>>> str1="http://awesome.good.is/transparency/web/1207/invasion-of-the-drones/flash.html"
>>> str1.replace("flash","flat")
'http://awesome.good.is/transparency/web/1207/invasion-of-the-drones/flat.html'
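
Because Python strings are immutable, str.replace returns a new string rather than modifying the original, so the result has to be assigned to a name. A minimal sketch of how this might slot into the question's code; `href` is an assumed stand-in for the value of betterImage['href'], and `high_res` is a name chosen here for illustration:

```python
# Stand-in for betterImage['href'] from the question's code (assumed value).
href = "http://awesome.good.is/transparency/web/1207/invasion-of-the-drones/flash.html"

# replace() returns a new string; href itself is left untouched.
high_res = href.replace("flash.html", "flat.html")

print(high_res)
```

Matching on "flash.html" rather than just "flash" makes an accidental match elsewhere in the URL a little less likely, although every occurrence is still replaced.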

Answer 1: (score: 1)

I think the safest and simplest approach is to use a regular expression:

import re
url = 'http://www.google.com/this/is/sample/url/flash.html'
newUrl = re.sub(r'flash\.html$', 'flat.html', url)

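The "$" means the pattern matches only at the end of the string. This solution behaves correctly even in the (admittedly unlikely) event that your URL contains the substring "flash.html" somewhere other than at the end, and it leaves the string unchanged if it does not end with 'flash.html' (which I think is the correct behavior).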

See: http://docs.python.org/library/re.html#re.sub
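
The anchored behavior described above can be checked with a few inputs. This is only a sketch: the helper name `flash_to_flat` and the example.com URLs are made up for illustration and do not come from the question.

```python
import re

def flash_to_flat(url):
    """Replace a trailing 'flash.html' with 'flat.html' (hypothetical helper)."""
    # The raw string keeps the backslash intact; $ anchors the match at the end.
    return re.sub(r'flash\.html$', 'flat.html', url)

# Ends with flash.html: the file name is swapped.
ends_with = flash_to_flat('http://example.com/a/flash.html')

# flash.html appears only in the middle: nothing is replaced.
in_middle = flash_to_flat('http://example.com/flash.html/index.html')

# Appears twice: only the occurrence at the end is replaced.
twice = flash_to_flat('http://example.com/flash.html/flash.html')

print(ends_with)
print(in_middle)
print(twice)
```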

Answer 2: (score: 0)

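@mgilson has a good solution, but the problem is that it replaces every occurrence of the string; so if "flash" appears elsewhere in the URL (and not just in the trailing file name), you will get multiple replacements: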

>>> s = 'hello there hello'
>>> s.replace('hello','world')
'world there world'

Another solution is to replace the last part after the final / with flat.html:

>>> url = 'http://www.google.com/this/is/sample/url/flash.html'
>>> url[:url.rfind('/')+1]+'flat.html'
'http://www.google.com/this/is/sample/url/flat.html'
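
The same idea can be wrapped in a small function using str.rpartition, which splits on the last '/' in one step. A sketch only: `replace_last_segment` is a hypothetical name, and it assumes the URL has no query string or fragment containing a '/'.

```python
def replace_last_segment(url, new_name):
    """Swap everything after the final '/' for new_name (hypothetical helper)."""
    # rpartition returns (text before the last '/', '/', text after it).
    head, sep, _old_name = url.rpartition('/')
    return head + sep + new_name

flat_url = replace_last_segment(
    'http://www.google.com/this/is/sample/url/flash.html', 'flat.html')
print(flat_url)
```

Unlike the regular-expression approach, this replaces the last segment whatever it is called, so it is only appropriate when the link is already known to point at flash.html.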

Answer 3: (score: 0)

Using urlparse you can do a few bits and pieces:

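from urlparse import urlsplit, urlunsplit

s = 'http://awesome.good.is/transparency/web/1207/invasion-of-the-drones/flash.html'

url = urlsplit(s)
# Split off the last path segment ('flash.html') and keep the directory part.
head, tail = url.path.rsplit('/', 1)
# Join by hand: urljoin(head, 'flat.html') would also drop the last directory,
# because head has no trailing slash.
new_path = head + '/flat.html'
print urlunsplit(url._replace(path=new_path))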
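
A shorter route, offered as a sketch rather than as this answer's method, is to resolve the relative reference 'flat.html' against the full URL with urljoin, which swaps the last path segment and leaves the directory alone. The try/except import is an assumption added here so the snippet runs on both Python 2 and Python 3:

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2

s = 'http://awesome.good.is/transparency/web/1207/invasion-of-the-drones/flash.html'

# A relative reference replaces the last path segment of the base URL.
flat = urljoin(s, 'flat.html')
print(flat)
```

Note that urljoin discards any query string or fragment on the base URL, which may or may not be what is wanted.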