Question

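I'm currently trying to scrape Good.is. The code so far gives me the regular image (when the if statement is switched to True), but I want the higher-resolution image. I'd like to know how to replace a piece of text so that I can download the high-resolution picture. I want to change the URL http://awesome.good.is/transparency/web/1207/invasion-of-the-drones/flash.html to http://awesome.good.is/transparency/web/1207/invasion-of-the-drones/flat.html (only the ending differs). My code is: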

import os, urllib, urllib2
from BeautifulSoup import BeautifulSoup
import HTMLParser

parser = HTMLParser.HTMLParser()

# make folder.
folderName = 'Good.is'
if not os.path.exists(folderName):
  os.makedirs(folderName)


list = [] 
# Python ranges start from the first argument and iterate up to one
# less than the second argument, so we need 36 + 1 = 37
for i in range(1, 37):
    list.append("http://www.good.is/infographics/page:" + str(i) + "/sort:recent/range:all")


# One index per page URL built above (0..35); range(0,37) would run past the end of the list
listIterator1 = range(len(list))
counter = 0


for x in listIterator1:


    soup = BeautifulSoup(urllib2.urlopen(list[x]).read())

    body = soup.findAll("ul", attrs = {'id': 'gallery_list_elements'})

    number = len(body[0].findAll("p"))
    listIterator = []
    listIterator[:] = range(0,number)        

    for i in listIterator:
        paragraphs = body[0].findAll("p")
        nextArticle = body[0].findAll("a")[2]
        text = body[0].findAll("p")[i]

        if len(paragraphs) > 0:
            #print image['src']
            counter += 1
            print counter
            print parser.unescape(text.getText())
            print "http://www.good.is" + nextArticle['href']
            originalArticle = "http://www.good.is" + nextArticle['href']
            article = BeautifulSoup(urllib2.urlopen(originalArticle).read())
            title = article.findAll("div", attrs = {'class': 'title_and_image'})
            getTitle = title[0].findAll("h1") 
            article1 = article.findAll("div", attrs = {'class': 'body'})
            articleImage = article1[0].find("p")
            betterImage = articleImage.find("a")
            articleImage1 = articleImage.find("img")
            paragraphsWithinSection = article1[0].findAll("p")
            print betterImage['href']
            if len(paragraphsWithinSection) > 1:
                articleText = article1[0].findAll("p")[1]
            else:
                articleText = article1[0].findAll("p")[0]
            print articleImage1['src']
            print parser.unescape(getTitle)
            if not articleText is None:
                print parser.unescape(articleText.getText())
            print '\n'
            link = articleImage1['src']
            x += 1


            actually_download = False
            if actually_download:
                filename = link.split('/')[-1]
                urllib.urlretrieve(link, filename)

Answer 1

Take a look at str.replace. If that isn't enough to get the job done, you'll need regular expressions (the re module, probably re.sub).

>>> str1="http://awesome.good.is/transparency/web/1207/invasion-of-the-drones/flash.html"
>>> str1.replace("flash","flat")
'http://awesome.good.is/transparency/web/1207/invasion-of-the-drones/flat.html'
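
If a plain string method is preferred but only the ending should change, a guarded suffix swap is one option. This is a minimal sketch written for Python 3; the helper name `swap_suffix` and its defaults are made up for illustration and are not part of the question's code.

```python
def swap_suffix(url, old="flash.html", new="flat.html"):
    """Return url with a trailing `old` replaced by `new`; otherwise return it unchanged."""
    if url.endswith(old):
        # Cut off the old ending and append the new one
        return url[:len(url) - len(old)] + new
    return url


print(swap_suffix("http://awesome.good.is/transparency/web/1207/invasion-of-the-drones/flash.html"))
```

Unlike a bare `str.replace`, this leaves any earlier "flash" in the URL alone and returns URLs with a different ending untouched.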

Answer 2

I think the safest and simplest approach is a regular expression:

import re
url = 'http://www.google.com/this/is/sample/url/flash.html'
newUrl = re.sub(r'flash\.html$', 'flat.html', url)  # raw string keeps the \. escape literal

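The "$" means the pattern matches only at the end of the string. This solution still behaves correctly in the (admittedly unlikely) event that your URL contains the substring "flash.html" somewhere other than the end, and it leaves the string unchanged if it does not end with 'flash.html' (which I think is the correct behavior).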

See: http://docs.python.org/library/re.html#re.sub
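
To check the two claims above (an earlier "flash.html" is left alone, and a non-matching URL is returned unchanged), here is a small Python 3 sketch. The function name `high_res_url` and the example.com URLs are hypothetical; in the question's scraper the input would presumably be the value read from `betterImage['href']`.

```python
import re

# Anchored pattern: only a "flash.html" at the very end of the string matches
FLASH_SUFFIX = re.compile(r"flash\.html$")


def high_res_url(href):
    """Swap a trailing flash.html for flat.html; return other hrefs unchanged."""
    return FLASH_SUFFIX.sub("flat.html", href)


for href in [
    "http://awesome.good.is/transparency/web/1207/invasion-of-the-drones/flash.html",
    "http://example.com/flash.html/archive/flash.html",  # made-up URL with an earlier match
    "http://example.com/picture.jpg",                    # made-up URL that does not match
]:
    print(high_res_url(href))
```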

Answer 3

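@mgilson has a good solution, but the problem is that it replaces every occurrence of the string; so if "flash" appears elsewhere in the URL (and not just in the trailing filename), you will get multiple replacements: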

>>> str = 'hello there hello'
>>> str.replace('hello','world')
'world there world'

Another solution is to replace the last part, everything after the final /, with flat.html:

>>> url = 'http://www.google.com/this/is/sample/url/flash.html'
>>> url[:url.rfind('/')+1]+'flat.html'
'http://www.google.com/this/is/sample/url/flat.html'

Answer 4

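Using urlparse, you can pick the URL apart and swap out just the piece you need:

from urlparse import urlsplit, urlunsplit

s = 'http://awesome.good.is/transparency/web/1207/invasion-of-the-drones/flash.html'

url = urlsplit(s)
# Split off the last path segment; head keeps the directory without a trailing slash
head, tail = url.path.rsplit('/', 1)
# Join by hand: urljoin(head, 'flat.html') would also drop the last directory of head
new_path = head + '/flat.html'
print urlunsplit(url._replace(path=new_path))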
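
One thing the URL-parsing route offers over plain string slicing is that a query string survives the change. The following is a sketch for Python 3, where the same functions live in `urllib.parse`; the helper name `replace_filename` and the `?ref=home` query are assumptions added purely for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def replace_filename(url, new_name="flat.html"):
    """Replace the last path segment of url, keeping scheme, host, query and fragment."""
    parts = urlsplit(url)
    # rpartition returns (directory, "/", old filename); only the directory is reused
    head, _, _old_name = parts.path.rpartition("/")
    return urlunsplit(parts._replace(path=head + "/" + new_name))


# The query string here is hypothetical; it only shows that it is preserved
print(replace_filename(
    "http://awesome.good.is/transparency/web/1207/invasion-of-the-drones/flash.html?ref=home"
))
```

Like the rfind approach, this replaces whatever the last segment is, so it does not check that the filename was flash.html to begin with.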

How to replace a specific part of a string in Python

4 answers: