Modify scraped URLs and change their extension

Asked: 2017-12-02 10:50:17

Tags: python-3.x pdf web-scraping beautifulsoup html-parsing

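I am new to programming and am trying to download images and PDFs from a website. In the page source, the items I need are option tags whose values contain partial URLs. The site lists these items in a drop-down menu and displays them in an iframe, but each item can also be opened on its own page using its full URL.

So far, my code finds the options, appends each partial URL to the site's base address to build a full URL for every option, strips the trailing "/" from the .tif and .TIF URLs, and adds ".pdf" to them.

However, for the .tif and .TIF URLs I also need to change "convert" to "pdf" so that they open in a new page. Is there a way to do this only for the .tif.pdf and .TIF.pdf URLs while leaving the other URLs unchanged?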

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import os

my_url = 'http://example.com'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

page_soup = soup(page_html, "html.parser")

options = page_soup.findAll("select",{"id":"images"})[0].findAll("option")
values = [o.get("value") for o in options]

split_values = [i.split("|", 1)[0] for i in values]
# The option value is split to separate the url from its label
# <option value="/convert/ASRIMG/new/hop.TIF/|New Form"></option>

new_val = []
for val in split_values:
    ext = os.path.splitext(val.rstrip('/'))[-1]
    new_ext = ext
    if ext.lower() == '.tif':
        new_ext += '.pdf'
    new_val.append(val.rstrip('/').replace(ext, new_ext))

# Prefix each path with the base address to get the full URLs
image_urls = [my_url + val for val in new_val]

My current results:

print (new_val)

/ASRIMG/good.jpg
/ASRIMG/foo/bar1.jpg
/ASRIMG/foo/bar2.jpg
/ASRIMG/foo/bar3.jpg
/convert/ASRIMG/new/hop.TIF.pdf
/convert/REG/green1.tif.pdf
/convert/REG//green2.tif.pdf
/convert/SHIP/green3.tif.pdf
/convert/SHIP/green4.tif.pdf
/convert/SHIP/green5.tif.pdf
/SKETCHIMG/001.png
/SKETCH/002.JPG


print (image_urls)

http://example.com/ASRIMG/good.jpg
http://example.com/ASRIMG/foo/bar1.jpg
http://example.com/ASRIMG/foo/bar2.jpg
http://example.com/ASRIMG/foo/bar3.jpg
http://example.com/convert/ASRIMG/new/hop.TIF.pdf
http://example.com/convert/REG/green1.tif.pdf
http://example.com/convert/REG//green2.tif.pdf
http://example.com/convert/SHIP/green3.tif.pdf
http://example.com/convert/SHIP/green4.tif.pdf
http://example.com/convert/SHIP/green5.tif.pdf
http://example.com/SKETCHIMG/001.png
http://example.com/SKETCH/002.JPG

What I need:

http://example.com/ASRIMG/good.jpg
http://example.com/ASRIMG/foo/bar1.jpg
http://example.com/ASRIMG/foo/bar2.jpg
http://example.com/ASRIMG/foo/bar3.jpg
http://example.com/pdf/ASRIMG/new/hop.TIF.pdf
http://example.com/pdf/REG/green1.tif.pdf
http://example.com/pdf/REG//green2.tif.pdf
http://example.com/pdf/SHIP/green3.tif.pdf
http://example.com/pdf/SHIP/green4.tif.pdf
http://example.com/pdf/SHIP/green5.tif.pdf
http://example.com/SKETCHIMG/001.png
http://example.com/SKETCH/002.JPG

1 answer:

Answer 0 (score: 0)

After this step:

split_values = [i.split("|", 1)[0] for i in values]

this code handles both uppercase and lowercase tif extensions:

In [48]: import os

In [49]: split_values = ['/ASRIMG/good.jpg', '/convert/ASRIMG/new/hop.TIF/', 'SKETCHIMG/001.png']

In [50]: new_val = []

In [51]: for val in split_values:
    ...:     ext = os.path.splitext(val.rstrip('/'))[-1]
    ...:     new_ext = ext
    ...:     if ext.lower() == '.tif':
    ...:         new_ext += '.pdf'
    ...:     new_val.append(val.rstrip('/').replace(ext, new_ext))
    ...:
    ...:

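This strips the trailing "/" from the right of every value in the split_values list, then appends ".pdf" to each value whose extension is .tif or .TIF, so that it ends in .tif.pdf or .TIF.pdf. All other values are left unchanged.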
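
The snippet above does not yet change "convert" to "pdf". One possible way to do that only for the TIFF URLs is sketched below. It assumes every TIFF path begins with "/convert/", as in the sample output in the question; the helper name to_pdf_url and its base parameter are made up for illustration and are not part of the original code.

```python
import os


def to_pdf_url(path, base="http://example.com"):
    """Build the full URL for one option path (hypothetical helper).

    TIFF paths get ".pdf" appended and a leading "/convert/" swapped
    for "/pdf/". Every other path is only prefixed with the base URL.
    """
    path = path.rstrip("/")
    ext = os.path.splitext(path)[1]
    if ext.lower() == ".tif":
        path += ".pdf"
        # Assumption: TIFF paths start with "/convert/"; only that
        # leading segment is swapped, the rest of the path is untouched.
        if path.startswith("/convert/"):
            path = "/pdf/" + path[len("/convert/"):]
    return base + path


split_values = [
    "/ASRIMG/good.jpg",
    "/convert/ASRIMG/new/hop.TIF/",
    "/convert/REG//green2.tif/",
    "/SKETCH/002.JPG",
]

image_urls = [to_pdf_url(val) for val in split_values]
for url in image_urls:
    print(url)
```

Checking the prefix with startswith rather than calling replace("convert", "pdf") on the whole string keeps a later occurrence of "convert" in a file or folder name from being rewritten by accident.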