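Best way to extract image URLs with a Python regular expression

Time: 2018-09-02 12:51:18

Tags: python regex

How can I use a regular expression to extract only the xlarge URLs from the HTML below?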

'"xlarge":"https://i.ebayimg.com/00/s/NTU5WDEwMjY=/z/TQMAAOSwkrFaZqhh/$_20.PNG"},{"small":"https://i.ebayimg.com/00/s/ODAwWDgwMA==/z/uX0AAOSwvGlaZqhU/$_35.JPG","large":"https://i.ebayimg.com/00/s/ODAwWDgwMA==/z/uX0AAOSwvGlaZqhU/$_75.JPG","xlarge":"https://i.ebayimg.com/00/s/ODAwWDgwMA==/z/uX0AAOSwvGlaZqhU/$_20.JPG"},{"small":"https://i.ebayimg.com/00/s/NjMwWDk2MA==/z/n58AAOSwp-RaZqhn/$_35.PNG","large":"https://i.ebayimg.com/00/s/NjMwWDk2MA==/z/n58AAOSwp-RaZqhn/$_75.PNG","xlarge":"https://i.ebayimg.com/00/s/NjMwWDk2MA==/z/n58AAOSwp-RaZqhn/$_20.PNG"}],"needsPhoneOnReply":false,

As well as HTML like this:

  <ul class="gallery__main-viewer-list">
                        <li class="gallery__main-viewer-item">
                                <span data-responsive-image="{
                                 maxWidth: 815,
                                 maxHeight: 600,
                                 small: 'https://i.ebayimg.com/00/s/NjAwWDU2Mg==/z/H~QAAOSwpaZbhugU/$_20.JPG',
                                 medium: 'https://i.ebayimg.com/00/s/NjAwWDU2Mg==/z/H~QAAOSwpaZbhugU/$_75.JPG',
                                 large: 'https://i.ebayimg.com/00/s/NjAwWDU2Mg==/z/H~QAAOSwpaZbhugU/$_20.JPG' }" title="" class="gallery__img-wrap current" data-index="1"><img id="responsive-image-1535889965732" src="https://i.ebayimg.com/00/s/NjAwWDU2Mg==/z/H~QAAOSwpaZbhugU/$_20.JPG" alt=""></span>
                            </li>
                        <li class="gallery__main-viewer-item">
                                <span data-responsive-image="{
                                    defer: 'true',
                                    small: 'https://i.ebayimg.com/00/s/NjAwWDU2Mg==/z/gBgAAOSw8Ftbhugb/$_20.JPG',
                                    medium: 'https://i.ebayimg.com/00/s/NjAwWDU2Mg==/z/gBgAAOSw8Ftbhugb/$_20.JPG',
                                    large: 'https://i.ebayimg.com/00/s/NjAwWDU2Mg==/z/gBgAAOSw8Ftbhugb/$_20.JPG' }" title="Floor Mats For Toyota Corolla Zre152R/Zre153R (Sedans) May 2007 - Darra Brisbane South West image 2" class="gallery__img-wrap current" data-index="1">
                                    <noscript><img src="https://i.ebayimg.com/00/s/NjAwWDU2Mg==/z/gBgAAOSw8Ftbhugb/$_74.JPG" alt="Floor Mats For Toyota Corolla Zre152R/Zre153R (Sedans) May 2007 - Darra Brisbane South West image 2"></noscript>
                                    </span>
                            </li>

What is the best way to extract these xlarge URLs? Thanks.

2 answers:

Answer 0 (score: 2)

Use re.findall with a zero-width positive lookbehind:

re.findall(r'(?<="xlarge":")[^"]+', str_)
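  • (?<="xlarge":") is a zero-width positive lookbehind: it requires the literal text "xlarge":" immediately before the match without including it in the result. The match itself, [^"]+, is one or more characters that are not ", so it runs up to (but does not include) the next ".

Or with a capturing group: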

re.findall(r'"xlarge":"([^"]+)', str_)
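  • Same idea as above, but here [^"]+ sits in a capturing group instead of following a lookbehind; when the pattern contains a single capturing group, re.findall returns only the text of that group.

Example: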
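One practical difference between the two patterns: Python's re module only accepts fixed-width lookbehinds, so the lookbehind form cannot absorb optional whitespace after the colon, while the capturing-group form can. A minimal sketch, assuming a made-up input string with example.invalid placeholder URLs (neither comes from the question):

```python
import re

# Hypothetical input: the first entry has a space after the colon.
text = '"xlarge": "https://example.invalid/a.JPG", "xlarge":"https://example.invalid/b.JPG"'

# The lookbehind must match a fixed-width literal, so only the entry
# written exactly as "xlarge":" is found.
lookbehind_hits = re.findall(r'(?<="xlarge":")[^"]+', text)

# The capturing-group form can allow optional whitespace with \s*.
group_hits = re.findall(r'"xlarge":\s*"([^"]+)', text)

# A variable-width lookbehind is rejected when the pattern is compiled.
try:
    re.compile(r'(?<="xlarge":\s*")[^"]+')
    variable_width_ok = True
except re.error:
    variable_width_ok = False

print(lookbehind_hits)
print(group_hits)
print(variable_width_ok)
```

If the input is always formatted exactly like the question's sample, both patterns behave the same and the choice is a matter of taste.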

In [1507]: str_ = '"xlarge":"https://i.ebayimg.com/00/s/NTU5WDEwMjY=/z/TQMAAOSwkrFaZqhh/$_20.PNG"},{"small":"https://i.ebayimg.com/00/s/ODAwWDgwMA==/z/uX0AAOSwvGlaZqhU/$_35.JPG","large":"https://i.ebayimg.com/00/s/ODAwWDgwMA==/z/uX0AAOSwvGlaZqhU/$_75.JPG","xlarge":"https://i.ebayimg.com/00/s/ODAwWDgwMA==/z/uX0AAOSwvGlaZqhU/$_20.JPG"},{"small":"https://i.ebayimg.com/00/s/NjMwWDk2MA==/z/n58AAOSwp-RaZqhn/$_35.PNG","large":"https://i.ebayimg.com/00/s/NjMwWDk2MA==/z/n58AAOSwp-RaZqhn/$_75.PNG","xlarge":"https://i.ebayimg.com/00/s/NjMwWDk2MA==/z/n58AAOSwp-RaZqhn/$_20.PNG"}],"needsPhoneOnReply":false,'

In [1508]: re.findall(r'(?<="xlarge":")[^"]+', str_)
Out[1508]: 
['https://i.ebayimg.com/00/s/NTU5WDEwMjY=/z/TQMAAOSwkrFaZqhh/$_20.PNG',
 'https://i.ebayimg.com/00/s/ODAwWDgwMA==/z/uX0AAOSwvGlaZqhU/$_20.JPG',
 'https://i.ebayimg.com/00/s/NjMwWDk2MA==/z/n58AAOSwp-RaZqhn/$_20.PNG']

In [1509]: re.findall(r'"xlarge":"([^"]+)', str_)
Out[1509]: 
['https://i.ebayimg.com/00/s/NTU5WDEwMjY=/z/TQMAAOSwkrFaZqhh/$_20.PNG',
 'https://i.ebayimg.com/00/s/ODAwWDgwMA==/z/uX0AAOSwvGlaZqhU/$_20.JPG',
 'https://i.ebayimg.com/00/s/NjMwWDk2MA==/z/n58AAOSwp-RaZqhn/$_20.PNG']
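The patterns above assume the JSON-like form with double quotes. The question's second sample (the data-responsive-image attribute) uses unquoted keys and single-quoted values, and it contains no xlarge key at all, only small, medium and large. The following is a rough sketch of one way to cover both quoting styles; the helper name extract_urls, its key parameter, and the example.invalid URLs are assumptions made for illustration, not part of the original answer:

```python
import re

def extract_urls(text, key="xlarge"):
    """Return the values stored under `key`, in either quoting style."""
    pattern = re.compile(
        # (?<!\w) stops "large" from matching inside "xlarge";
        # the key may be quoted or bare, the value uses ' or ".
        rf'''(?<!\w)["']?{re.escape(key)}["']?\s*:\s*(?P<q>["'])(?P<url>.*?)(?P=q)'''
    )
    return [m.group("url") for m in pattern.finditer(text)]

# Hypothetical inputs mirroring the two shapes shown in the question.
json_like = '{"large":"https://example.invalid/a_75.JPG","xlarge":"https://example.invalid/a_20.JPG"}'
attr_like = "{ small: 'https://example.invalid/b_35.JPG',\n  large: 'https://example.invalid/b_20.JPG' }"

print(extract_urls(json_like))               # JSON-like form, default key
print(extract_urls(attr_like, key="large"))  # attribute form, explicit key
```

For the attribute form, an HTML parser that reads the data-responsive-image attribute first would likely be more robust than running a regex over the whole page; the regex here only illustrates the quoting difference.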

Answer 1 (score: 1)

You can use re.findall with the following regular expression to extract all xlarge URLs:

r'"xlarge":"(.*?)"'
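The ? after .* matters here: it makes the quantifier lazy, so the match stops at the first closing quote. A small sketch comparing the lazy pattern, the negated character class from the other answer, and a greedy variant; the sample string and its example.invalid URLs are invented for illustration:

```python
import re

# Hypothetical sample with two xlarge entries.
sample = ('"large":"https://example.invalid/a_75.JPG",'
          '"xlarge":"https://example.invalid/a_20.JPG"},'
          '{"xlarge":"https://example.invalid/b_20.PNG"}]')

# Lazy: stops at the first closing quote after each key.
lazy = re.findall(r'"xlarge":"(.*?)"', sample)

# Negated class: same result here, without relying on laziness.
negated = re.findall(r'"xlarge":"([^"]+)', sample)

# Greedy: runs on to the last quote in the string, so the two
# entries are merged into a single over-long match.
greedy = re.findall(r'"xlarge":"(.*)"', sample)

print(lazy)
print(negated)
print(greedy)
```

The two working patterns differ only at the edges: (.*?) would also accept an empty value such as "xlarge":"", while [^"]+ requires at least one character.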