BeautifulSoup find_all excluding certain text

Asked: 2017-10-16 04:17:59

Tags: python python-2.7 web-scraping beautifulsoup

I am trying to scrape some pages on a website. Here is a sample of the HTML:

<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<link href="/static/favicon-f8a3a024b0.ico" rel="shortcut icon"/>
<link href="/opensearch_ggs.xml" rel="search" title="WEBSITE anime GG" type="application/opensearchdescription+xml"/>
<link href="/opensearch_ggs2.xml" rel="search" title="WEBSITE music GG" type="application/opensearchdescription+xml"/>
<link href="/opensearch_artists.xml" rel="search" title="WEBSITE artists" type="application/opensearchdescription+xml"/>
<link href="/opensearch_requests.xml" rel="search" title="WEBSITE requests" type="application/opensearchdescription+xml"/>
<link href="/opensearch_forums.xml" rel="search" title="WEBSITE forums" type="application/opensearchdescription+xml"/>
<link href="/opensearch_users.xml" rel="search" title="WEBSITE users" type="application/opensearchdescription+xml"/>
<link href="/feed/rss_ggs_all/GOODSTUFF" rel="alternate" title="WEBSITE - All GG" type="application/rss+xml"/>
<link href="/feed/rss_ggs_anime/GOODSTUFF" rel="alternate" title="WEBSITE - Anime GG" type="application/rss+xml"/>
<span class="download_link">[<a href="https://WEBSITE.tv/GG/223197/download/GOODSTUFF" title="Download">DL</a>]</span>  
<span class="download_link">[<a href="https://WEBSITE.tv/GG/223197/download/GOODSTUFF" title="Download">DL</a>]</span>  

Here is the code I am working with:

        for x in range(pages):
                pagen += 1
                url3 = url2[:40] + str(pagen) + url2[41:]
                print "url3 = ", url3
                ggs = br.open(url3)
                #print "ggs = ", ggs.read()
                soup = BeautifulSoup(ggs, "lxml")
                print "soup = ", soup
                trueurl = 'https://WEBSITE.tv'
                #print trueurl

                # Finds the gg links

                download = soup.find_all(href=re.compile("GOODSTUFF"))
#               print "download = ", download
                #print 'download'
                # For-loop to download the ggs

                for link in download:
                        sleep(10)
                        print 'loop'
                        gglink = link.get('href')
                        gglink = trueurl + gglink
                        print gglink
                        hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
                                'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'}
                        req = urllib2.Request(gglink, headers=hdr)
                        print req
                        res_init()
                        res = urllib2.urlopen(req)
                        #print res
                        directory = "/home/cyber/yen/" # gg directory, change as you please.
                        file += 1
                        print "Page", pagen, "of", pageout, ".....", file, 'ggs downloaded'
                        urllib.urlretrieve(gglink, directory + 'page' + str(pagen) + '_gg' + str(file) + ".gg")
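
The URL building above (`trueurl + gglink`) can be made more robust with `urljoin`, which handles both site-relative hrefs (like the `/feed/` ones) and already-absolute hrefs; a minimal sketch, not part of the original script:

```python
# Sketch (not from the original post): urljoin builds a correct absolute URL
# whether the href is relative or already absolute, unlike plain concatenation.
try:
    from urllib.parse import urljoin   # Python 3
except ImportError:
    from urlparse import urljoin       # Python 2

base = 'https://WEBSITE.tv'
print(urljoin(base, '/feed/rss_ggs_all/GOODSTUFF'))
# -> https://WEBSITE.tv/feed/rss_ggs_all/GOODSTUFF
print(urljoin(base, 'https://WEBSITE.tv/GG/223197/download/GOODSTUFF'))
# -> https://WEBSITE.tv/GG/223197/download/GOODSTUFF (unchanged)
```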

I only want to download

https://WEBSITE.tv/GG/223197/download/GOODSTUFF

but it also grabs

/feed/rss_ggs_anime/GOODSTUFF

which I don't want.

The problem is that find_all matches anything containing GOODSTUFF. I tried to mitigate it with this:

                for download in soup.find_all(href=re.compile("GOODSTUFF")):
                    if download.find("feed"):
                        continue

but it doesn't filter anything out; trying "rss" instead of "feed" did give some results.
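
A likely reason that filter catches nothing: `download.find("feed")` searches for a child `<feed>` *tag*, not for "feed" inside the href string. Testing the href string itself works; a minimal sketch (sample HTML abbreviated, assuming bs4 is installed):

```python
import re
from bs4 import BeautifulSoup

# Abbreviated stand-in for the page: one feed link, one download link
html = ('<link href="/feed/rss_ggs_all/GOODSTUFF"/>'
        '<a href="https://WEBSITE.tv/GG/223197/download/GOODSTUFF">DL</a>')
soup = BeautifulSoup(html, "html.parser")

# Keep only matches whose href string does not contain "feed"
links = [tag["href"] for tag in soup.find_all(href=re.compile("GOODSTUFF"))
         if "feed" not in tag["href"]]
print(links)  # only the download link remains
```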

2 Answers:

Answer 0 (score: 0):

If the HTML elements are always like the ones pasted above, you can try this:

html="""
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<link href="/static/favicon-f8a3a024b0.ico" rel="shortcut icon"/>
<link href="/opensearch_ggs.xml" rel="search" title="WEBSITE anime GG" type="application/opensearchdescription+xml"/>
<link href="/opensearch_ggs2.xml" rel="search" title="WEBSITE music GG" type="application/opensearchdescription+xml"/>
<link href="/opensearch_artists.xml" rel="search" title="WEBSITE artists" type="application/opensearchdescription+xml"/>
<link href="/opensearch_requests.xml" rel="search" title="WEBSITE requests" type="application/opensearchdescription+xml"/>
<link href="/opensearch_forums.xml" rel="search" title="WEBSITE forums" type="application/opensearchdescription+xml"/>
<link href="/opensearch_users.xml" rel="search" title="WEBSITE users" type="application/opensearchdescription+xml"/>
<link href="/feed/rss_ggs_all/GOODSTUFF" rel="alternate" title="WEBSITE - All GG" type="application/rss+xml"/>
<link href="/feed/rss_ggs_anime/GOODSTUFF" rel="alternate" title="WEBSITE - Anime GG" type="application/rss+xml"/>
<span class="download_link">[<a href="https://WEBSITE.tv/GG/223197/download/GOODSTUFF" title="Download">DL</a>]</span>  
<span class="download_link">[<a href="https://WEBSITE.tv/GG/223197/download/GOODSTUFF" title="Download">DL</a>]</span>
"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(html,"lxml")
for link in soup.select(".download_link a"):
    print(link['href'])
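
Another option along the same lines, if you prefer to key on the URL shape rather than the span's class: an attribute substring selector (needs bs4 4.7+ for soupsieve-backed selectors; sample HTML abbreviated):

```python
from bs4 import BeautifulSoup

# Abbreviated stand-in for the page: one feed link, one download link
html = ('<link href="/feed/rss_ggs_all/GOODSTUFF"/>'
        '<span class="download_link">[<a href="https://WEBSITE.tv/GG/223197/'
        'download/GOODSTUFF">DL</a>]</span>')
soup = BeautifulSoup(html, "html.parser")

# Select only anchors whose href contains "/download/"
hrefs = [a['href'] for a in soup.select('a[href*="/download/"]')]
print(hrefs)
```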

Answer 1 (score: 0):

In this case, you only need to modify the regular expression. When you write re.compile("GOODSTUFF"), it matches anything that contains GOODSTUFF as a substring.

I suggest you change the regex to:

re.compile("https?://(.*)/GOODSTUFF")

The regex above will give you the output you want, as follows (only the two tags with download links):

[<a href="https://WEBSITE.tv/GG/223197/download/GOODSTUFF" title="Download">DL</a>, <a href="https://WEBSITE.tv/GG/223197/download/GOODSTUFF" title="Download">DL</a>]

Full snippet:

html = """<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<link href="/static/favicon-f8a3a024b0.ico" rel="shortcut icon"/>
<link href="/opensearch_ggs.xml" rel="search" title="WEBSITE anime GG" type="application/opensearchdescription+xml"/>
<link href="/opensearch_ggs2.xml" rel="search" title="WEBSITE music GG" type="application/opensearchdescription+xml"/>
<link href="/opensearch_artists.xml" rel="search" title="WEBSITE artists" type="application/opensearchdescription+xml"/>
<link href="/opensearch_requests.xml" rel="search" title="WEBSITE requests" type="application/opensearchdescription+xml"/>
<link href="/opensearch_forums.xml" rel="search" title="WEBSITE forums" type="application/opensearchdescription+xml"/>
<link href="/opensearch_users.xml" rel="search" title="WEBSITE users" type="application/opensearchdescription+xml"/>
<link href="/feed/rss_ggs_all/GOODSTUFF" rel="alternate" title="WEBSITE - All GG" type="application/rss+xml"/>
<link href="/feed/rss_ggs_anime/GOODSTUFF" rel="alternate" title="WEBSITE - Anime GG" type="application/rss+xml"/>
<span class="download_link">[<a href="https://WEBSITE.tv/GG/223197/download/GOODSTUFF" title="Download">DL</a>]</span>  
<span class="download_link">[<a href="https://WEBSITE.tv/GG/223197/download/GOODSTUFF" title="Download">DL</a>]</span>"""

from bs4 import BeautifulSoup
import re

soup = BeautifulSoup(html, "lxml")
download_links = soup.find_all(href=re.compile("https?://(.*)/GOODSTUFF"))
for link in download_links:
    # your download code here
    # download(link)
    pass

Also, using only a regular expression, you can get the links directly without BeautifulSoup:

download_links = [i[0] for i in re.findall("(https?://(.*)/GOODSTUFF)", html)]

The result of the line above will be:

['https://WEBSITE.tv/GG/223197/download/GOODSTUFF', 'https://WEBSITE.tv/GG/223197/download/GOODSTUFF']
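
Note that the page contains the same download link twice, so the list above has a duplicate; an order-preserving de-duplication sketch (my addition, not part of the answer, works on Python 2 and 3):

```python
from collections import OrderedDict

links = ['https://WEBSITE.tv/GG/223197/download/GOODSTUFF',
         'https://WEBSITE.tv/GG/223197/download/GOODSTUFF']
# OrderedDict.fromkeys keeps first-seen order while dropping repeats
unique = list(OrderedDict.fromkeys(links))
print(unique)  # -> ['https://WEBSITE.tv/GG/223197/download/GOODSTUFF']
```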