I am trying to scrape some pages on a website. Here is a sample of the HTML:
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<link href="/static/favicon-f8a3a024b0.ico" rel="shortcut icon"/>
<link href="/opensearch_ggs.xml" rel="search" title="WEBSITE anime GG" type="application/opensearchdescription+xml"/>
<link href="/opensearch_ggs2.xml" rel="search" title="WEBSITE music GG" type="application/opensearchdescription+xml"/>
<link href="/opensearch_artists.xml" rel="search" title="WEBSITE artists" type="application/opensearchdescription+xml"/>
<link href="/opensearch_requests.xml" rel="search" title="WEBSITE requests" type="application/opensearchdescription+xml"/>
<link href="/opensearch_forums.xml" rel="search" title="WEBSITE forums" type="application/opensearchdescription+xml"/>
<link href="/opensearch_users.xml" rel="search" title="WEBSITE users" type="application/opensearchdescription+xml"/>
<link href="/feed/rss_ggs_all/GOODSTUFF" rel="alternate" title="WEBSITE - All GG" type="application/rss+xml"/>
<link href="/feed/rss_ggs_anime/GOODSTUFF" rel="alternate" title="WEBSITE - Anime GG" type="application/rss+xml"/>
<span class="download_link">[<a href="https://WEBSITE.tv/GG/223197/download/GOODSTUFF" title="Download">DL</a>]</span>
<span class="download_link">[<a href="https://WEBSITE.tv/GG/223197/download/GOODSTUFF" title="Download">DL</a>]</span>
Here is what I am working with (`br`, `res_init`, `pages`, `pagen`, `url2`, `pageout`, and `file` are set up earlier in the script):

    import re
    import urllib
    import urllib2
    from time import sleep
    from bs4 import BeautifulSoup

    for x in range(pages):
        pagen += 1
        url3 = url2[:40] + str(pagen) + url2[41:]
        print "url3 = ", url3
        ggs = br.open(url3)
        soup = BeautifulSoup(ggs, "lxml")
        trueurl = 'https://WEBSITE.tv'
        # Find the gg links
        download = soup.find_all(href=re.compile("GOODSTUFF"))
        # Loop over the links and download each gg
        for link in download:
            sleep(10)
            gglink = link.get('href')
            gglink = trueurl + gglink
            print gglink
            hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
                   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'}
            req = urllib2.Request(gglink, headers=hdr)
            res_init()
            res = urllib2.urlopen(req)
            directory = "/home/cyber/yen/"  # gg directory, change as you please.
            file += 1
            print "Page", pagen, "of", pageout, ".....", file, 'ggs downloaded'
            urllib.urlretrieve(gglink, directory + 'page' + str(pagen) + '_gg' + str(file) + ".gg")
I only want to download

    https://WEBSITE.tv/GG/223197/download/GOODSTUFF

but it also grabs

    /feed/rss_ggs_anime/GOODSTUFF

and I don't want that.
The problem is that find_all matches anything containing GOODSTUFF as a substring. I tried to work around it with

    for download in soup.find_all(href=re.compile("GOODSTUFF")):
        if download.find("feed"):
            continue

but then it catches nothing at all; trying "rss" instead of "feed" did give some results.
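A likely reason that filter never fires: in BeautifulSoup, `download.find("feed")` searches for a `<feed>` child *tag* (which never exists here), not for the substring "feed". A minimal sketch of a substring test on the href strings themselves (the list below is hand-copied from the pasted HTML, not fetched):

```python
# hrefs that the broad GOODSTUFF pattern would match in the pasted HTML
hrefs = [
    "/feed/rss_ggs_all/GOODSTUFF",
    "/feed/rss_ggs_anime/GOODSTUFF",
    "https://WEBSITE.tv/GG/223197/download/GOODSTUFF",
]

# Test the href *string* for "feed"; tag.find("feed") would instead look
# for a <feed> child element and therefore skips nothing.
wanted = [h for h in hrefs if "feed" not in h]
print(wanted)
```

In the loop above, the equivalent check would be `if "feed" in link.get('href', ''): continue`.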
Answer 0 (score 0):
If the HTML elements always look like the ones you pasted above, you can try this:
html="""
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<link href="/static/favicon-f8a3a024b0.ico" rel="shortcut icon"/>
<link href="/opensearch_ggs.xml" rel="search" title="WEBSITE anime GG" type="application/opensearchdescription+xml"/>
<link href="/opensearch_ggs2.xml" rel="search" title="WEBSITE music GG" type="application/opensearchdescription+xml"/>
<link href="/opensearch_artists.xml" rel="search" title="WEBSITE artists" type="application/opensearchdescription+xml"/>
<link href="/opensearch_requests.xml" rel="search" title="WEBSITE requests" type="application/opensearchdescription+xml"/>
<link href="/opensearch_forums.xml" rel="search" title="WEBSITE forums" type="application/opensearchdescription+xml"/>
<link href="/opensearch_users.xml" rel="search" title="WEBSITE users" type="application/opensearchdescription+xml"/>
<link href="/feed/rss_ggs_all/GOODSTUFF" rel="alternate" title="WEBSITE - All GG" type="application/rss+xml"/>
<link href="/feed/rss_ggs_anime/GOODSTUFF" rel="alternate" title="WEBSITE - Anime GG" type="application/rss+xml"/>
<span class="download_link">[<a href="https://WEBSITE.tv/GG/223197/download/GOODSTUFF" title="Download">DL</a>]</span>
<span class="download_link">[<a href="https://WEBSITE.tv/GG/223197/download/GOODSTUFF" title="Download">DL</a>]</span>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
for link in soup.select(".download_link a"):
    print(link['href'])
Answer 1 (score 0):
In this case you only need to modify the regular expression. When you write re.compile("GOODSTUFF"), it matches everything that contains GOODSTUFF as a substring. I suggest changing the regex to:

    re.compile("http(?:s)://(.*)/GOODSTUFF")

This regex gives you the output you want, namely only the two tags with the download link:
[<a href="https://WEBSITE.tv/GG/223197/download/GOODSTUFF" title="Download">DL</a>, <a href="https://WEBSITE.tv/GG/223197/download/GOODSTUFF" title="Download">DL</a>]
Full snippet:
html = """<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<link href="/static/favicon-f8a3a024b0.ico" rel="shortcut icon"/>
<link href="/opensearch_ggs.xml" rel="search" title="WEBSITE anime GG" type="application/opensearchdescription+xml"/>
<link href="/opensearch_ggs2.xml" rel="search" title="WEBSITE music GG" type="application/opensearchdescription+xml"/>
<link href="/opensearch_artists.xml" rel="search" title="WEBSITE artists" type="application/opensearchdescription+xml"/>
<link href="/opensearch_requests.xml" rel="search" title="WEBSITE requests" type="application/opensearchdescription+xml"/>
<link href="/opensearch_forums.xml" rel="search" title="WEBSITE forums" type="application/opensearchdescription+xml"/>
<link href="/opensearch_users.xml" rel="search" title="WEBSITE users" type="application/opensearchdescription+xml"/>
<link href="/feed/rss_ggs_all/GOODSTUFF" rel="alternate" title="WEBSITE - All GG" type="application/rss+xml"/>
<link href="/feed/rss_ggs_anime/GOODSTUFF" rel="alternate" title="WEBSITE - Anime GG" type="application/rss+xml"/>
<span class="download_link">[<a href="https://WEBSITE.tv/GG/223197/download/GOODSTUFF" title="Download">DL</a>]</span>
<span class="download_link">[<a href="https://WEBSITE.tv/GG/223197/download/GOODSTUFF" title="Download">DL</a>]</span>"""
from bs4 import BeautifulSoup
import re
soup = BeautifulSoup(html, "lxml")
download_links = soup.find_all(href=re.compile("http(?:s)://(.*)/GOODSTUFF"))
for link in download_links:
    # your download code here
    # download(link)
    pass
Also, using only the regex, you can extract the links directly without BeautifulSoup:

    download_links = [i[0] for i in re.findall("(http(?:s)://(.*)/GOODSTUFF)", html)]
The result of the line above will be:
['https://WEBSITE.tv/GG/223197/download/GOODSTUFF', 'https://WEBSITE.tv/GG/223197/download/GOODSTUFF']
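Note that the sample page links each GG twice (two identical download_link spans), so the regex returns duplicates. A small order-preserving dedup (a sketch using the output above; not part of the original answer) avoids fetching the same file twice:

```python
# duplicates as returned by the regex over the sample HTML
download_links = ['https://WEBSITE.tv/GG/223197/download/GOODSTUFF',
                  'https://WEBSITE.tv/GG/223197/download/GOODSTUFF']

# keep only the first occurrence of each URL, preserving order
seen = set()
unique = [u for u in download_links if not (u in seen or seen.add(u))]
print(unique)
```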