Is it possible to extract the titles and the download links ("Uploaded") from hd-area.org?
Here's a code example.
That's what I have so far.
import urllib2
from BeautifulSoup import BeautifulSoup

page = urllib2.urlopen("http://hd-area.org").read()
soup = BeautifulSoup(page)
for title in soup.findAll("div", {"class": "title"}):
    print title.getText()
for a in soup.findAll('a'):
    if 'Uploaded.net' in a:
        print a['href']
It already extracts the titles.
I have also found the place where the links should be extracted.
It extracts links, but not the right ones...
Any suggestion on how to make the script first check that the "div" and the "link" are inside this div class: <div class="topbox">?
EDIT:
I've got it working now.
Here is the final code.
Thank you guys for pointing me in the right direction.
import urllib2
from BeautifulSoup import BeautifulSoup
import datetime
import PyRSS2Gen

print "top_rls"
page = urllib2.urlopen("http://hd-area.org/index.php?s=Cinedubs").read()
soup = BeautifulSoup(page)
movieTit = []
movieLink = []
for title in soup.findAll("div", {"class": "title"}):
    movieTit.append(title.getText())
for span in soup.findAll('span', attrs={"style": "display:inline;"}, recursive=True):
    for a in span.findAll('a'):
        if 'ploaded' in a.getText():
            movieLink.append(a['href'])
        elif 'cloudzer' in a.getText():
            movieLink.append(a['href'])
for i in range(len(movieTit)):
    print movieTit[i]
    print movieLink[i]

rss = PyRSS2Gen.RSS2(
    title="HD-Area Cinedubs",
    link="http://hd-area.org/index.php?s=Cinedubs",
    description=" ",
    lastBuildDate=datetime.datetime.now(),
    # one RSSItem per scraped title/link pair instead of ten
    # hand-written RSSItem blocks
    items=[PyRSS2Gen.RSSItem(title=movieTit[i], link=movieLink[i])
           for i in range(10)])
rss.write_xml(open("cinedubs.xml", "w"))
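Since this code indexes movieTit[i] and movieLink[i] in parallel and hard-codes ten items, the two lists have to stay exactly in step. A minimal Python 3 sketch of a safer pairing with zip(), using made-up sample data in place of the scraped page:

```python
# Sample data standing in for movieTit / movieLink
# (hypothetical titles and URLs; the real values come from the page).
titles = ["Movie A", "Movie B", "Movie C"]
links = ["http://example.com/a", "http://example.com/b"]

# zip() stops at the shorter list, so a page with fewer links than
# titles can never raise an IndexError the way movieLink[i] can.
items = [{"title": t, "link": l} for t, l in zip(titles, links)]

for item in items:
    print(item["title"], item["link"])
```

The resulting dicts can feed PyRSS2Gen.RSSItem in a loop the same way the list comprehension above does.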
Answer 0 (score: 0)
Just like this:
movieTit = []
movieLink = []
for title in soup.findAll("div", {"class": "title"}):
    movieTit.append(title.getText())
for a in soup.findAll('a'):
    if 'ploaded' in a.getText():
        movieLink.append(a['href'])
for i in range(0, len(movieTit)/2, 2):
    print movieTit[i]
    print movieTit[i+1]
    print movieLink[i]
    print movieLink[i+1]
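It may help to trace what that stride actually visits: with four titles, range(0, len(movieTit)/2, 2) yields only index 0, so only the first title/link pair prints. A Python 3 sketch on dummy data (// is the integer division that / performs in Python 2):

```python
# Dummy stand-ins for the scraped lists
movieTit = ["T0", "T1", "T2", "T3"]
movieLink = ["L0", "L1", "L2", "L3"]

pairs = []
# range(0, 4 // 2, 2) visits only index 0, so just the first
# title pair and link pair are collected
for i in range(0, len(movieTit) // 2, 2):
    pairs.append((movieTit[i], movieTit[i + 1]))
    pairs.append((movieLink[i], movieLink[i + 1]))

print(pairs)  # [('T0', 'T1'), ('L0', 'L1')]
```

If the goal is to visit every pair, a plain range(len(movieTit)) over the lists is simpler, which is what the question's final code does.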
Answer 1 (score: 0)
One suggestion is to first find all the
<div class="topbox">
elements, in case there is more than one of them in the page. You can use the find_all function (or find) like this:
soup = BeautifulSoup(page)
# in case you want to find all of them
for item in soup.find_all('div', class_='topbox'):
    # here you have to check which tag holds the title: <span>, <a> or another
    # check whether the tag exists before accessing it
    if item.span is not None:
        title = item.span.text
    # the same check for the link
    if item.a is not None:
        link = item.a['href']
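The same scoping idea can be sketched without BeautifulSoup, using only the standard library's html.parser (Python 3). The markup below is a made-up stand-in for the real hd-area.org page, so the tag layout and class names are assumptions:

```python
from html.parser import HTMLParser

# Hypothetical markup modeled on the question's <div class="topbox"> layout.
HTML = """
<div class="topbox"><span>Title One</span><a href="http://example.com/1">link</a></div>
<div class="other"><a href="http://example.com/skip">link</a></div>
<div class="topbox"><span>Title Two</span><a href="http://example.com/2">link</a></div>
"""

class TopboxParser(HTMLParser):
    """Collect <span> text and <a href> only inside div.topbox."""

    def __init__(self):
        super().__init__()
        self.inside = 0       # nesting depth inside a topbox div
        self.in_span = False
        self.titles = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            # count the topbox div itself plus any divs nested in it
            if self.inside or attrs.get("class") == "topbox":
                self.inside += 1
        elif self.inside:
            if tag == "span":
                self.in_span = True
            elif tag == "a" and "href" in attrs:
                self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "div" and self.inside:
            self.inside -= 1
        elif tag == "span":
            self.in_span = False

    def handle_data(self, data):
        if self.in_span and data.strip():
            self.titles.append(data.strip())

parser = TopboxParser()
parser.feed(HTML)
print(parser.titles)  # ['Title One', 'Title Two']
print(parser.links)   # ['http://example.com/1', 'http://example.com/2']
```

Anchors outside a topbox div (like the one in div.other) are skipped, which is exactly the "check the link is inside this div class" behavior the question asks for.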
I couldn't find the div you want in the page. If you need more help, tell me exactly what you are looking for.