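Essentially, the script will download images from the random and toplist pages of wallbase.cc. It looks for the 7-digit string that identifies each image, inserts that id into a URL, and downloads the file. The only problem I seem to have is isolating the 7-digit string.
What I want to be able to do is...
Search for <div id="thumbxxxxxxx"
and then assign xxxxxxx
to a variable.
Here's what I have so far.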
import urllib
import os
import sys
import re
#Written in Python 2.7 with LightTable
def get_id():
    import urllib.request
    req = urllib.request.Request('http://wallbase.cc/'+initial_prompt)
    response = urllib.request.urlopen(req)
    the_page = response.read()
    for "data-id="" in the_page
def toplist():
    #We need to define how to find the images to download
    #The idea is to go to http://wallbase.cc/x and to take all of strings containing <a href="http://wallbase.cc/wallpaper/xxxxxxx" </a>
    #And to request the image file from that URL.
    #Then the file will be put in a user defined directory
    image_id = raw_input("Enter the seven digit identifier for the image to be downloaded to "+ directory+ "...\n>>> ")
    f = open(directory+image_id+ '.jpg','wb')
    f.write(urllib.urlopen('http://wallpapers.wallbase.cc/rozne/wallpaper-'+image_id+'.jpg').read())
    f.close()
directory = raw_input("Enter the directory in which the images will be downloaded.\n>>> ")
initial_prompt = input("What do you want to download from?\n\t1: Toplist\n\t2: Random\n>>> ")
if initial_prompt == 1:
    urlid = 'toplist'
    toplist()
elif initial_prompt == 2:
    urlid = 'random'
    random()
Any and all help is greatly appreciated :)
Answer 0: (score: 3)
You probably want to use a web scraping library such as BeautifulSoup; see, for example, this SO question on web scraping in Python.
import re
import urllib2
from BeautifulSoup import BeautifulSoup

# download and parse HTML
url = 'http://wallbase.cc/toplist'
html = urllib2.urlopen(url).read()
soup = BeautifulSoup(html)

# find the links we want (raw string so \d and \. reach the regex engine intact)
links = soup('a', href=re.compile(r'^http://wallbase\.cc/wallpaper/\d+$'))
for l in links:
    href = l.get('href')
    print href  # u'http://wallbase.cc/wallpaper/1750539'
    print href.split('/')[-1]  # u'1750539'
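The same parse-then-filter idea can be applied to the <div id="thumbxxxxxxx"> markup the question asks about. The sketch below is not BeautifulSoup: it swaps in the standard library's HTMLParser so that it runs without installing anything. The class name ThumbIdParser and the sample markup are made up for illustration, and the real page structure may well differ.

```python
import re

try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7


class ThumbIdParser(HTMLParser):
    """Collect the digits from every <div id="thumbNNNNNNN"> start tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.ids = []

    def handle_starttag(self, tag, attrs):
        if tag != 'div':
            return
        for name, value in attrs:
            if name != 'id':
                continue
            # Accept only "thumb" followed by exactly seven digits.
            match = re.match(r'thumb(\d{7})$', value or '')
            if match:
                self.ids.append(match.group(1))


# Hypothetical sample markup, standing in for the downloaded page.
sample_html = '''
<div id="thumb1750539" class="thumb"><a href="http://wallbase.cc/wallpaper/1750539">x</a></div>
<div id="header">not a thumbnail</div>
<div id="thumb1234567" class="thumb"></div>
'''

parser = ThumbIdParser()
parser.feed(sample_html)
thumb_ids = parser.ids
print(thumb_ids)
```

Each entry in thumb_ids is a string, so it can be concatenated straight into the download URL from the question.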
Answer 1: (score: 0)
If you only want to use the standard library, you can use regular expressions.
pattern = re.compile(r'<div id="thumb(.{7})"')
...
# data_id, not data-id: a hyphen is not allowed in a Python identifier
for data_id in re.findall(pattern, the_page):
    pass  # do something with data_id
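As a rough end-to-end sketch of this approach, the snippet below runs the regex over a made-up page fragment and builds the download URLs. The sample markup is an assumption, the URL template is copied from the question's own download code, and the pattern uses \d{7} rather than .{7} so that only digits are accepted.

```python
import re

# Hypothetical sample of the page source; the real markup may differ.
the_page = (
    '<div id="thumb1750539" class="thumb">...</div>'
    '<div id="thumb1234567" class="thumb">...</div>'
)

# \d{7} matches exactly seven digits after "thumb".
pattern = re.compile(r'<div id="thumb(\d{7})"')
image_ids = pattern.findall(the_page)

# URL template taken from the question's download code.
image_urls = [
    'http://wallpapers.wallbase.cc/rozne/wallpaper-' + image_id + '.jpg'
    for image_id in image_ids
]
for image_url in image_urls:
    print(image_url)
```

On Python 3, response.read() returns bytes, so the page would need to be decoded (for example with the_page.decode('utf-8'), assuming that encoding) before a str pattern can be applied to it.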