我想在社交书签网站中提及所有* .wordpress网址。页面中的网址采用以下格式:
<span class="domain">somedomain.com </span>
以下是我提出的建议:
import os
import urllib2
import re
from os.path import basename
from urlparse import urlsplit
import time
baseurl = 'https://targetwebsite/pages/'
print baseurl
spage = int(raw_input("Start page?"))
epage = int(raw_input("End page?"))
for p in range (spage, epage):
url= baseurl+ str(p)
print url
urlContent = urllib2.urlopen(url).read()
#WHAT REGEXP HERE?
domainUrls = re.findall('span .*.wordpress.com (.*?) ', urlContent)
try:
for dUrl in domainUrls:
print dUrl
except:
print "an error occured"
pass
我尝试了不同的正则表达式但没有效果。感谢您的帮助。
答案 0 :(得分:0)
一个宽松的答案就是
([^ ]*\.)?wordpress.com(\/[^ ]*)?