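I'm fairly new to Python, so things like this don't come easily to me.
I just want to loop through the contents of a web page and print each occurrence to the console window, but I've clearly got something wrong.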
import sys
import re
import urllib2
import urlparse
crawling = tocrawl.pop()
response = urllib2.urlopen(crawling)
msg = response.read()
endDiv = msg.find('</div>')
while endDiv != -1:
    endDiv = msg.find('</div>')
    startPos = msg.find('class="facultyname">', endDiv)
    if startPos != -1:
        nextPos = msg.find('.php">', startPos)
        endPos = msg.find('</a>', nextPos)
        if endPos != -1:
            name = msg[nextPos+6:endPos]
            print name, " ",
    startPos = msg.find('function escramble()')
    if startPos != -1:
        nextPos = msg.find('b=', startPos)
        endPos = msg.find('c', nextPos)
        if endPos != -1:
            email = msg[nextPos+3:endPos-1]
            email = email[:-13] + '@email.com'
            print email
    endDiv = msg.find('</div>', endPos)
I've managed to capture the first occurrence; I just want to loop to the end of the page and collect the rest.
Sample HTML:
<div id="main-text">
<p class="title">Research Scientists</p>
<div class="space"> </div>
<img src="photos/icons/bastolaicon.jpg" class="faculty" width="53" height="71" alt="Bastola Photo" />
<div class="facultyname">
<strong><a href="people/bastola.php">person1</a>
<br /><em>Post-Doctoral Scientist</em></strong>
<br />
</div>
<div class="facultybody">
Rm. 218A
<br /><em><script type="text/javascript">
<!--
function escramble(){
var a,b,c,d,e,f,g,h,i
a='<a href=\"mai'
b='person1'
c='\">'
a+='lto:'
b+='@'
e='</a>'
f=''
b+='email.com'
g='<img src=\"'
h=''
i='\" alt="Email us." border="0">'
if (f) d=f
else if (h) d=g+h+i
else d=b
document.write(a+b+c+d+e)
}
escramble()
//-->
</script></em>
</div>
<div class="space"> </div>
<img src="photos/icons/person2icon.jpg" class="faculty" width="53" height="71" alt="person2 Photo" />
<div class="facultyname">
<strong><a href="people/person2.shtml">person2</a>
<br /><em>Assistant Research Scientist</em></strong>
<br />
</div>
<div class="facultybody">
Rm. 227
<br />(850) 645-1253
<br /><em><script type="text/javascript">
<!--
function escramble(){
var a,b,c,d,e,f,g,h,i
a='<a href=\"mai'
b='person2'
c='\">'
a+='lto:'
b+='@'
e='</a>'
f=''
b+='email.com'
g='<img src=\"'
h=''
i='\" alt="Email us." border="0">'
if (f) d=f
else if (h) d=g+h+i
else d=b
document.write(a+b+c+d+e)
}
escramble()
//-->
</script></em>
</div>
<div class="spacer"> </div>
Answer 0 (score: 0)
A quick and dirty approach that works for your sample data:
>>> res = re.findall(r"b\+?='(.*?)'", html)
>>> res
['person1', '@', 'email.com', 'person2', '@', 'email.com']
>>> emails = [''.join(group) for group in zip(*[iter(res)]*3)]
>>> emails
['person1@email.com', 'person2@email.com']
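The `zip(*[iter(res)]*3)` expression is easy to misread. It puts the same iterator object into a list three times, so `zip` draws three consecutive items from it for every tuple it builds. A minimal sketch of the idea, written for Python 3; `chunks_of` is a hypothetical helper name, not something from the original answer:

```python
def chunks_of(items, size):
    """Group a flat sequence into tuples of `size` consecutive items.

    Hypothetical helper for illustration. Trailing items that do not fill
    a whole group are silently dropped, exactly as with the zip idiom.
    """
    shared = iter(items)                 # a single iterator object...
    return list(zip(*[shared] * size))   # ...passed to zip `size` times


res = ['person1', '@', 'email.com', 'person2', '@', 'email.com']
emails = [''.join(group) for group in chunks_of(res, 3)]
print(emails)  # ['person1@email.com', 'person2@email.com']
```

Because every argument to `zip` is the same iterator, each call to `next` advances all of them at once, which is what produces non-overlapping groups.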
Since this is already horrible, let's go all the way:
>>> names = [name.split('>', 1)[1] for name in re.findall(r'href="people(.*?)</a>', html)]
>>> names
['person1', 'person2']
>>> zip(names, emails)
[('person1', 'person1@email.com'), ('person2', 'person2@email.com')]
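The session above is Python 2, where `zip` returns a list. For Python 3, the same two regular expressions can be wrapped in a function roughly as follows. This is a sketch under assumptions: `extract_people` is a made-up name, the `html` string is a trimmed copy of the sample from the question, and the grouping uses slicing rather than the `zip` idiom.

```python
import re

# Trimmed copy of the question's sample HTML (only the lines the two
# regular expressions care about are kept).
html = """
<div class="facultyname">
<strong><a href="people/bastola.php">person1</a>
<br /><em>Post-Doctoral Scientist</em></strong>
</div>
<script type="text/javascript">
function escramble(){
b='person1'
b+='@'
b+='email.com'
}
</script>
<div class="facultyname">
<strong><a href="people/person2.shtml">person2</a>
<br /><em>Assistant Research Scientist</em></strong>
</div>
<script type="text/javascript">
function escramble(){
b='person2'
b+='@'
b+='email.com'
}
</script>
"""


def extract_people(page):
    """Return (name, email) pairs; assumes every person has 3 `b` pieces."""
    # Every quoted string assigned or appended to the JavaScript variable b.
    parts = re.findall(r"b\+?='(.*?)'", page)
    # Join each run of three pieces (user, '@', domain) into one address.
    emails = [''.join(parts[i:i + 3]) for i in range(0, len(parts), 3)]
    # Text after the first '>' of each people link is the display name.
    names = [m.split('>', 1)[1]
             for m in re.findall(r'href="people(.*?)</a>', page)]
    return list(zip(names, emails))


print(extract_people(html))
# [('person1', 'person1@email.com'), ('person2', 'person2@email.com')]
```

If one person on the real page has no script block, names and emails will fall out of step, so it is worth comparing the two list lengths before trusting the pairs.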
Note: this only works for your sample data. HTML is fickle, so don't expect this to be robust, easy to maintain, and so on.
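If you would rather keep the `str.find` approach from the question, the loop only ever sees the first person because several searches start again from the beginning of the page. One way to fix that is to carry a single position forward and pass it as the start argument of every `find` call. A rough Python 3 sketch, assuming each `facultyname` block is followed by its own `escramble` script; `quoted_after` and `scrape` are hypothetical names:

```python
def quoted_after(text, marker, pos):
    """Return (value, end) for the single-quoted string following `marker`.

    The search starts at index `pos`; (None, -1) means nothing was found.
    """
    start = text.find(marker, pos)
    if start == -1:
        return None, -1
    start += len(marker)
    end = text.find("'", start)
    if end == -1:
        return None, -1
    return text[start:end], end


def scrape(msg):
    """Collect (name, email) pairs, always searching from `pos` onward."""
    people = []
    pos = 0
    while True:
        block = msg.find('class="facultyname">', pos)
        if block == -1:
            break                              # no more people on the page
        link = msg.find('<a ', block)
        if link == -1:
            break
        name_start = msg.find('>', link) + 1   # just past the opening tag
        name_end = msg.find('</a>', name_start)
        if name_start == 0 or name_end == -1:
            break
        name = msg[name_start:name_end]
        # The script builds the address as b='user', b+='@', b+='domain'.
        pieces, end = [], name_end
        for marker in ("b='", "b+='", "b+='"):
            value, end = quoted_after(msg, marker, end)
            if value is None:
                break
            pieces.append(value)
        if len(pieces) != 3:
            break
        people.append((name, ''.join(pieces)))
        pos = end                              # resume after this person
    return people


sample = (
    '<div class="facultyname"><a href="people/bastola.php">person1</a></div>'
    "<script>b='person1'\nb+='@'\nb+='email.com'</script>"
    '<div class="facultyname"><a href="people/person2.shtml">person2</a></div>'
    "<script>b='person2'\nb+='@'\nb+='email.com'</script>"
)
for name, email in scrape(sample):
    print(name, email)
```

Searching for `<a ` and the next `>` instead of `.php">` also handles links such as `person2.shtml`, which the original search string would miss.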