When I write the regular expression <title>(.+?)</title>, it works, but when I change the title tag to <table>(.+?)</table>, it gives '[]' (square brackets) as the output.
My code is:
import urllib
import re

urls = ["http://physics.iitd.ac.in/content/list-faculty-members", "http://www.iitkgp.ac.in/commdir3/list.php?division=3&deptcode=ME","http://www.iitkgp.ac.in/commdir3/list.php?division=3&deptcode=CE"]
i = 0
regex = '<table>(.+?)</table>'
pattern = re.compile(regex)
while i < len(urls):
    htmlfile = urllib.urlopen(urls[i])
    htmltext = htmlfile.read()
    tables = re.findall(pattern, htmltext)
    print tables
    i += 1
Answer 0 (score: 1)
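import urllib
import re
# find_all() is the Beautiful Soup 4 API, so import from bs4;
# the older BeautifulSoup 3 package only provides findAll().
from bs4 import BeautifulSoup as bs

urls = ["http://physics.iitd.ac.in/content/list-faculty-members",
        "http://www.iitkgp.ac.in/commdir3/list.php?division=3&deptcode=ME",
        "http://www.iitkgp.ac.in/commdir3/list.php?division=3&deptcode=CE"]

for url in urls:
    htmlfile = urllib.urlopen(url)  # Python 2, as in the question
    htmltext = htmlfile.read()
    # Parse the page instead of matching it with a regex
    soup = bs(htmltext, "html.parser")
    # Every <table> element, whether or not the tag has attributes
    tables = soup.find_all('table')
    print tables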
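
As for why the regex gives '[]': the question doesn't show the pages' markup, so this is an assumption, but the usual cause is that the table tag carries attributes (so the literal `<table>` never matches) and the table spans several lines (and `.` does not match a newline by default). Here is a minimal sketch on a made-up HTML snippet; `sample_html` and its contents are hypothetical, not taken from the URLs above.

```python
import re

# Hypothetical sample markup (NOT taken from the pages in the question):
# the <table> tag has attributes and its content spans several lines.
sample_html = """<html><head><title>Faculty</title></head>
<body>
<table class="views-table" border="0">
<tr><td>Name</td></tr>
</table>
</body></html>"""

# The title pattern works: the tag has no attributes and sits on one line.
titles = re.findall(r'<title>(.+?)</title>', sample_html)

# The original table pattern finds nothing: the literal '<table>' does not
# match '<table class=...>', and '.' does not match newlines by default.
strict = re.findall(r'<table>(.+?)</table>', sample_html)

# Allowing attributes and letting '.' span newlines (re.DOTALL) does match.
loose = re.findall(r'<table[^>]*>(.+?)</table>', sample_html, re.DOTALL)

print(titles)  # ['Faculty']
print(strict)  # []
print(loose)   # ['\n<tr><td>Name</td></tr>\n']
```

The looser pattern is only meant to show what was going wrong. It still breaks on nested tables and other irregular markup, which is why parsing the HTML as in the code above is the more robust route.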