Python code works for the title tag but not for table

Time: 2016-07-18 16:06:46

Tags: python python-2.7

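With the regular expression <title>(.+?)</title> the code works, but when the title tag is changed to <table>(.+?)</table>, the output is '[]' (square brackets, i.e. an empty list). My code is: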

import urllib
import re

urls = ["http://physics.iitd.ac.in/content/list-faculty-members", "http://www.iitkgp.ac.in/commdir3/list.php?division=3&deptcode=ME","http://www.iitkgp.ac.in/commdir3/list.php?division=3&deptcode=CE"]
i = 0
regex = '<table>(.+?)</table>'
pattern = re.compile(regex)

while i< len(urls):
    htmlfile = urllib.urlopen(urls[i])
    htmltext = htmlfile.read()
    tables  = re.findall(pattern,htmltext)

    print tables
    i+=1

1 answer:

Answer 0: (score: 1)

Use BeautifulSoup to parse the HTML instead of a regular expression:

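import urllib

# find_all() is the BeautifulSoup 4 name (package bs4); the older
# BeautifulSoup 3 module only provides findAll().
from bs4 import BeautifulSoup as bs

urls = ["http://physics.iitd.ac.in/content/list-faculty-members", 
        "http://www.iitkgp.ac.in/commdir3/list.php?division=3&deptcode=ME", 
        "http://www.iitkgp.ac.in/commdir3/list.php?division=3&deptcode=CE"]

for url in urls:
    # Python 2.7: urllib.urlopen returns a file-like object
    htmlfile = urllib.urlopen(url)
    htmltext = htmlfile.read()
    # Name the parser explicitly so the result does not depend on which parsers are installed
    soup = bs(htmltext, 'html.parser')
    # Finds every <table> element, whatever its attributes or line breaks
    tables = soup.find_all('table')

    print tables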
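
As for why the regular expression returned '[]': the two most likely causes are that the real pages write the tag with attributes (so the literal '<table>' never matches) and that the table spans several lines (by default '.' does not match a newline). The sketch below reproduces both effects on a small made-up HTML string; the markup in `html` is a hypothetical sample, not the content of the URLs above. It only illustrates the failure. The relaxed pattern still breaks on nested tables, which is why a parser is the safer choice.

```python
import re

# Hypothetical sample markup (an assumption, not taken from the real pages):
# the table tag carries attributes and its content spans several lines.
html = """<html><head><title>Faculty</title></head>
<body>
<table class="list" border="1">
<tr><td>Name</td></tr>
</table>
</body></html>"""

# The title sits on one line inside a bare tag, so the original pattern matches.
titles = re.findall(r'<title>(.+?)</title>', html)

# The literal '<table>' does not match '<table class=...>', and '.' stops at newlines.
strict = re.findall(r'<table>(.+?)</table>', html)

# Allow attributes with [^>]* and let '.' cross newlines with re.DOTALL.
relaxed = re.findall(r'<table[^>]*>(.+?)</table>', html, re.DOTALL)

print(titles)        # one title found
print(strict)        # empty list, as in the question
print(len(relaxed))  # number of tables found by the relaxed pattern
```

The single-argument print(...) calls run unchanged on Python 2.7 and Python 3.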