Python 2.7 regular expression to find specific HTML code

Time: 2014-07-14 20:51:45

Tags: html regex parsing python-2.7 web-scraping

So I got another assignment from my professor, and this one is stumping me.

  

Write a program that can replicate this exchange:

`What is the url to the course page? http://registrar.indiana.edu/browser/soc4148/index.shtml
Searching for departments on: http://registrar.indiana.edu/browser/soc4148/index.shtml
Departments loaded. List them (Y/N)? y
AAAD    AAAD/index.shtml
AADM    AADM/index.shtml
AAST    AAST/index.shtml
ABEH    ABEH/index.shtml
AERO    AERO/index.shtml
AFRI    AFRI/index.shtml
AMID    AMID/index.shtml
AMST    AMST/index.shtml
ANAT    ANAT/index.shtml
ANTH    ANTH/index.shtml
ASCS    ASCS/index.shtml
AST     AST/index.shtml
BIOC    BIOC/index.shtml
BIOL    BIOL/index.shtml
BIOT    BIOT/index.shtml
[too many to list at once]
TEL     TEL/index.shtml
THTR    THTR/index.shtml
TOPT    TOPT/index.shtml
VSCI    VSCI/index.shtml

What department would you like? aero
Looking for courses on: http://registrar.indiana.edu/browser/soc4148/AERO/index.shtml

AERO-A101.shtml     AERO-A 101     INTRO TO THE AIR FORCE TODAY
AERO-A301.shtml     AERO-A 301     AIR FORCE LEADERSHIP STUDIES


What is the url to the course page? http://registrar.indiana.edu/browser/soc4148/index.shtml
Searching for departments on: http://registrar.indiana.edu/browser/soc4148/index.shtml
Departments loaded. List them (Y/N)?n
What department would you like? INFO
Looking for courses on: http://registrar.indiana.edu/browser/soc4148/INFO/index.shtml
...
...
...`

The program then repeats. We are only allowed to use list comprehensions and regular expressions, plus the urllib and os libraries.

If you look at the page source, you will see that the links are listed like this:

  

First page (department list)

'<strong><a href="AAAD/index.shtml">AAAD</a></strong> African Am & Afri Diaspora Std<br />
<strong><a href="AADM/index.shtml">AADM</a></strong> Arts Administration<br />
<strong><a href="AAST/index.shtml">AAST</a></strong> Asian American Studies<br />
<strong><a href="ABEH/index.shtml">ABEH</a></strong> Animal Behavior<br />
<strong><a href="AERO/index.shtml">AERO</a></strong> Aerospace Studies<br />
<strong><a href="AFRI/index.shtml">AFRI</a></strong> African Studies<br />
<strong><a href="AMID/index.shtml">AMID</a></strong> Apparel Merch/Int Design<br />
<strong><a href="AMST/index.shtml">AMST</a></strong> American Studies<br />
<strong><a href="ANAT/index.shtml">ANAT</a></strong> Anatomy<br />'

And, very similarly:

  

Second page (course list)

'<td>
<strong><a href="AERO-A101.shtml">AERO-A 101</a></strong> INTRO TO THE AIR FORCE TODAY<br>
<strong><a href="AERO-A201.shtml">AERO-A 201</a></strong> EVOLUTION USAF AIR & SPACE PWR<br>
<strong><a href="AERO-A301.shtml">AERO-A 301</a></strong> AIR FORCE LEADERSHIP STUDIES<br>
</td>
<td>
<strong><a href="AERO-A401.shtml">AERO-A 401</a></strong> NATL SEC AFFRS/PREP ACTV DUTY<br>
</td>'

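To save space, I have left out the function that opens the web page and gets its contents (using .read(), so the source comes back as one big string). I was able to get the department links (the first part of the code) using these regular expressions and loops.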

'import re

# `url` and `contents` come from the omitted fetch code
# (`contents` is the page source returned by .read()).
the_links, the_names, the_descripts = [], [], []

print "Searching for departments on: " + url
# Each raw match still carries its surrounding markup; the loops below strip it.
links = [item for item in re.findall('<strong><a href="[\w./-]+">', contents)]

names = [item for item in re.findall('.shtml">[\w]+[-]?[\w]?[-]?[ ]?[\d]?[\d]?[\d]?</a></strong>', contents)]

descripts = [item for item in re.findall('</a></strong>[\s][\w\s]+[&/-]?[\w\s]+<br', contents)]

# Strip the markup around each href value.
for link in links:
    link = link.replace('<strong><a href="', '')
    link = link.replace('">', '')
    the_links.append(link)

# Strip the markup around each link text.
for name in names:
    name = name.replace('.shtml">', '')
    name = name.replace('</a></strong>', '')
    the_names.append(name)

# Strip the markup around each description.
for each in descripts:
    each = each.replace('</a></strong> ', '')
    each = each.replace('<br', '')
    the_descripts.append(each)

# Keep asking until the user answers Y or N.
while True:
    list_deps = raw_input("List them (Y/N)? ")
    if list_deps.lower() == "y":
        for i in range(len(links)):
            # Short links get an extra tab so the description column lines up.
            if len(the_links[i]) < 16:
                print the_names[i] + "\t" + the_links[i] + "\t\t" + the_descripts[i]
            else:
                print the_names[i] + "\t" + the_links[i] + "\t" + the_descripts[i]
        break
    elif list_deps.lower() == "n":
        break
    else:
        print "You must enter Y or N."'

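However, my problem appears when I try to use the same regular expressions for the second part (getting the course links, course names, and course descriptions).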

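Can anyone spot where I'm going wrong? If my full code is needed, I can post it or send it to anyone who wants it. I'm just not sure why my regular expressions work for the departments but not for the courses.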
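For comparison, here is a minimal sketch of pulling the link, name, and description out in one pass with capture groups, instead of three separate patterns followed by .replace() cleanup. The pattern ENTRY_RE and the helper parse_entries are hypothetical names, not part of the assignment, and the sample string is just two rows copied from the page source shown above. It only assumes that every entry has the `<strong><a href="...">...</a></strong> description<br` shape seen on both pages.

```python
import re

# Hypothetical pattern: one capture group each for the href value, the link
# text, and the description between </strong> and the following <br tag.
ENTRY_RE = re.compile(
    r'<strong><a href="([^"]+)">([^<]+)</a></strong>\s*([^<]*?)\s*<br'
)

def parse_entries(contents):
    """Return a list of (link, name, description) tuples found in contents."""
    return ENTRY_RE.findall(contents)

# One department row and one course row, copied from the page source above.
sample = (
    '<strong><a href="AERO/index.shtml">AERO</a></strong> Aerospace Studies<br />\n'
    '<strong><a href="AERO-A201.shtml">AERO-A 201</a></strong> '
    'EVOLUTION USAF AIR & SPACE PWR<br>\n'
)
entries = parse_entries(sample)
```

Because re.findall returns tuples when a pattern has several groups, the three values for each row stay together, so the lists cannot drift out of step if one pattern misses a row.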

  

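Edit: It turns out I was applying .lower() to the user's input to build the address of the department page, but the input needs to be .upper(), otherwise the request lands on a 404 page.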
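A small sketch of that fix, assuming the department page lives next to the course index under an upper-case directory name, as in the sample exchange. The helper name department_url is made up for illustration.

```python
def department_url(index_url, dept):
    """Build the department index URL from the course index URL.

    Assumes the department directory is the upper-case department code,
    as in the sample exchange (AERO/index.shtml).
    """
    base = index_url.rsplit('/', 1)[0]  # drop the trailing "index.shtml"
    return base + '/' + dept.strip().upper() + '/index.shtml'

url = department_url(
    'http://registrar.indiana.edu/browser/soc4148/index.shtml', 'aero'
)
```

With the input normalized through .upper(), typing aero, Aero, or AERO all lead to the same department page.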

0 answers:

No answers yet