So, I got another assignment from my professor, and this one is stumping me.
Write a program that can replicate this exchange:
`What is the url to the course page? http://registrar.indiana.edu/browser/soc4148/index.shtml
Searching for departments on: http://registrar.indiana.edu/browser/soc4148/index.shtml
Departments loaded. List them (Y/N)? y
AAAD AAAD/index.shtml
AADM AADM/index.shtml
AAST AAST/index.shtml
ABEH ABEH/index.shtml
AERO AERO/index.shtml
AFRI AFRI/index.shtml
AMID AMID/index.shtml
AMST AMST/index.shtml
ANAT ANAT/index.shtml
ANTH ANTH/index.shtml
ASCS ASCS/index.shtml
AST AST/index.shtml
BIOC BIOC/index.shtml
BIOL BIOL/index.shtml
BIOT BIOT/index.shtml
[too many to list at once]
TEL TEL/index.shtml
THTR THTR/index.shtml
TOPT TOPT/index.shtml
VSCI VSCI/index.shtml
What department would you like? aero
Looking for courses on: http://registrar.indiana.edu/browser/soc4148/AERO/index.shtml
AERO-A101.shtml AERO-A 101 INTRO TO THE AIR FORCE TODAY
AERO-A301.shtml AERO-A 301 AIR FORCE LEADERSHIP STUDIES
What is the url to the course page? http://registrar.indiana.edu/browser/soc4148/index.shtml
Searching for departments on: http://registrar.indiana.edu/browser/soc4148/index.shtml
Departments loaded. List them (Y/N)?n
What department would you like? INFO
Looking for courses on: http://registrar.indiana.edu/browser/soc4148/INFO/index.shtml
...
...
...`
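The program then repeats. We are only allowed to use list comprehensions and regular expressions, plus the urllib and os libraries.
If you look at the source of the web pages, you'll see the links are listed like this:
First page (department list)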
'<strong><a href="AAAD/index.shtml">AAAD</a></strong> African Am & Afri Diaspora Std<br />
<strong><a href="AADM/index.shtml">AADM</a></strong> Arts Administration<br />
<strong><a href="AAST/index.shtml">AAST</a></strong> Asian American Studies<br />
<strong><a href="ABEH/index.shtml">ABEH</a></strong> Animal Behavior<br />
<strong><a href="AERO/index.shtml">AERO</a></strong> Aerospace Studies<br />
<strong><a href="AFRI/index.shtml">AFRI</a></strong> African Studies<br />
<strong><a href="AMID/index.shtml">AMID</a></strong> Apparel Merch/Int Design<br />
<strong><a href="AMST/index.shtml">AMST</a></strong> American Studies<br />
<strong><a href="ANAT/index.shtml">ANAT</a></strong> Anatomy<br />'
And, very similarly:
Second page (course list)
'<td>
<strong><a href="AERO-A101.shtml">AERO-A 101</a></strong> INTRO TO THE AIR FORCE TODAY<br>
<strong><a href="AERO-A201.shtml">AERO-A 201</a></strong> EVOLUTION USAF AIR & SPACE PWR<br>
<strong><a href="AERO-A301.shtml">AERO-A 301</a></strong> AIR FORCE LEADERSHIP STUDIES<br>
</td>
<td>
<strong><a href="AERO-A401.shtml">AERO-A 401</a></strong> NATL SEC AFFRS/PREP ACTV DUTY<br>
</td>'
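Since both pages share the same `<strong><a href=...>` layout, one possible approach is a single pattern with three capture groups, so that `re.findall` returns the link, name, and description together. This is only a sketch: `ENTRY` and `parse_entries` are hypothetical names, and the pattern assumes every entry sits on one line and ends with `<br`, as in the samples above.

```python
import re

# Hypothetical pattern: three capture groups (href, link text, description)
# instead of three separate searches followed by str.replace() cleanup.
ENTRY = re.compile(r'<strong><a href="([^"]+)">([^<]+)</a></strong>\s*(.*?)<br')

def parse_entries(contents):
    """Return a list of (link, name, description) tuples found in contents."""
    return [(link, name, desc.strip())
            for link, name, desc in ENTRY.findall(contents)]

# Shortened samples copied from the two pages above.
departments = '''<strong><a href="AAAD/index.shtml">AAAD</a></strong> African Am & Afri Diaspora Std<br />
<strong><a href="AERO/index.shtml">AERO</a></strong> Aerospace Studies<br />'''

courses = '''<td>
<strong><a href="AERO-A101.shtml">AERO-A 101</a></strong> INTRO TO THE AIR FORCE TODAY<br>
<strong><a href="AERO-A201.shtml">AERO-A 201</a></strong> EVOLUTION USAF AIR & SPACE PWR<br>
</td>'''

dept_entries = parse_entries(departments)
course_entries = parse_entries(courses)
```

With capture groups, the same list comprehension serves both the department page and the course page, and no post-processing of the matches is needed.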
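I'll spare you the function where I open the web page and get its contents (using .read() so that I have the page source as one big string). I was able to use these regular expressions and loops to get the links to the departments (the first part of the code).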
'print "Searching for departments on: " + url
# url and contents come from the omitted page-fetching function
links = [item for item in re.findall('<strong><a href="[\w./-]+">', contents)]
names = [item for item in re.findall('.shtml">[\w]+[-]?[\w]?[-]?[ ]?[\d]?[\d]?[\d]?</a></strong>', contents)]
descripts = [item for item in re.findall('</a></strong>[\s][\w\s]+[&/-]?[\w\s]+<br', contents)]
# cleaned-up results, stripped of the surrounding HTML
the_links, the_names, the_descripts = [], [], []
for link in links:
    link = link.replace('<strong><a href="', '')
    link = link.replace('">', '')
    the_links.append(link)
for name in names:
    name = name.replace('.shtml">', '')
    name = name.replace('</a></strong>', '')
    the_names.append(name)
for each in descripts:
    each = each.replace('</a></strong> ', '')
    each = each.replace('<br', '')
    the_descripts.append(each)
while True:
    list_deps = raw_input("List them (Y/N)? ")
    if list_deps.lower() == "y":
        for i in range(len(links)):
            # short links get an extra tab so the columns line up
            if len(the_links[i]) < 16:
                print the_names[i] + "\t" + the_links[i] + "\t\t" + the_descripts[i]
            else:
                print the_names[i] + "\t" + the_links[i] + "\t" + the_descripts[i]
        break
    elif list_deps.lower() == "n":
        break
    else:
        print "You must enter Y or N."'
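However, my problem comes when I try to use the same regular expressions for the second part (getting the course links, course names, and course descriptions).
Can anyone spot where I'm going wrong? If you need my full code, I can post it or send it to anyone who wants it. I'm just not sure why my regular expressions work for getting the departments but not the courses.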
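One way to narrow this down is to run the three patterns directly against the sample course markup, without any network access. The sketch below assumes the course sample above is representative of the real page. The patterns are the same as in the department code, only written as raw strings, and `course_html` and `counts` are names introduced here for illustration.

```python
import re

# Sample course markup copied from the second page shown above.
course_html = '''<td>
<strong><a href="AERO-A101.shtml">AERO-A 101</a></strong> INTRO TO THE AIR FORCE TODAY<br>
<strong><a href="AERO-A201.shtml">AERO-A 201</a></strong> EVOLUTION USAF AIR & SPACE PWR<br>
<strong><a href="AERO-A301.shtml">AERO-A 301</a></strong> AIR FORCE LEADERSHIP STUDIES<br>
</td>
<td>
<strong><a href="AERO-A401.shtml">AERO-A 401</a></strong> NATL SEC AFFRS/PREP ACTV DUTY<br>
</td>'''

# The three patterns from the department code, unchanged apart from the r prefix.
links = re.findall(r'<strong><a href="[\w./-]+">', course_html)
names = re.findall(r'.shtml">[\w]+[-]?[\w]?[-]?[ ]?[\d]?[\d]?[\d]?</a></strong>', course_html)
descripts = re.findall(r'</a></strong>[\s][\w\s]+[&/-]?[\w\s]+<br', course_html)

# How many of the four sample entries each pattern found.
counts = (len(links), len(names), len(descripts))
print(counts)
```

If every pattern finds all four sample entries, the patterns themselves are probably fine, and the next thing to check is whether `contents` really holds the course page (for example, by printing the URL being opened and the first few hundred characters of what came back).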
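EDIT: It turns out I was calling .lower() on the user's input to build the department page URL, but the input needs to be .upper(), otherwise the request lands on a 404 page.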
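A minimal sketch of that fix, assuming the department page always lives at `<base>/<DEPT>/index.shtml` as in the sample exchange. `department_url` is a hypothetical helper name, not part of the original code.

```python
def department_url(course_page_url, department):
    """Build the department page URL from the course page URL.

    Assumes the course page URL ends in "/index.shtml" and that the
    department directories are upper-case, as in the sample output.
    """
    # Drop the trailing "index.shtml", keeping everything before the last "/".
    base = course_page_url.rsplit('/', 1)[0]
    # Upper-case the user's input so "aero" and "AERO" hit the same page.
    return base + '/' + department.strip().upper() + '/index.shtml'

example = department_url(
    'http://registrar.indiana.edu/browser/soc4148/index.shtml', 'aero')
print(example)
```

Normalizing the input once, at the point where the URL is built, means the Y/N prompt can keep using .lower() while the department lookup uses .upper().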