Question

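My great CS teacher gave me a sweet summer assignment: build an automatic "wiki game". You give it two pages as arguments, and it finds the shortest path between them. Anyway, I'm using the urllib, urllib2 and re modules. So I googled "how to get all the links from HTML in Python" or something like that, and found this: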

links = re.findall('"((http|ftp)s?://.*?)"', html)

It works for every other site, every one except Wikipedia. On a wiki page it doesn't seem to find anything besides the current page itself.

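Here is the entire code of my project, if you want to check it (it isn't finished and it isn't the game yet; for now I'm only printing the pages):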

import urllib,urllib2,re

def wikiexists (_strvalue):
    errorr='Wikipedia does not'
    _strvalue= _strvalue.replace(" ","_")
    try:
        page=urllib2.urlopen(('http://en.wikipedia.org/wiki/%s') % (_strvalue,))
        return True
    except:
        return False

def openwikiurl (_string):
    _string= _string.replace(" ","_")
    page=urllib2.urlopen(('http://en.wikipedia.org/wiki/%s') % (_string,))
    return page

def DaGame (start,end,maxnum):
    if wikiexists(start)==False or wikiexists(end)==False:
        print "One of your pages doesn't exist!"
    else:
        shortest (openwikiurl(start),openwikiurl(end),0,maxnum)

def shortest (current,target,now,maxnumber):
    if now>maxnumber:
        print "sorry too many attempts"
    if current is target:
        print """The target page is found!!!
                 Shortest path: """,now
    else:

        html=current.read()
        links = re.findall('"((http|ftp)s?://.*?)"', html)
        matches=filter (removestuff,links)
        print matches

def removestuff (tuplez):
        return True if "http://en.wikipedia.org/wiki/" in tuplez[0] else False


DaGame ('Florida','USA',5)

By the way, in "def shortest():" I also tried printing "links" itself, not only the filtered version, and that didn't give me what I wanted either.

Thanks a lot.

Answer 1

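It doesn't work because the links on Wiki pages are relative (so they don't start with http). So you can do one of two things: write a regex that detects all <a href="/some/relative/url"... elements (and captures the link from there), or use an HTML parser library that does the job for you :)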
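
As a rough illustration of the first option, here is a minimal sketch of such a regex. The names `HREF_RE`, `find_hrefs` and `sample_html` are made up for this example, and the pattern only handles double-quoted href values, so treat it as a starting point rather than a robust HTML matcher:

```python
import re

# Captures the href value of <a ...> tags. Deliberately simple: it only
# understands double-quoted attributes and is not a real HTML parser.
HREF_RE = re.compile(r'<a\s[^>]*?href="([^"]*)"', re.IGNORECASE)


def find_hrefs(html):
    """Return every double-quoted href value found in <a> tags."""
    return HREF_RE.findall(html)


# Hypothetical snippet standing in for a downloaded page.
sample_html = (
    '<p><a href="/wiki/Miami" title="Miami">Miami</a> and '
    '<a class="external" href="http://example.org/">an external site</a></p>'
)

print(find_hrefs(sample_html))
```

Unlike the pattern in the question, this one also returns relative links such as /wiki/Miami, because it keys on the href attribute instead of on an http or ftp prefix.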

Answer 2

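If you look at the page source of a Wikipedia page, you'll see that it contains relative links pointing to other Wikipedia pages. Those links don't contain the http or ftp substring that your regex looks for. A better mechanism is to use a regex that finds all <a> tags, or a parser that looks for HTML tags. That is straightforward, and you can then rebuild the real (absolute) link from the href.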
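
A minimal sketch of the parser route, using only the standard library. It is written with the Python 3 module names (`html.parser`, `urllib.parse`); in Python 2.x the same pieces live in the `HTMLParser` and `urlparse` modules. `LinkCollector`, `absolute_links` and the sample HTML are hypothetical names chosen for this example:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # urljoin turns "/wiki/Miami" into a full URL and
                # leaves links that are already absolute unchanged.
                self.links.append(urljoin(self.base_url, value))


def absolute_links(html, base_url):
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


# Hypothetical snippet standing in for a downloaded page.
sample_html = (
    '<a href="/wiki/Miami">Miami</a> '
    '<a href="http://example.org/">external</a>'
)
links = absolute_links(sample_html, "http://en.wikipedia.org/wiki/Florida")
print(links)
```

Resolving against the URL of the page that was fetched is what turns a relative href back into something that can be opened with urlopen.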

Answer 3

You could try something like this:

import bs4

soup = bs4.BeautifulSoup(current.read(), "html.parser")
for tag in filter(None, map(WikiTag, soup.find_all("a", href=True))):
    print tag  # convert the tag into a url and do something

where WikiTag is something like:

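def WikiTag(link):
    # Keep only article links: they start with /wiki/ ...
    if not link["href"].startswith("/wiki/"):
        return None
    tag = link["href"][6:]
    # ... and have no namespace prefix such as File: or Help:
    if ":" in tag:
        return None
    return tag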
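
The "convert the tag into a url" step is left open above. One possible way to do it, as a sketch: `WIKI_PREFIX`, `wiki_tag` and `tag_to_url` are names invented for this example, and plain dicts stand in for the BeautifulSoup tag objects (both support `link["href"]`), so the snippet runs without bs4 installed:

```python
# Same base URL the question uses when opening pages.
WIKI_PREFIX = "http://en.wikipedia.org/wiki/"


def wiki_tag(link):
    """Same filtering idea as WikiTag: return the article name or None."""
    href = link["href"]
    if not href.startswith("/wiki/"):
        return None
    tag = href[len("/wiki/"):]
    if ":" in tag:  # namespaced pages such as File: are skipped
        return None
    return tag


def tag_to_url(tag):
    """Turn an article name back into a full URL."""
    return WIKI_PREFIX + tag


# Dicts standing in for the <a> tags that soup.find_all would return.
sample_links = [
    {"href": "/wiki/Miami"},
    {"href": "/wiki/File:Flag_of_Florida.svg"},
    {"href": "#History"},
]
urls = [tag_to_url(t) for t in map(wiki_tag, sample_links) if t]
print(urls)
```

Only the first sample link survives the filter; the file page and the in-page anchor are dropped.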

Python 2.x beginner - re.findall() doesn't find every link I need
