Question

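I am trying to extract URLs from HTML, but the regular expression does not seem to match. Can anyone spot the problem? It does work when I run it against only a small slice of the HTML from my site (that line is now commented out).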

I know about Scrapy, BeautifulSoup, and so on, but due to restrictions I don't want to use those modules.

    page="ANY-XYZ-WEBSITE"

    def extract_first_link():
        urlopener=urllib.urlopen(page)
        html=str(urlopener.read())
        matchObj = re.match( '<a href="(.*)/([0-9a-zA-Z-]+)"', html, re.I)
        #k = open ("file.txt",'w')
        #k.write(html)
        #print "matchObj.group() : ", matchObj.group(1)
        #matchObj = re.match( '<a href="(.*)/([0-9a-zA-Z-]+)"', html[4111:4150], re.M|re.I)
        print "matchObj.group() : ", matchObj.group()
        print "matchObj.group() : ", matchObj.group(1)
        print "matchObj.group() : ", matchObj.group(2)

    if __name__=="__main__":
        print extract_first_link()

Answer 1

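re.match only tries to match at the very beginning of the string, whereas re.search scans through the whole string and returns the first location where the pattern matches. A full page almost never starts with `<a href=`, so re.match returns None and calling .group() on it fails. That is presumably also why the commented-out version worked: the slice html[4111:4150] happened to start right at the link.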

This is described here: https://docs.python.org/2/library/re.html
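
A minimal sketch of the difference, using the pattern from the question. The `html` string is a made-up sample standing in for the downloaded page, and the variable names are only illustrative. The code avoids Python 2-only syntax, so it should run on Python 3 as well.

    import re

    # Hypothetical sample page; the real HTML would come from urlopener.read().
    html = ('<html><body><p>intro</p>'
            '<a href="http://example.com/docs/some-page">link</a>'
            '</body></html>')

    # Same pattern as in the question, written as a raw string.
    pattern = r'<a href="(.*)/([0-9a-zA-Z-]+)"'

    # re.match is anchored at position 0, and the page starts with "<html>",
    # so there is no match and the result is None.
    match_result = re.match(pattern, html, re.I)

    # re.search scans forward and returns the first match anywhere in the string.
    search_result = re.search(pattern, html, re.I)

    # Always check for None before calling .group().
    if search_result is not None:
        print(search_result.group(1))  # everything before the last "/"
        print(search_result.group(2))  # the final path segment

Be aware that `(.*)` is greedy, so on a line containing several links the first group may run past the first href; restricting it to `[^"]*` is one common way to keep the match inside a single attribute.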
