Question

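I can't get my program to work, and I've been trying for a long time. Here it is, very simple, but I can't figure it out. It's supposed to return anything that contains "html". Really frustrating. This is for command-line Python 2.x.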

#!/usr/bin/env python

import sys
import re

#Make this program work both on python 2.x and Python 3.x
if (sys.version_info[0] == 3): raw_input = input

import urllib2
url = urllib2.urlopen('http://makeitwork.com/')
data = url.read()
urlsearch = re.findall(r'href=[\'"]?([^\'"]+)' , data)

for x in urlsearch:
    line = x.split()
    print(" %s" %line[0])

Answer 1

Try BeautifulSoup. Never use regex to parse HTML code:

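import urllib2
from bs4 import BeautifulSoup

url = urllib2.urlopen('http://makeitwork.com/')
data = url.read()

# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(data, 'html.parser')
# find_all takes the tag name as a string
for link in soup.find_all('a'):
    print(link.get('href'))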
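
The question asks only for the links that contain "html", so the collected hrefs still need a filter. Below is a minimal sketch of that step. It swaps in the standard library's HTMLParser instead of BeautifulSoup so it runs without extra packages, and it parses a made-up sample page rather than the real site. The names LinkCollector and SAMPLE_HTML are hypothetical, chosen for illustration only.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag (hypothetical helper)."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)


# Made-up sample page; in the real script this would be the downloaded text
SAMPLE_HTML = '''
<a href="index.html">Home</a>
<a href='/docs/guide.html'>Guide</a>
<a href="style.css">Styles</a>
<a name="top">No href here</a>
'''

collector = LinkCollector()
collector.feed(SAMPLE_HTML)

# Keep only the links whose href contains "html"
html_links = [href for href in collector.links if 'html' in href]
for href in html_links:
    print(href)
```

With BeautifulSoup the same filter would presumably be applied to the values returned by link.get('href'), skipping any that are None.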

Answer 2

Try using this regex:

r'a\shref="/?(.*)">'

Basically, it searches for anything after the <a href HTML tag and before the closing >.
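
One thing to watch for with this pattern: `(.*)` is greedy, so when several links share a line it may swallow everything up to the last `">`. The sketch below compares it with a non-greedy variant, `(.*?)`, which is a suggested tweak and not part of the original answer. The sample string is made up for illustration.

```python
import re

# Made-up sample markup with two links on the same line
sample = '<a href="/one.html">One</a> <a href="two.html">Two</a>'

# Pattern from the answer: greedy, runs on to the last "> in the line
greedy = re.findall(r'a\shref="/?(.*)">', sample)

# Assumed tweak: non-greedy, stops at the first "> after each href
lazy = re.findall(r'a\shref="/?(.*?)">', sample)

print(greedy)
print(lazy)
```

Here the greedy pattern returns a single match spanning both tags, while the non-greedy one returns `one.html` and `two.html` separately. Either way, the pattern only matches double-quoted href values.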

Returning multiple "href" values
