Question

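This is a multi-part question, so I apologize in advance; I'll try to keep it straightforward.

I'm using BeautifulSoup to extract links from a web page; here are the code and the results.

Questions:

I want to exclude the links that do not contain airportname=XXX.
Then I want to follow the airportname=XXX links and search the resulting page for a string of text.

Thank you for your patience and help!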

Answer 1

Part 1

You can use a regular expression:

    import re

    XXX = []
    for result in results:
        # re.search scans the whole string; re.match only matches at the start
        match = re.search(r'(airportname=\w\w\w)', result)
        if match:
            XXX.append(match.group(1))

Part 2

    for url in results:
        # Request the URL, read the response as text, and search that text
        # for the query string, as in Part 1.
        pass
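A minimal sketch of how this step could look, assuming Python 3, pages encoded as UTF-8, and a list of absolute URLs. The names `fetch_text` and `pages_containing`, and the `fetch` parameter, are hypothetical helpers introduced here for illustration; they are not part of the original answer.

```python
from urllib.request import urlopen


def fetch_text(url):
    """Download a page and decode it as text (assumes UTF-8)."""
    with urlopen(url) as response:
        return response.read().decode('utf-8', errors='replace')


def pages_containing(urls, needle, fetch=fetch_text):
    """Return the URLs whose page text contains `needle`.

    `fetch` is a hypothetical hook so the download step can be swapped out,
    for example with a stub when no network is available.
    """
    hits = []
    for url in urls:
        if needle in fetch(url):
            hits.append(url)
    return hits
```

Calling `pages_containing(urls, 'some text')` would download each page in turn. A plain substring test is used here; a regular expression could be substituted if the text to find varies.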

Answer 2

To do this properly, the actual URL would be needed. To decide whether a link is suitable, you could use the following:

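import re
from urllib.request import urlopen  # Python 3 (urllib2.urlopen on Python 2)

from bs4 import BeautifulSoup

html_page = urlopen('http://www.website.com/airports')

soup = BeautifulSoup(html_page, 'html.parser')

for link in soup.find_all('a', href=True):
    href = link['href']

    # Keep only links ending in airportname= plus three word characters
    if re.search(r'airportname=\w\w\w$', href):
        print(href)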

Next, you need to build the full URL from the href you obtained.
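One way to build the full URL is `urljoin` from the standard library, which resolves a relative href against the page it was found on. This is only a sketch: the helper name `absolute_url` and the sample hrefs are made up for illustration, and the base URL is the placeholder used in the code above.

```python
from urllib.parse import urljoin

# Placeholder base: the page the links were scraped from
BASE_URL = 'http://www.website.com/airports'


def absolute_url(href, base=BASE_URL):
    """Resolve a possibly relative href against the page it came from."""
    return urljoin(base, href)


# Hypothetical hrefs, for illustration only
print(absolute_url('/info?airportname=LAX'))
print(absolute_url('details?airportname=JFK'))
```

An href that is already absolute is returned unchanged, so the same call is safe for every link on the page. The resulting URL can then be opened the same way as the first page.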

Python: extract only the links that contain a string, and follow the links with uppercase letters
