Question

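I am trying to get all the links for articles (which happen to have the class 'title may-blank' to mark them). I am trying to figure out why the code below prints a whole bunch of "href=" when I run it, instead of returning the actual URLs. After the 25 failed article URLs (all 'href='), I also get a bunch of random text and links, and I am not sure why, since the loop should stop once it no longer finds the class 'title may-blank'. Can you help me find out what is wrong?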

import urllib2

def get_page(page):

    response = urllib2.urlopen(page)
    html = response.read()
    p = str(html)
    return p

def get_next_target(page):
    start_link = page.find('title may-blank')
    start_quote = page.find('"', start_link + 4)
    end_quote = page.find ('"', start_quote + 1)
    aurl = page[start_quote+1:end_quote] # Gets Article URL
    return aurl, end_quote

def print_all_links(page):
    while True:
        aurl, endpos = get_next_target(page)
        if aurl:
            print("%s" % (aurl))
            print("")
            page = page[endpos:]
        else:
            break

reddit_url = 'http://www.reddit.com/r/worldnews'

print_all_links(get_page(reddit_url))

Answer 1

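Rawing is correct, but when I am faced with an XY problem I prefer to show the best way to accomplish X rather than a way to fix Y. You should use an HTML parser such as BeautifulSoup to parse web pages: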

from bs4 import BeautifulSoup
import urllib2

def print_all_links(page):
    html = urllib2.urlopen(page).read()
    soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly
    for a in soup.find_all('a', 'title may-blank ', href=True):
        print(a['href'])

If you are really allergic to HTML parsers, at least use a regular expression (even though you should stick with HTML parsing):

import urllib2
import re

def print_all_links(page):
    html = urllib2.urlopen(page).read()
    for href in re.findall(r'<a class="title may-blank " href="(.*?)"', html):
        print(href)
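
As a quick offline check of the regular expression above, here is a minimal sketch that runs it against a hand-written snippet instead of the live page. The sample markup and the helper name extract_links are assumptions made for illustration; the real Reddit markup may differ.

```python
import re

# Hypothetical sample markup, modelled on the class name used in the question.
SAMPLE_HTML = (
    '<a class="title may-blank " href="http://example.com/a">First</a>'
    '<a class="other" href="http://example.com/skip">Skip</a>'
    '<a class="title may-blank " href="http://example.com/b">Second</a>'
)

# Same pattern as above; the non-greedy group stops at the first closing quote.
LINK_RE = re.compile(r'<a class="title may-blank " href="(.*?)"')


def extract_links(html):
    """Return the href of every anchor whose markup matches LINK_RE."""
    return LINK_RE.findall(html)


for href in extract_links(SAMPLE_HTML):
    print(href)
```

Note that on Python 3 the body returned by urlopen(...).read() is bytes, so it would need to be decoded to text before a str pattern like this one can be applied to it.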

Answer 2

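That is because the line

start_quote = page.find('"', start_link + 4)

does not do what you think it does. start_link is set to the index of "title may-blank". So if you do a page.find starting at start_link + 4, you actually start searching at "e may-blank". The first quote found from there is the one that closes the class attribute, the next quote is the one that opens the href value, and the text between them is " href=". If you change

start_quote = page.find('"', start_link + 4)

to

start_quote = page.find('"', start_link + len('title may-blank') + 1)

it should work, provided the closing quote of the class attribute comes immediately after the class name.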
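
The same idea can be sketched as a self-contained version that runs on a sample string. This is only one possible fix under stated assumptions: the sample markup is made up, the helper collect_links is hypothetical, and instead of a fixed offset the code searches for href=" after the class marker, so it does not depend on whether the class attribute ends with a trailing space. It also returns None as soon as find reports -1; without such a check the search restarts near the beginning of the remaining text, which would explain the unrelated text printed after the article links.

```python
MARKER = 'title may-blank'
HREF = 'href="'


def get_next_target(page):
    """Return (url, end_index) for the next marked link, or (None, -1)."""
    start_link = page.find(MARKER)
    if start_link == -1:
        # No more marked links: report that explicitly instead of slicing.
        return None, -1
    # Look for the href attribute that follows the class marker.
    href_start = page.find(HREF, start_link + len(MARKER))
    if href_start == -1:
        return None, -1
    url_start = href_start + len(HREF)
    end_quote = page.find('"', url_start)
    if end_quote == -1:
        return None, -1
    return page[url_start:end_quote], end_quote


def collect_links(page):
    """Collect every marked URL, stopping once no marker is left."""
    links = []
    while True:
        url, endpos = get_next_target(page)
        if url is None:
            break
        links.append(url)
        page = page[endpos:]
    return links


# Hypothetical sample markup (an assumption, not real Reddit output).
SAMPLE = (
    '<a class="title may-blank " href="http://example.com/a">A</a>'
    '<p class="note">unrelated "quoted" text</p>'
    '<a class="title may-blank " href="http://example.com/b">B</a>'
    '<p>trailing "text" after the last link</p>'
)

for url in collect_links(SAMPLE):
    print(url)
```

Returning the links as a list rather than printing inside the loop keeps the parsing logic easy to test on its own.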
