Question

I often use the urllib2 library to parse web pages in Python. Usually the URL looks like this:

page_url = 'http://www.website.com/webpage.html'

and I use this to read the page:

import urllib2

def read_page_contents(url):
    try:
        request = urllib2.Request(url)
        handle = urllib2.urlopen(request)
        content = handle.read()
    except:
        # added as suggested by contributors below:
        import traceback
        traceback.print_exc()
        content = None
    return content

page = read_page_contents(page_url)
if page is not None:
    # start dealing with page contents
    pass

This works fine, but when I try a URL without an .html extension, such as

page_url = 'https://energyplus.net/weather-region/north_and_central_america_wmo_region_4'

the function fails to read the page and always returns None, with this error message:

raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 403: Forbidden:

I searched Stack Overflow, but found nothing useful for my keywords.

Please help me solve this problem.

Thanks in advance.

----------

I found the answer, thanks to the help of the 2 contributors below:

import requests

def read_page_contents(url):
    try:
        request = requests.get(url)
        content = request.content
    except:
        # added as suggested by contributors below:
        import traceback
        traceback.print_exc()
        content = None
    return content
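
For readers who would rather stay with the standard library, here is a rough sketch of one thing worth trying. It assumes Python 3 (where urllib2 became urllib.request) and it assumes the 403 was triggered by urllib's default User-Agent header, which some servers reject; the original post does not confirm that this is the cause. The User-Agent string and the build_request helper are made up for illustration.

```python
import urllib.request

# Hypothetical value: any browser-like string; not taken from the original post.
BROWSER_USER_AGENT = "Mozilla/5.0 (compatible; example-fetcher/0.1)"


def build_request(url, user_agent=BROWSER_USER_AGENT):
    """Return a Request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def read_page_contents(url):
    """Fetch url and return the raw bytes of the response body."""
    request = build_request(url)
    with urllib.request.urlopen(request) as handle:
        return handle.read()
```

If the server still answers 403 with this header, the block is for some other reason and this sketch will not help.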

Answer 1

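This has nothing to do with the fact that your URL lacks .html. Your code itself is rather messy: page_url appears in one place and continent_url in another, so the code cannot run as posted. I assume that is a copy-paste issue. The real mistake in your code is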

except:
    content = None

Never do this. If you use a bare except that catches every exception, you absolutely must log what was caught:

except:
   import traceback
   traceback.print_exc()
   content = None

You will then see that there is a real problem with the page you are trying to retrieve (it turns out to be a permissions issue).
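
As a minimal sketch of the same idea, the function below catches the specific urllib exceptions instead of using a bare except, so an HTTP status such as 403 is reported separately from a connection failure. It assumes Python 3, where urllib2 was split into urllib.request and urllib.error; the printed messages are illustrative wording, not output from the original post.

```python
import urllib.error
import urllib.request


def read_page_contents(url):
    """Return the page body as bytes, or None if the fetch failed."""
    try:
        with urllib.request.urlopen(url) as handle:
            return handle.read()
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 403.
        print("HTTP error %d (%s) for %s" % (err.code, err.reason, url))
    except urllib.error.URLError as err:
        # No usable response at all: DNS failure, refused connection, etc.
        print("Could not reach %s: %s" % (url, err.reason))
    return None
```

HTTPError is a subclass of URLError, so it has to be listed first; otherwise the more general handler would swallow it and the status code would not be reported.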

Answer 2

Use requests and save your time for more meaningful things.

r.status_code: 200

How can I read a web page without an .htm* extension using Python?
