Question

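I have the following code (doop.py), which strips a .html file of all the 'nonsense' HTML markup and outputs only the 'human-readable' text; e.g. it will take a file containing the following: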

<html>
<body>

<a href="http://www.w3schools.com">
This is a link</a>

</body>
</html>

and give

$ ./doop.py
File name: htmlexample.html

This is a link

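The next thing I need to do is add a function so that if any of the html arguments within the file represents a URL (a web address), the program reads the contents of the designated web page instead of a disk file. (For present purposes, it is sufficient for doop.py to recognize an argument beginning with http:// (in any mixture of letter cases) as a URL.)

I'm not sure where to start. I'm sure it involves telling Python to open a URL, but how do I do that?

Thanks,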

A

Answer 1

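In addition to urllib2, which others have already mentioned, you can take a look at the Requests module by Kenneth Reitz. It has a much more concise and expressive syntax than urllib2.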

import requests
r = requests.get('https://api.github.com', auth=('user', 'pass'))
r.text

Answer 2

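As with most things Pythonic: there is a library for it.

Here you need the urllib2 library.

This lets you open a URL as if it were a file and read from it like a file.

The code you need looks something like this: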

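import urllib2

urlString = "http://www.my.url"
try:
    f = urllib2.urlopen(urlString)  # open the URL; returns a file-like object
    pageString = f.read()           # read the page content
    f.close()                       # close the connection
    # getReadableText is a placeholder for your own function that strips the HTML
    readableText = getReadableText(pageString)
    # continue using the pageString as you wish
except urllib2.URLError:            # raised when the URL cannot be opened
    print("Bad URL")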
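
For what the question literally asks (treat an argument beginning with http:// in any mixture of letter cases as a URL, and anything else as a disk file), a minimal sketch might look like the one below. It is written for Python 3, where urllib2's urlopen lives in urllib.request. The helper names is_url and read_source are made up here for illustration and are not part of doop.py, and decoding the page as UTF-8 is an assumption.

```python
from urllib.request import urlopen  # Python 3 home of urllib2.urlopen


def is_url(name):
    """Return True if name starts with http:// in any mix of letter cases."""
    return name.lower().startswith("http://")


def read_source(name):
    """Return the text of a web page or of a disk file (hypothetical helper)."""
    if is_url(name):
        # The response object is file-like; read() returns bytes.
        with urlopen(name) as response:
            # Assumption: the page is UTF-8; undecodable bytes are replaced.
            return response.read().decode("utf-8", errors="replace")
    # Anything that is not a URL is treated as a path on disk.
    with open(name, encoding="utf-8") as f:
        return f.read()
```

The string returned by read_source could then be handed to whatever doop.py already does to strip the markup, so the rest of the program would not need to care where the HTML came from.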

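Update: (I don't have a Python interpreter to hand, so I can't test that this code works, but it should!) Opening the URL is the easy part, but first you need to extract the URLs from your HTML file. This is done using regular expressions (regexes), and, unsurprisingly, Python has a library for this (re). I recommend that you read up on regular expressions, but basically they are a pattern against which you can match text.

So what you need to do is write a regex that matches URLs: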

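(http|ftp|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?

If you don't want to follow URLs to ftp resources, remove "ftp|" from the start of the pattern. Now you can scan your input file for all character sequences that match this pattern: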

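import re

# read the whole input file ("htmlexample.html" is the sample file from the question)
with open("htmlexample.html") as input_file:
    input_file_str = input_file.read()
# compile the pattern matcher (raw string, so the backslashes reach the regex engine intact)
pattern = re.compile(r"(http|ftp|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?")
# the pattern contains groups, so findall() would return tuples of groups;
# finditer() yields match objects whose group(0) is the whole URL
for match in pattern.finditer(input_file_str):  # go through the iterator
    urlString = match.group(0)  # get the string that matched the pattern
    # use the code above to load the URL using the matched string!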

That should do it.
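
As a quick check of the pattern, here is a small Python 3 sketch that runs it against the sample HTML from the question. One thing to watch for: because the pattern contains capturing groups, re.findall returns tuples of those groups rather than whole URLs, so the sketch uses finditer and group(0) instead. The names URL_PATTERN, extract_urls and SAMPLE_HTML are invented for this illustration.

```python
import re

# The URL pattern from the answer, with a plain "&" in the character classes.
URL_PATTERN = re.compile(
    r"(http|ftp|https):\/\/[\w\-_]+(\.[\w\-_]+)+"
    r"([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?"
)


def extract_urls(text):
    """Return every substring of text that matches URL_PATTERN (illustrative helper)."""
    # group(0) is the whole match; findall() would give tuples of the three groups.
    return [match.group(0) for match in URL_PATTERN.finditer(text)]


# The sample file from the question.
SAMPLE_HTML = """<html>
<body>

<a href="http://www.w3schools.com">
This is a link</a>

</body>
</html>
"""

print(extract_urls(SAMPLE_HTML))
```

On the sample file this prints ['http://www.w3schools.com'], and each extracted string can then be opened with the URL-loading code shown earlier.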