Question

I'm trying to write a program that removes all the tags from an HTML document, so I wrote this:

import urllib
loc_left = 0
while loc_left != -1 :
    html_code = urllib.urlopen("http://www.python.org/").read()

    loc_left = html_code.find('<')
    loc_right = html_code.find('>')

    str_in_braket = html_code[loc_left, loc_right + 1]

    html_code.replace(str_in_braket, "")

But it shows the following error message:

lee@Lee-Computer:~/pyt$ python html_braket.py
Traceback (most recent call last):
  File "html_braket.py", line 1, in <module>
    import urllib
  File "/usr/lib/python2.6/urllib.py", line 25, in <module>
    import string
  File "/home/lee/pyt/string.py", line 4, in <module>
    html_code = urllib.urlopen("http://www.python.org/").read()
AttributeError: 'module' object has no attribute 'urlopen'

Interestingly, if I type the same code into the Python interpreter, the error does not appear.

Answer 1

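You have named your script string.py. The urllib module imports it, thinking it is the string module from the stdlib, and your code then uses an attribute of the now partially defined urllib module that does not exist yet. Name your script something else.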
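To check whether a local file is shadowing a standard-library module, one option is to ask the import system where it would load the module from. The sketch below is written for Python 3 (the question uses Python 2.6), and the helper names `module_origin` and `is_loaded_from` are made up for illustration:

```python
import importlib.util
import os


def module_origin(name):
    """Return the location the import system reports for a module, or None if it is not found."""
    spec = importlib.util.find_spec(name)
    return spec.origin if spec is not None else None


def is_loaded_from(name, directory):
    """Return True if the module would be loaded from a file directly inside `directory`."""
    origin = module_origin(name)
    # Built-in or frozen modules report a non-path origin, so they never match a directory.
    if origin is None or not os.path.isabs(origin):
        return False
    return os.path.dirname(os.path.realpath(origin)) == os.path.realpath(directory)


if __name__ == "__main__":
    # If this prints a path inside your own project, a local string.py is shadowing the stdlib.
    print(module_origin("string"))
```

In the situation from the question, `is_loaded_from("string", "/home/lee/pyt")` would be expected to return True until the script is renamed.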

Answer 2

The first step is to download the document so that it is held in a string:

import urllib
html_code = urllib.urlopen("http://www.python.org/").read() # <-- Note: this does not give me any sort of error
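The snippet above is Python 2. On Python 3, `urlopen` lives in `urllib.request` and `read()` returns bytes, so a decoding step is needed. A minimal sketch, assuming UTF-8 as the fallback when the response does not declare a charset (the `fetch_html` name and the fallback are illustrative choices, not part of the original answer):

```python
from urllib.request import urlopen


def fetch_html(url, default_encoding="utf-8"):
    """Download a document and return it as text."""
    with urlopen(url) as response:
        raw = response.read()
        # Use the charset from the Content-Type header when present (assumed fallback: UTF-8).
        charset = response.headers.get_content_charset() or default_encoding
    # Replace undecodable bytes instead of raising, since real pages are often mislabeled.
    return raw.decode(charset, errors="replace")
```

Usage would look like `html_code = fetch_html("http://www.python.org/")`.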

You then have two pretty good options, both of which actually parse the HTML document rather than simply looking for '<' and '>' characters:

Option 1: Use Beautiful Soup

from BeautifulSoup import BeautifulSoup

''.join(BeautifulSoup(html_code).findAll(text=True))

Option 2: Use the built-in Python HTMLParser class

from HTMLParser import HTMLParser

class TagStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = TagStripper()
    s.feed(html)
    return s.get_data()

Example using option 2:

In [22]: strip_tags('<html>hi</html>')
Out[22]: 'hi'
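Option 2 as written targets Python 2, where the class lives in the `HTMLParser` module. On Python 3 the module is `html.parser`, and a port might look like the sketch below. It keeps the same `TagStripper` and `strip_tags` names; the call to `close()` is an addition that flushes any text the parser is still buffering:

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collect only the text nodes of a document, discarding the tags."""

    def __init__(self):
        super().__init__()  # character references such as &amp; are converted by default
        self.fed = []

    def handle_data(self, d):
        self.fed.append(d)

    def get_data(self):
        return ''.join(self.fed)


def strip_tags(html):
    s = TagStripper()
    s.feed(html)
    s.close()  # flush text that has not been passed to handle_data yet
    return s.get_data()


if __name__ == "__main__":
    print(strip_tags('<html>hi</html>'))  # prints: hi
```

Note that the contents of script and style elements are also reported as data, so a real page may need extra filtering.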

If you already have BeautifulSoup available, option 1 is pretty simple. Pasting in the TagStripper class and the strip_tags function is also straightforward.

Good luck!

Simple Python question about urlopen
