Question

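I know this isn't very pretty code, and I'm sure there is a simpler way, but I'm more concerned with why Python isn't stripping the characters I asked for.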

import urllib, sgmllib


zip_code = raw_input('Give me a zip code> ')
url = 'http://www.uszip.com/zip/' + zip_code
print url

conn = urllib.urlopen('http://www.uszip.com/zip/' + zip_code)

i = 0
while i < 1000:
    for line in conn.fp:
        if i == 1:
            print line[7:-10]
            i += 1
        elif i == 344:
            line1 = line.strip()
            line2 = line1.strip('<td>') #its not stripping the characters
            print line2[17:-60]
            i += 1
        else:
            i += 1

Answer 1

The way you are calling it, it should remove the characters <, >, t and d, and only at the beginning or end of the string:

>>> '<p>some test</p>'.strip('<td>')
'p>some test</p'

If you want to remove every occurrence of the substring <td>, use replace:

>>> '<td>some test</td>'.replace('<td>', '')
'some test</td>'

Note that if you want to use this for some kind of input sanitization, it is easily circumvented:

>>> '<td<td>>some test</td>'.replace('<td>', '')
'<td>some test</td>'

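That is just one of the many ways people commonly get tripped up when they try to write their own HTML parsing code, so you may prefer to use an HTML parsing library like BeautifulSoup, or an XML parser.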
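As a rough illustration of the parser approach, here is a minimal sketch that uses `html.parser` from the Python 3 standard library rather than BeautifulSoup. The class name `CellCollector` and the helper `td_texts` are made up for this example, and the sample markup in the usage note is not the real page, so treat it as a starting point only.

```python
from html.parser import HTMLParser  # Python 3 standard library


class CellCollector(HTMLParser):
    """Collect the text found inside every <td> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0     # how many <td> elements we are currently inside
        self.cells = []    # text of each completed cell
        self._parts = []   # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "td" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                # Cell finished: join its fragments and trim whitespace.
                self.cells.append("".join(self._parts).strip())
                self._parts = []

    def handle_data(self, data):
        if self.depth:
            self._parts.append(data)


def td_texts(html):
    """Return the text of each <td> cell found in the given HTML string."""
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    return parser.cells
```

Calling `td_texts('<table><tr><td>some test</td><td> 90210 </td></tr></table>')` should give `['some test', '90210']`, with no dependence on line numbers or slice offsets.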

Answer 2

            line2 = line1.strip('<td>') #its not stripping the characters

It doesn't strip the string <td>; it strips the characters in that string. So it will strip off <, >, t and d, at the beginning and end of the string.
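To make the difference concrete, the sketch below compares `strip('<td>')` with a small helper that removes an exact prefix and suffix. The helper `strip_exact` and the sample string are invented for illustration; they are not part of the original code.

```python
def strip_exact(text, prefix, suffix):
    """Remove prefix and suffix once each, only when present as whole substrings."""
    if prefix and text.startswith(prefix):
        text = text[len(prefix):]
    if suffix and text.endswith(suffix):
        text = text[:-len(suffix)]
    return text


cell = "<td>dated</td>"

# strip() treats '<td>' as the character set {'<', 't', 'd', '>'},
# so it also eats the leading 'd' and trailing 'd' of the wanted text.
print(cell.strip("<td>"))                  # ated</
print(strip_exact(cell, "<td>", "</td>"))  # dated
```

Note that `strip` stops at the first character not in the set, which is why the `/` of the closing tag survives in the first result.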

In general, though, pulling data out of a web page by stripping and slicing lines is a poor approach. Take a look at BeautifulSoup for a better one.

Answer 3

Parameters:

Here are the details of the parameters:

chars: characters to be removed from beginning or end of the string.

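It looks like the characters are only removed at the beginning or end of the string. For anything else, I would suggest using a regular expression.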
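One possible shape for the regular-expression route is sketched below. The pattern and the names `TD_PATTERN` and `td_contents` are assumptions made for this example; the pattern only copes with simple, well-formed, non-nested cells and is not a substitute for a real HTML parser.

```python
import re

# Non-greedy capture of whatever sits between an opening and a closing td tag.
# Illustrative only: it does not handle nested tables or malformed markup.
TD_PATTERN = re.compile(r"<td[^>]*>(.*?)</td>", re.IGNORECASE | re.DOTALL)


def td_contents(line):
    """Return the trimmed contents of each simple <td>...</td> pair in line."""
    return [match.strip() for match in TD_PATTERN.findall(line)]
```

For example, `td_contents('<tr><td>some test</td><td class="x"> 90210 </td></tr>')` should return `['some test', '90210']`.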

Stripping characters in Python

3 answers: