I have a small script that uses urllib2 to fetch a site's contents, find all the link tags, tack a small chunk of HTML onto the top and bottom, and then try to prettify the result. It keeps returning TypeError: sequence item 1: expected string, Tag found. I've looked around and can't find the problem. As always, any help is much appreciated.
import urllib2
from BeautifulSoup import BeautifulSoup
import re
reddit = 'http://www.reddit.com'
pre = '<html><head><title>Page title</title></head>'
post = '</html>'
site = urllib2.urlopen(reddit)
html = site.read()
soup = BeautifulSoup(html)
tags = soup.findAll('a')
tags.insert(0, pre)
tags.append(post)
soup1 = BeautifulSoup(''.join(tags))
print soup1.prettify()
Here is the traceback:
Traceback (most recent call last):
  File "C:\Python26\bea.py", line 21, in <module>
    soup1 = BeautifulSoup(''.join(tags))
TypeError: sequence item 1: expected string, Tag found
Answer 0 (score: 2)
This works for me:
soup1 = BeautifulSoup(''.join(str(t) for t in tags))
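Why it fails: after tags.insert(0, pre) the list holds the plain string pre at index 0 followed by BeautifulSoup Tag objects, and str.join() requires every item to be a string, hence "sequence item 1: expected string, Tag found". Rendering each element with str() first avoids that. A minimal sketch of the corrected tail of the question's script (same Python 2 / BeautifulSoup 3 setup):
tags = soup.findAll('a')
tags.insert(0, pre)
tags.append(post)
# render every element (Tag or plain string) to a string before joining
soup1 = BeautifulSoup(''.join(str(t) for t in tags))
print soup1.prettify()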
This pyparsing solution also gives some nice-looking output:
from pyparsing import makeHTMLTags, originalTextFor, SkipTo, Combine
# makeHTMLTags defines HTML tag patterns for given tag string
aTag,aEnd = makeHTMLTags("A")
# makeHTMLTags by default returns a structure containing
# the tag's attributes - we just want the original input text
aTag = originalTextFor(aTag)
aEnd = originalTextFor(aEnd)
# define an expression for a full link, and use a parse action to
# combine the returned tokens into a single string
aLink = aTag + SkipTo(aEnd) + aEnd
aLink.setParseAction(lambda tokens : ''.join(tokens))
# extract links from the input html
links = aLink.searchString(html)
# build list of strings for output
out = []
out.append(pre)
out.extend([' '+lnk[0] for lnk in links])
out.append(post)
print '\n'.join(out)
Prints:
<html><head><title>Page title</title></head>
<a href="http://www.reddit.com/r/pics/" >pics</a>
<a href="http://www.reddit.com/r/reddit.com/" >reddit.com</a>
<a href="http://www.reddit.com/r/politics/" >politics</a>
<a href="http://www.reddit.com/r/funny/" >funny</a>
<a href="http://www.reddit.com/r/AskReddit/" >AskReddit</a>
<a href="http://www.reddit.com/r/WTF/" >WTF</a>
.
.
.
<a href="http://reddit.com/help/privacypolicy" >Privacy Policy</a>
<a href="#" onclick="return hidecover(this)">close this window</a>
<a href="http://www.reddit.com/feedback" >volunteer to translate</a>
<a href="#" onclick="return hidecover(this)">close this window</a>
</html>
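A follow-up note (not part of the original answer): because pyparsing scans the raw HTML string, no Tag objects are involved at all, and originalTextFor keeps the literal <a ...> markup rather than the parsed token structure. If the prettified form the question asked for is still wanted, the assembled list can be fed back into BeautifulSoup, roughly:
# optional: run the assembled document back through BeautifulSoup
# to get the prettified output the question was after
pretty = BeautifulSoup('\n'.join(out))
print pretty.prettify()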
Answer 1 (score: 0)
soup1 = BeautifulSoup(''.join(unicode(tag) for tag in tags))
Answer 2 (score: 0)
Jonathan's answer has some syntax errors; this is the correct version:
soup1 = BeautifulSoup(''.join([unicode(tag) for tag in tags]))
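Side note, not from the original answers: unicode() only exists on Python 2. With the newer bs4 package (the successor of this BeautifulSoup import) on Python 3, str(tag) already returns the markup, so a rough equivalent of the whole script, assuming bs4 and urllib.request (neither is used in the question), might look like:
from urllib.request import urlopen
from bs4 import BeautifulSoup

pre = '<html><head><title>Page title</title></head>'
post = '</html>'

html = urlopen('http://www.reddit.com').read()
soup = BeautifulSoup(html, 'html.parser')

# str(tag) renders the markup, so no unicode() call is needed
pieces = [pre] + [str(a) for a in soup.find_all('a')] + [post]
print(BeautifulSoup(''.join(pieces), 'html.parser').prettify())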