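I've been going through the documentation, but it doesn't cover this. I'm trying to extract all the text and all the links, but not separately. I want them interleaved so the context is preserved, ending up with a single list in which text and links appear in document order. Is this even possible with BeautifulSoup?
Answer 0 (score: 0)
Yes, this is definitely possible.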
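# Python 2 with BeautifulSoup 3
import urllib2
import BeautifulSoup

# Fetch the page and parse the response body
request = urllib2.Request("http://www.example.com")
response = urllib2.urlopen(request)
soup = BeautifulSoup.BeautifulSoup(response)

# Print every <a> tag in document order
for a in soup.findAll('a'):
    print a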
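Breaking this snippet down: you make a request to the website (www.example.com in this case) and parse the response with BeautifulSoup. Your requirement is to find all the links and text while preserving context. The output of the code above looks like this: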
<a href="/"><img src="/_img/iana-logo-pageheader.png" alt="Homepage" /></a>
<a href="/domains/">Domains</a>
<a href="/numbers/">Numbers</a>
<a href="/protocols/">Protocols</a>
<a href="/about/">About IANA</a>
<a href="/go/rfc2606">RFC 2606</a>
<a href="/about/">About</a>
<a href="/about/presentations/">Presentations</a>
<a href="/about/performance/">Performance</a>
<a href="/reports/">Reports</a>
<a href="/domains/">Domains</a>
<a href="/domains/root/">Root Zone</a>
<a href="/domains/int/">.INT</a>
<a href="/domains/arpa/">.ARPA</a>
<a href="/domains/idn-tables/">IDN Repository</a>
<a href="/protocols/">Protocols</a>
<a href="/numbers/">Number Resources</a>
<a href="/abuse/">Abuse Information</a>
<a href="http://www.icann.org/">Internet Corporation for Assigned Names and Numbers</a>
<a href="mailto:iana@iana.org?subject=General%20website%20feedback">iana@iana.org</a>
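The output above contains only the <a> tags, so the text around them is lost. To get text and links interleaved, one option is to walk the parse tree in document order and record each text node and each link as it appears. The following is a minimal sketch, not the only way to do it. It assumes Python 3 and the bs4 package (BeautifulSoup 4) rather than the Python 2 / BeautifulSoup 3 setup used above. The function name interleave_text_and_links and the ("text", ...) / ("link", ...) tuple layout are made up for illustration.

```python
from bs4 import BeautifulSoup, NavigableString, Tag


def interleave_text_and_links(html):
    """Return text fragments and links in document order.

    Plain text becomes ("text", string); each <a href="..."> becomes
    ("link", link_text, href). The tuple layout is an assumption
    chosen for this example.
    """
    soup = BeautifulSoup(html, "html.parser")
    items = []

    def walk(node):
        for child in node.children:
            if type(child) is NavigableString:
                # Ordinary text node (comments and doctypes are
                # subclasses, so they are skipped by this check).
                text = child.strip()
                if text:
                    items.append(("text", text))
            elif isinstance(child, Tag):
                if child.name == "a" and child.has_attr("href"):
                    # Record the link as one item, without descending
                    # into it, so its text is not emitted twice.
                    items.append(
                        ("link", child.get_text(strip=True), child["href"])
                    )
                elif child.name not in ("script", "style"):
                    walk(child)

    walk(soup)
    return items


# Small inline sample so the sketch runs without network access.
sample = (
    '<p>See the <a href="/domains/">Domains</a> page and '
    '<a href="/numbers/">Numbers</a>.</p>'
)
for item in interleave_text_and_links(sample):
    print(item)
```

To run it against a live page, pass the downloaded HTML string to the function instead of the sample. Links that wrap only an image, like the first one in the output above, come back with an empty link text.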