Suppose I have the following HTML read into my program:
<p><a href="http://vancouver.en.craigslist.ca/nvn/ret/1817849271.html">F/T & P/T Sales Associate - Caliente Fashions</a> - <font size="-1"> (North Vancouver)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/van/ret/1817804151.html">IMMEDIATE EMPLOYMENT WANTED!</a> - </p>
<p><a href="http://vancouver.en.craigslist.ca/nvn/ret/1817796152.html">TRAVEL AGENT</a> - <font size="-1"> (NORTH VANCOUVER)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/bnc/ret/1817775400.html">Optical Sales Position</a> - <font size="-1"> (New Westminster)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/van/ret/1817709780.html">Sales Clerk</a> - <font size="-1"> (Kits)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/van/ret/1817676850.html">MARINE SALES</a> - <font size="-1"> (VANCOUVER ( KITS ))</font></p>
<p><a href="http://vancouver.en.craigslist.ca/van/ret/1817608506.html">Retail Sales Associate</a> - <font size="-1"> (Vancouver)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/van/ret/1817573985.html">Retail with small parts appliance background</a> - </p>
<p><a href="http://vancouver.en.craigslist.ca/rds/ret/1817540938.html">Manager *Enjoyable work atmosphere</a> - <font size="-1"> (Langley Centre)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/bnc/ret/1817403652.html">Team Member - Retail Store - FT</a> - <font size="-1"> (Burnaby South)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/rds/ret/1817459155.html">STORE MANAGER-SHOE WAREHOUSE</a> - <font size="-1"> (South Surrey-Semiahmoo)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/pml/ret/1817448777.html">Retail Sales</a> - <font size="-1"> (Coquitlam)</font></p>
How do I get the contents of the text nodes? What I'd ultimately like is to print something like this line to the terminal:
http://vancouver.en.craigslist.ca/nvn/ret/1817849271.html - TRAVEL AGENT
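So far I have the following code, which extracts the href links, but I'm not sure how to extract the data itself. I'm thinking of overriding handle_data(self, data) from the sgmllib.py module, but so far I haven't been able to come up with a way to do it.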
    from sgmllib import SGMLParser

    class URLLister(SGMLParser):
        def reset(self):
            SGMLParser.reset(self)
            self.urls = []

        def start_a(self, attrs):
            # attrs is a list of (name, value) tuples
            href = [v for k, v in attrs if k == "href"]
            if href:
                self.urls.extend(href)
Thanks!
Answer 0 (score: 8)
The simplest option is probably BeautifulSoup (make sure to use version 3.0.8 or later in the 3.0.* series, not 3.1.*, unless you're using Python 3; see here!).
    import BeautifulSoup

    soup = BeautifulSoup.BeautifulSoup(thehtmlstring)
    for anchor in soup.findAll('a'):
        print anchor['href'], anchor.string
BeautifulSoup yields unicode strings. If that's a problem, be sure to encode them with the encoding you want in order to get byte strings.
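As a minimal sketch of that encoding step: the title below is a made-up unicode string standing in for a value such as anchor.string, and UTF-8 is an assumed target encoding (pick whatever your terminal or file expects).

```python
# Hypothetical unicode value, standing in for text pulled out of the parsed HTML.
title = u"Caf\u00e9 Sales Associate"

# Encode explicitly to obtain a byte string; UTF-8 is an assumption here.
encoded = title.encode("utf-8")

# Decoding with the same codec round-trips back to the original unicode text.
decoded = encoded.decode("utf-8")

print(repr(encoded))
```

Encoding explicitly at the point of output keeps everything upstream in unicode, which avoids mixing text and byte strings inside the program.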
Answer 1 (score: 4)
Personally, I'd use lxml. Once it's installed, getting what you want is simple:
from lxml import html
tree = html.fromstring(open("data.html").read())
print [e.text_content() for e in tree.xpath("//a")]
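Answer 2 (score: 2)
SGMLParser is deprecated as of Python 2.6 and will be gone in 3.0. You probably want to use the HTMLParser module instead. I had never used it before (I always just use BeautifulSoup for this kind of thing), so I figured I would learn how it works. Here is a sample script I put together that gives you what you're after.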
    #!/usr/bin/env python
    from HTMLParser import HTMLParser

    class URLParser(HTMLParser):
        def __init__(self):
            self.in_link = False      # True while inside an <a> element
            self.links = []           # finished "href - text" strings
            self.current_link = ''
            HTMLParser.__init__(self)

        def handle_starttag(self, tag, attrs):
            if tag == 'a':
                self.current_link = self.get_href_from_attrs(attrs)
                self.in_link = True

        def handle_endtag(self, tag):
            if tag == 'a':
                self.links.append(self.current_link)
                self.in_link = False

        def handle_data(self, data):
            # Called for each run of text; append it to the link being built
            if self.in_link:
                self.current_link = '%s - %s' % (self.current_link, data)

        def get_href_from_attrs(self, attrs):
            # attrs is a list of tuples like:
            # [('href', 'www.google.com'), ('class', 'some-class')]
            for prop, val in attrs:
                if prop == 'href':
                    return val
            return ''

    if __name__ == '__main__':
        the_html = '''
<p><a href="http://vancouver.en.craigslist.ca/nvn/ret/1817849271.html">F/T & P/T Sales Associate - Caliente Fashions</a> - <font size="-1"> (North Vancouver)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/van/ret/1817804151.html">IMMEDIATE EMPLOYMENT WANTED!</a> - </p>
<p><a href="http://vancouver.en.craigslist.ca/nvn/ret/1817796152.html">TRAVEL AGENT</a> - <font size="-1"> (NORTH VANCOUVER)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/bnc/ret/1817775400.html">Optical Sales Position</a> - <font size="-1"> (New Westminster)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/van/ret/1817709780.html">Sales Clerk</a> - <font size="-1"> (Kits)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/van/ret/1817676850.html">MARINE SALES</a> - <font size="-1"> (VANCOUVER ( KITS ))</font></p>
<p><a href="http://vancouver.en.craigslist.ca/van/ret/1817608506.html">Retail Sales Associate</a> - <font size="-1"> (Vancouver)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/van/ret/1817573985.html">Retail with small parts appliance background</a> - </p>
<p><a href="http://vancouver.en.craigslist.ca/rds/ret/1817540938.html">Manager *Enjoyable work atmosphere</a> - <font size="-1"> (Langley Centre)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/bnc/ret/1817403652.html">Team Member - Retail Store - FT</a> - <font size="-1"> (Burnaby South)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/rds/ret/1817459155.html">STORE MANAGER-SHOE WAREHOUSE</a> - <font size="-1"> (South Surrey-Semiahmoo)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/pml/ret/1817448777.html">Retail Sales</a> - <font size="-1"> (Coquitlam)</font></p>
'''
        url_parser = URLParser()
        url_parser.feed(the_html)
        print '\n'.join(url_parser.links)
Output
http://vancouver.en.craigslist.ca/nvn/ret/1817849271.html - F/T - P/T Sales Associate - Caliente Fashions
http://vancouver.en.craigslist.ca/van/ret/1817804151.html - IMMEDIATE EMPLOYMENT WANTED!
http://vancouver.en.craigslist.ca/nvn/ret/1817796152.html - TRAVEL AGENT
http://vancouver.en.craigslist.ca/bnc/ret/1817775400.html - Optical Sales Position
http://vancouver.en.craigslist.ca/van/ret/1817709780.html - Sales Clerk
http://vancouver.en.craigslist.ca/van/ret/1817676850.html - MARINE SALES
http://vancouver.en.craigslist.ca/van/ret/1817608506.html - Retail Sales Associate
http://vancouver.en.craigslist.ca/van/ret/1817573985.html - Retail with small parts appliance background
http://vancouver.en.craigslist.ca/rds/ret/1817540938.html - Manager *Enjoyable work atmosphere
http://vancouver.en.craigslist.ca/bnc/ret/1817403652.html - Team Member - Retail Store - FT
http://vancouver.en.craigslist.ca/rds/ret/1817459155.html - STORE MANAGER-SHOE WAREHOUSE
http://vancouver.en.craigslist.ca/pml/ret/1817448777.html - Retail Sales
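Update: After this little exercise, the interface feels pretty clunky, so I'd just stick with the much cleaner BeautifulSoup library. See Alex's sample for how it's done.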
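For readers on Python 3, where the module is named html.parser, the same idea can be sketched roughly as follows. This is an illustration rather than a port of the script above: the class name LinkTextParser and its attributes are made up for the example, and the sample markup writes the ampersand as &amp;. Text chunks are collected in a list and joined once the closing tag is seen, so a title is kept in one piece even if the parser delivers it in several calls.

```python
from html.parser import HTMLParser


class LinkTextParser(HTMLParser):
    """Collect (href, text) pairs for every <a> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open, or None
        self._chunks = []    # text pieces seen inside the open <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) tuples
            self._href = dict(attrs).get("href", "")
            self._chunks = []

    def handle_data(self, data):
        # Only keep text that appears between <a> and </a>
        if self._href is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._chunks).strip()))
            self._href = None


sample = (
    '<p><a href="http://vancouver.en.craigslist.ca/nvn/ret/1817849271.html">'
    'F/T &amp; P/T Sales Associate - Caliente Fashions</a> - '
    '<font size="-1"> (North Vancouver)</font></p>\n'
    '<p><a href="http://vancouver.en.craigslist.ca/nvn/ret/1817796152.html">'
    'TRAVEL AGENT</a> - <font size="-1"> (NORTH VANCOUVER)</font></p>\n'
)

parser = LinkTextParser()
parser.feed(sample)
parser.close()

for href, text in parser.links:
    print(href, "-", text)
```

Storing pairs instead of pre-formatted strings leaves the choice of separator to the caller.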
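Answer 3 (score: 1)
As long as we're comparing options, this pyparsing snippet also gives you the location of each posting, which is given in the <font> tag that follows the <a> tag: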
    from pyparsing import makeHTMLTags, SkipTo

    a, aEnd = makeHTMLTags("A")
    font, fontEnd = makeHTMLTags("FONT")
    p, pEnd = makeHTMLTags("P")

    patt = (p + a("a") + SkipTo(aEnd)("posn") + aEnd + '-' +
            font + SkipTo(fontEnd)("locn") + fontEnd + pEnd)

    for tokens, _, _ in patt.scanString(the_html):
        print tokens.a.href, '-', tokens.posn, tokens.locn
Gives:
http://vancouver.en.craigslist.ca/nvn/ret/1817849271.html - F/T & P/T Sales Associate - Caliente Fashions (North Vancouver)
http://vancouver.en.craigslist.ca/nvn/ret/1817796152.html - TRAVEL AGENT (NORTH VANCOUVER)
http://vancouver.en.craigslist.ca/bnc/ret/1817775400.html - Optical Sales Position (New Westminster)
http://vancouver.en.craigslist.ca/van/ret/1817709780.html - Sales Clerk (Kits)
http://vancouver.en.craigslist.ca/van/ret/1817676850.html - MARINE SALES (VANCOUVER ( KITS ))
http://vancouver.en.craigslist.ca/van/ret/1817608506.html - Retail Sales Associate (Vancouver)
http://vancouver.en.craigslist.ca/rds/ret/1817540938.html - Manager *Enjoyable work atmosphere (Langley Centre)
http://vancouver.en.craigslist.ca/bnc/ret/1817403652.html - Team Member - Retail Store - FT (Burnaby South)
http://vancouver.en.craigslist.ca/rds/ret/1817459155.html - STORE MANAGER-SHOE WAREHOUSE (South Surrey-Semiahmoo)
http://vancouver.en.craigslist.ca/pml/ret/1817448777.html - Retail Sales (Coquitlam)
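Note that the two postings without a <font> element (1817804151 and 1817573985) are absent from this output, since the pattern requires one. If those rows should be kept too, one possible sketch, using the standard library's re module rather than pyparsing, makes the location part optional. The pattern name ROW is made up for this example, and a regular expression like this is only reasonable for markup as uniform as the sample.

```python
import re

# Hypothetical pattern for rows shaped like the sample:
# group 1 = href, group 2 = link text, group 3 = location (optional).
ROW = re.compile(
    r'<a href="([^"]+)">(.*?)</a>\s*-\s*(?:<font[^>]*>\s*(.*?)\s*</font>)?',
    re.S,
)

the_html = '''
<p><a href="http://vancouver.en.craigslist.ca/van/ret/1817804151.html">IMMEDIATE EMPLOYMENT WANTED!</a> - </p>
<p><a href="http://vancouver.en.craigslist.ca/nvn/ret/1817796152.html">TRAVEL AGENT</a> - <font size="-1"> (NORTH VANCOUVER)</font></p>
'''

# findall returns one (href, title, location) tuple per row;
# the location is an empty string when the <font> element is missing.
rows = ROW.findall(the_html)

for href, title, locn in rows:
    print(" ".join(part for part in (href, "-", title, locn) if part))
```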
Answer 4 (score: 0)
    # Download the BeautifulSoup library for Python first
    from BeautifulSoup import BeautifulSoup

    fh = open('data.html')
    html = fh.read()
    soup = BeautifulSoup(html)
    tags = soup('a')   # shorthand for soup.findAll('a')
    for tag in tags:
        print tag.contents[0]