Here is an example of the HTML I'm scraping with Python / BeautifulSoup:
<dl>
<dd>
<strong>
<a name="45790" href="http://www.eslcafe.com/jobs/china/index.cgi?read=45790">Monthly 18000rmb ESL teachers for Shanghai Webi centers</a>
</strong>
<br>
Webi English Shanghai -- Tuesday, 7 March 2017, at 2:17 p.m.
</dd>
<dd></dd>
<dd></dd>
<dd></dd>
</dl>
I'm able to grab the <a href>, but I can't get the text after the <br>, even though I've tried different loops.
Here is my script:
import bs4 as bs
import urllib.request

sauce = urllib.request.urlopen('http://www.eslcafe.com/jobs/china/').read()
soup = bs.BeautifulSoup(sauce, 'html.parser')

dl = soup.dl
ads = []
for words in dl.find_all('a'):
    links = words.get('href')
    link_text = words.text
    link_text = link_text.lower()
    if 'university' in link_text:
        ads.append([links, link_text])
    if 'universities' in link_text:
        ads.append([links, link_text])
    if 'college' in link_text:
        ads.append([links, link_text])
    if 'colleges' in link_text:
        ads.append([links, link_text])

for ad in ads:
    for job in ad:
        print(job)
    print("")
There's also a problem with duplicates being added to the list when the text contains more than one of the search terms, but I can deal with that later.
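One way to avoid those duplicates (a sketch, reusing the dl and ads variables from the script above; the keyword tuple is only illustrative) is to test all the search terms with a single any() check, so each ad is appended at most once:

keywords = ('university', 'universities', 'college', 'colleges')
for words in dl.find_all('a'):
    links = words.get('href')
    link_text = words.text.lower()
    # append once if any of the search terms matches
    if any(term in link_text for term in keywords):
        ads.append([links, link_text])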
I think what I want is a list of lists containing link, link_text and date_text:
ads = [[link, link_text, date_text], [link, link_text, date_text]]
Right now I can only get the link and the link_text.
Any suggestions?
Answer 0 (Score: 0)
In [31]: for dd in soup.find_all('dd'):
    ...:     link = dd.a.get('href')
    ...:     link_text = dd.a.text
    ...:     *_, dd_text = dd.stripped_strings
    ...:     print(link, link_text, dd_text, sep='\n')
Out:
http://www.eslcafe.com/jobs/china/index.cgi?read=45391
Teach English in Shenyang, China: Great salary, Support, and Structured program
Greenheart Travel -- Thursday, 9 February 2017, at 1:05 p.m.
dd_text is the last text node inside the dd tag, so I use *_ to soak up all the text nodes that come before it.
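As a quick illustration of how that unpacking works (a standalone sketch with made-up markup, not taken from the page):

import bs4

snippet = '<dd><strong><a href="#">Job title</a></strong><br>Poster -- date line</dd>'
dd = bs4.BeautifulSoup(snippet, 'html.parser').dd
# stripped_strings yields 'Job title' and then 'Poster -- date line';
# *_ absorbs everything except the final string.
*_, dd_text = dd.stripped_strings
print(dd_text)  # Poster -- date line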
Edit:
In [20]: for dd in soup.find_all('dd'):
    ...:
    ...:     d = {}  # store data in a dict
    ...:     d['link'] = dd.a.get('href')
    ...:     d['link_text'] = dd.a.text
    ...:     *_, dd_text = dd.stripped_strings
    ...:     d['date_text'] = dd_text
    ...:     print(d)
Out:
{'date_text': 'EnglishTeacherChina.com -- Sunday, 12 February 2017, at 1:45 '
'p.m.',
'link': 'http://www.eslcafe.com/jobs/china/index.cgi?read=45426',
'link_text': '❤ ❤ ❤ Teach English In China 12,000-20,000 RMB/month - Adults '
'or Kids - Free Housing & Airfare - Free TEFL TESOL '
'Certification - Where You Want - YOUR NEEDS ARE OUR TOP '
'PRIORITY ❤ ❤ ❤'}
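If you want the list-of-lists format from the question, the same loop can collect the values instead of printing them (a sketch, assuming the soup object from the question; the keyword filter is only illustrative):

keywords = ('university', 'college')
ads = []
for dd in soup.find_all('dd'):
    if dd.a is None:  # skip the empty <dd></dd> placeholders
        continue
    link = dd.a.get('href')
    link_text = dd.a.text
    *_, date_text = dd.stripped_strings
    if any(term in link_text.lower() for term in keywords):
        ads.append([link, link_text, date_text])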
Answer 1 (Score: 0)
You can use contents:
import bs4
soup = bs4.BeautifulSoup('<dl> .... </dl>') # your markup
print(soup.br.contents[0])
Gives:
Webi English Shanghai -- Tuesday, 7 March 2017, at 2:17 p.m.
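If your parser treats <br> as an empty (void) element, contents may come back empty; in that case one alternative (a sketch, not part of the original answer; markup here stands for the <dl> snippet from the question) is to read the text node that follows the <br> with next_sibling:

import bs4

soup = bs4.BeautifulSoup(markup, 'html.parser')  # markup = the <dl> ... </dl> snippet above
# the text after <br> is its next sibling node
date_text = soup.br.next_sibling.strip()
print(date_text)  # Webi English Shanghai -- Tuesday, 7 March 2017, at 2:17 p.m.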