BeautifulSoup - trouble scraping a datalist with links

Date: 2017-03-11 16:16:26

Tags: python beautifulsoup

Here is an example of the HTML I am scraping with Python / BeautifulSoup:

<dl>
<dd>
    <strong>
        <a name="45790" href="http://www.eslcafe.com/jobs/china/index.cgi?read=45790">Monthly 18000rmb ESL teachers for Shanghai Webi centers</a>
    </strong>
    <br>
    Webi English Shanghai -- Tuesday, 7 March 2017, at 2:17 p.m.
</dd>

<dd></dd>
<dd></dd>
<dd></dd>
</dl>

I am able to grab the <a href>, but I cannot get the text after the <br>, even though I have tried different loops.

Here is my program:

import bs4 as bs
import urllib.request

# Fetch the jobs board and parse it
sauce = urllib.request.urlopen('http://www.eslcafe.com/jobs/china/').read()

soup = bs.BeautifulSoup(sauce, 'html.parser')

dl = soup.dl

ads = []

# Keep only ads whose link text mentions a university or college
for words in dl.find_all('a'):
    links = words.get('href')
    link_text = words.text
    link_text = link_text.lower()

    if 'university' in link_text:
        ads.append([links, link_text])

    if 'universities' in link_text:
        ads.append([links, link_text])

    if 'college' in link_text:
        ads.append([links, link_text])

    if 'colleges' in link_text:
        ads.append([links, link_text])

for ad in ads:
    for job in ad:
        print(job)
        print("")

There is also a problem with duplicates being appended to the list when the text contains more than one search term, but I can deal with that later.
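I know the four if blocks could probably be collapsed into a single keyword check so that each ad is appended at most once, something like this (an untested sketch reusing dl and ads from the script above), but that is not the main problem here:

keywords = ('university', 'universities', 'college', 'colleges')

for words in dl.find_all('a'):
    link = words.get('href')
    link_text = words.text.lower()
    # any() appends an ad once even if several keywords match its text
    if any(keyword in link_text for keyword in keywords):
        ads.append([link, link_text])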

What I think I want is a list of lists containing link, link_text, and date_text:

ads = [[link, link_text, date_text], [link, link_text, date_text]]

Right now, I can only get the link and the link_text.

Any suggestions?

2 answers:

Answer 0 (score: 0)

In [31]: for dd in soup.find_all('dd'):
    ...:     link = dd.a.get('href')
    ...:     link_text = dd.a.text
    ...:     *_, dd_text = dd.stripped_strings

Out:

http://www.eslcafe.com/jobs/china/index.cgi?read=45391
Teach English in Shenyang, China: Great salary, Support, and Structured program
Greenheart Travel -- Thursday, 9 February 2017, at 1:05 p.m.

dd_text is the last text node of the dd tag, so I use *_ to capture all of the text nodes that come before it.
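The same star unpacking works on any iterable; here it is on a plain list mirroring the two text nodes in one of your dd tags:

strings = ['Monthly 18000rmb ESL teachers for Shanghai Webi centers',
           'Webi English Shanghai -- Tuesday, 7 March 2017, at 2:17 p.m.']

# *_ swallows everything before the last item; dd_text gets the last one
*_, dd_text = strings
print(dd_text)  # Webi English Shanghai -- Tuesday, 7 March 2017, at 2:17 p.m.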

Edit:

In [20]: for dd in soup.find_all('dd'):
    ...:     
    ...:     d = {} # store data in a dict
    ...:     d['link'] = dd.a.get('href')
    ...:     d['link_text'] = dd.a.text
    ...:     *_, dd_text = dd.stripped_strings
    ...:     d['date_text'] = dd_text
    ...:     print(d)

Out:

{'date_text': 'EnglishTeacherChina.com -- Sunday, 12 February 2017, at 1:45 '
              'p.m.',
 'link': 'http://www.eslcafe.com/jobs/china/index.cgi?read=45426',
 'link_text': '❤ ❤ ❤ Teach English In China 12,000-20,000 RMB/month - Adults '
              'or Kids - Free Housing & Airfare - Free TEFL TESOL '
              'Certification - Where You Want - YOUR NEEDS ARE OUR TOP '
              'PRIORITY ❤ ❤ ❤'}
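
If you would rather have the list-of-lists layout from your question than dicts, you can collect the same three fields into ads inside the loop (a sketch using the same soup; the empty <dd></dd> placeholders from your sample markup are skipped explicitly):

ads = []
for dd in soup.find_all('dd'):
    if dd.a is None:                      # skip the empty <dd></dd> entries
        continue
    *_, date_text = dd.stripped_strings   # the last text node is the date line
    ads.append([dd.a.get('href'), dd.a.text, date_text])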

Answer 1 (score: 0)

You can use contents:

import bs4
soup = bs4.BeautifulSoup('<dl> .... </dl>') # your markup  
print(soup.br.contents[0])

Which gives:

Webi English Shanghai -- Tuesday, 7 March 2017, at 2:17 p.m.
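
Note that some parsers treat <br> as an empty (void) tag, in which case contents is empty and the date text ends up as a sibling of the <br> instead; if you hit that, next_sibling gets it (a sketch against the sample markup from the question, using html.parser):

import bs4

html = '''<dl><dd><strong>
<a name="45790" href="http://www.eslcafe.com/jobs/china/index.cgi?read=45790">Monthly 18000rmb ESL teachers for Shanghai Webi centers</a>
</strong><br>
Webi English Shanghai -- Tuesday, 7 March 2017, at 2:17 p.m.
</dd></dl>'''

soup = bs4.BeautifulSoup(html, 'html.parser')
# With html.parser, <br> has no children, so the date is its next sibling
print(soup.br.next_sibling.strip())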