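Python: pulling out bold text and the text that follows it

Time: 2016-06-10 08:24:52

Tags: python html-parsing lxml

Using the HTML below, I want to pull out two pieces of data and add them to a list in Python. Each bold text is a horse's name, and it is followed by a comment about that horse.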

<div id="ANALYSIS" class="tabContent tabSelected">A weak handicap that looked wide open.
  <br>
  <br> <b class="black">LADY MAKFI</b> showed vastly improved form to shed her maiden tag on this seasonal debut for a new yard. The filly offered little for Tony Martin last year, but did show some ability on her debut and is evidently capable when fresh.
  She saw it out well and it´ll be interesting to see how she copes with a rise.
  <br>
  <br> <b class="black">Weardiditallgorong</b> went down fighting over this longer trip and probably improved again on her last-time-out second at Bath. This was her best effort yet on the AW.
  <br>
  <br> <b class="black">Chauvelin</b>, in second-time blinkers, turned in his most encouraging effort for some time and is certainly well treated on his best form.
  <br>
  <br> <b class="black">Happy Jack</b> not for the first time travelled easily until making heavy weather of it when asked for his effort. [David Orton]
  <br>
  <br>
  <div id="resultRaceReport" class="hide"></div>
</div>

From the HTML above, I would like the output to look like the following:

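[LADY MAKFI, showed vastly improved form to shed her maiden tag on this seasonal debut for a new yard. The filly offered little for Tony Martin last year, but did show some ability on her debut and is evidently capable when fresh. She saw it out well and it´ll be interesting to see how she copes with a rise.]

[Weardiditallgorong, went down fighting over this longer trip and probably improved again on her last-time-out second at Bath. This was her best effort yet on the AW.]

[Chauvelin, in second-time blinkers, turned in his most encouraging effort for some time and is certainly well treated on his best form.]

[Happy Jack, not for the first time travelled easily until making heavy weather of it when asked for his effort. [David Orton]]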

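But I am just not sure how to get the desired output (mostly the logic behind it).

I am currently using lxml to scrape the content, and I need to match the bold text (the horse name) against my table so that I can add the comment (the text after the bold) to my database.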

2 answers:

Answer 0: (score: 2)

Using lxml, printing each bold tag's text and then the first text node that follows it gives you:

LADY MAKFI
[u' showed vastly improved form to shed her maiden tag on this seasonal debut for a new yard. The filly offered little for Tony Martin last year, but did show some ability on her debut and is evidently capable when fresh. She saw it out well and it\xc2\xb4ll be interesting to see how she copes with a rise.']
Weardiditallgorong
[' went down fighting over this longer trip and probably improved again on her last-time-out second at Bath. This was her best effort yet on the AW.']
Chauvelin
[', in second-time blinkers, turned in his most encouraging effort for some time and is certainly well treated on his best form.']
Happy Jack
[' not for the first time travelled easily until making heavy weather of it when asked for his effort. [David Orton]']

If you want each name and its comment joined together in a single list:

from lxml import html

# h holds the HTML from the question as a string
x = html.fromstring(h)
div = x.xpath("//*[@id='ANALYSIS']")[0]
# Join each bold name to the first text node after it; lstrip(",") avoids a doubled comma
out = [b.text + "," + b.xpath("./following::text()[1]")[0].lstrip(",")
       for b in div.xpath(".//b[@class='black']")]

This gives you:

[u'LADY MAKFI, showed vastly improved form to shed her maiden tag on this seasonal debut for a new yard. The filly offered little for Tony Martin last year, but did show some ability on her debut and is evidently capable when fresh. She saw it out well and it\xc2\xb4ll be interesting to see how she copes with a rise.',
 'Weardiditallgorong, went down fighting over this longer trip and probably improved again on her last-time-out second at Bath. This was her best effort yet on the AW.',
 'Chauvelin, in second-time blinkers, turned in his most encouraging effort for some time and is certainly well treated on his best form.',
 'Happy Jack, not for the first time travelled easily until making heavy weather of it when asked for his effort. [David Orton]']
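
Since the question is about matching each horse name against a table, a dictionary keyed by name may be easier to work with than joined strings. The following is only a rough sketch under a few assumptions: SAMPLE is a shortened stand-in for the question's markup, comments_by_horse is a hypothetical helper name, and it reads lxml's .tail attribute (the text between a closing tag and the next element) instead of the following::text()[1] XPath used in this answer.

```python
from lxml import html

# Shortened stand-in for the question's markup (an assumption for this sketch).
SAMPLE = """
<div id="ANALYSIS" class="tabContent tabSelected">A weak handicap that looked wide open.
  <br>
  <br> <b class="black">LADY MAKFI</b> showed vastly improved form.
  She saw it out well.
  <br>
  <br> <b class="black">Chauvelin</b>, in second-time blinkers, turned in his most encouraging effort.
  <br>
</div>
"""


def comments_by_horse(markup):
    """Map each bold horse name to the text that follows its <b> tag."""
    root = html.fromstring(markup)
    result = {}
    for b in root.xpath("//*[@id='ANALYSIS']//b[@class='black']"):
        # b.tail is the text between </b> and the next element (here a <br>).
        tail = b.tail or ""
        # Collapse line breaks and runs of spaces, then drop a leading comma.
        comment = " ".join(tail.split()).lstrip(",").strip()
        result[b.text.strip()] = comment
    return result


comments = comments_by_horse(SAMPLE)
for name, comment in comments.items():
    print(name, "->", comment)
```

Looking up a name from your own table is then a plain dictionary access such as comments.get("Chauvelin"), which returns None when the horse has no comment in the report.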

Answer 1: (score: 1)

I prefer Beautiful Soup's API to using lxml directly. I can avoid XPath entirely and just write Python.

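import bs4

# document holds the HTML from the question as a string
soup = bs4.BeautifulSoup(document, 'lxml')
# Bold text plus the text node right after it, with trailing whitespace removed
[b.text + b.next_sibling.rstrip() for b in soup.find_all('b')]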

Output:

['LADY MAKFI showed vastly improved form to shed her maiden tag on this seasonal debut for a new yard. The filly offered little for Tony Martin last year, but did show some ability on her debut and is evidently capable when fresh.\n  She saw it out well and it´ll be interesting to see how she copes with a rise.',
 'Weardiditallgorong went down fighting over this longer trip and probably improved again on her last-time-out second at Bath. This was her best effort yet on the AW.',
 'Chauvelin, in second-time blinkers, turned in his most encouraging effort for some time and is certainly well treated on his best form.',
 'Happy Jack not for the first time travelled easily until making heavy weather of it when asked for his effort. [David Orton]']
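
One thing to watch with next_sibling is that it can be None or another tag when no text directly follows the bold element, in which case calling rstrip() on it would fail. Below is a cautious sketch of one way to guard against that. It assumes a shortened SAMPLE in place of the real page, a hypothetical helper named horse_comments, and the standard-library "html.parser" backend rather than 'lxml' so that only bs4 is needed.

```python
from bs4 import BeautifulSoup, NavigableString

# Shortened stand-in for the question's markup (an assumption for this sketch).
# The second <b> is deliberately followed by a tag instead of text.
SAMPLE = """
<div id="ANALYSIS">A weak handicap that looked wide open.
  <br> <b class="black">LADY MAKFI</b> showed vastly improved form.
  She saw it out well.
  <br> <b class="black">Happy Jack</b><br>
</div>
"""


def horse_comments(markup):
    """Return (name, comment) pairs, one per bold horse name."""
    soup = BeautifulSoup(markup, "html.parser")
    pairs = []
    for b in soup.find_all("b", class_="black"):
        sibling = b.next_sibling
        # Keep the sibling only when it is a plain text node.
        text = sibling if isinstance(sibling, NavigableString) else ""
        # Collapse whitespace and drop a leading comma, if any.
        comment = " ".join(text.split()).lstrip(",").strip()
        pairs.append((b.get_text(strip=True), comment))
    return pairs


pairs = horse_comments(SAMPLE)
for name, comment in pairs:
    print(name, "->", repr(comment))
```

Returning pairs rather than concatenated strings keeps the name separate from the comment, which should make it simpler to match the name against a table before storing the comment.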