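Extracting text from an lxml document parsed with BeautifulSoup

Time: 2020-10-01 10:26:23

Tags: python beautifulsoup

How can I extract the text from this lxml document, starting at <div class="ember-view" id="ember760">? Please help. I tried the code below, but it does not capture the text.

Code I tried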

# soup is a BeautifulSoup object

exp = soup.find('header', {'class': 'pv-profile-section__card-header'})
exp

The lxml document

<div class="pv-recommendation-entity__highlights">
<blockquote class="pv-recommendation-entity__text relative">
<div class="ember-view" id="ember760"> <span class="lt-line-clamp__line">I know Abc from Data Analysis training sessions with abc,</span>
<span class="lt-line-clamp__line">Abc
is an enthusiastic candidature in training sessions. He is an</span>
<span class="lt-line-clamp__line">extremely capable and dedicated entry-level Data Science Analyst.</span>
<span class="lt-line-clamp__line">He is enhancing Analytics skills by his enthusiasm for learning new</span>
<span class="lt-line-clamp__line lt-line-clamp__line--last">
      things, and has learnt new tools like R, SPSS, and Pytho<span class="lt-line-clamp__ellipsis">...
            <a aria-expanded="false" class="lt-line-clamp__more" data-test-line-clamp-show-more-button="true" href="#" id="line-clamp-show-more-button" role="button">See more</a>
</span></span>
<!-- --><span class="lt-line-clamp__ellipsis lt-line-clamp__ellipsis--dummy">... <a class="lt-line-clamp__more" href="#" role="button">See more</a></span></div>
</blockquote>
</div>
</li>
</ul>
<!-- --></div>
</div></div>

Expected output

I know Abc from Data Analysis training sessions with abc,
is an enthusiastic candidature in training sessions. He is an
extremely capable and dedicated entry-level Data Science Analyst.
He is enhancing Analytics skills by his enthusiasm for learning new
      things, and has learnt new tools like R, SPSS, and Pytho

2 answers:

Answer 0 (score: 2)

You can select <div class="ember-view" id="ember760"> with the CSS selector div#ember760 and then call the .get_text() method on it:

from bs4 import BeautifulSoup


txt = '''
<div class="pv-recommendation-entity__highlights">
<blockquote class="pv-recommendation-entity__text relative">
<div class="ember-view" id="ember760"> <span class="lt-line-clamp__line">I know Abc from Data Analysis training sessions with abc,</span>
<span class="lt-line-clamp__line">Abc
is an enthusiastic candidature in training sessions. He is an</span>
<span class="lt-line-clamp__line">extremely capable and dedicated entry-level Data Science Analyst.</span>
<span class="lt-line-clamp__line">He is enhancing Analytics skills by his enthusiasm for learning new</span>
<span class="lt-line-clamp__line lt-line-clamp__line--last">
      things, and has learnt new tools like R, SPSS, and Pytho<span class="lt-line-clamp__ellipsis">...
            <a aria-expanded="false" class="lt-line-clamp__more" data-test-line-clamp-show-more-button="true" href="#" id="line-clamp-show-more-button" role="button">See more</a>
</span></span>
<!-- --><span class="lt-line-clamp__ellipsis lt-line-clamp__ellipsis--dummy">... <a class="lt-line-clamp__more" href="#" role="button">See more</a></span></div>
</blockquote>
</div>
</li>
</ul>
<!-- --></div>
</div></div>'''

soup = BeautifulSoup(txt, 'lxml')

print(soup.select_one('div#ember760').get_text(strip=True, separator='\n'))

Prints:

I know Abc from Data Analysis training sessions with abc,
Abc
is an enthusiastic candidature in training sessions. He is an
extremely capable and dedicated entry-level Data Science Analyst.
He is enhancing Analytics skills by his enthusiasm for learning new
things, and has learnt new tools like R, SPSS, and Pytho
...
See more
...
See more
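The output above still contains the "..." and "See more" fragments, which the expected output in the question leaves out. One possible way to drop them, sketched here on a shortened, made-up copy of the markup, is to remove the span.lt-line-clamp__ellipsis elements with decompose() before calling get_text(). The sample text and the variable names are assumptions for illustration, and 'html.parser' is used only so the sketch runs without lxml installed; 'lxml' should select the same elements for this fragment.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical copy of the markup from the question
html = '''<div class="ember-view" id="ember760"> <span class="lt-line-clamp__line">first line,</span>
<span class="lt-line-clamp__line lt-line-clamp__line--last">
      last line<span class="lt-line-clamp__ellipsis">...
            <a class="lt-line-clamp__more" href="#" role="button">See more</a>
</span></span>
<!-- --><span class="lt-line-clamp__ellipsis lt-line-clamp__ellipsis--dummy">... <a class="lt-line-clamp__more" href="#" role="button">See more</a></span></div>'''

soup = BeautifulSoup(html, 'html.parser')
div = soup.select_one('div#ember760')

# Remove every ellipsis span (and the "See more" link inside it) from the tree
for ellipsis in div.select('span.lt-line-clamp__ellipsis'):
    ellipsis.decompose()

# Only the visible recommendation text is left
text = div.get_text(strip=True, separator='\n')
print(text)
```

Note that decompose() modifies the parsed tree in place, so this fits best when the ellipsis elements are not needed later.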

Answer 1 (score: 1)

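from bs4 import BeautifulSoup

# html is assumed to hold the markup from the question
soup = BeautifulSoup(html, 'lxml')
lines = soup.select('div.ember-view > span.lt-line-clamp__line')
# take only the first text node directly inside each span, skipping nested tags
text = ''.join([line.find(string=True, recursive=False) for line in lines])
print(text)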

This gives the text:

I know Abc from Data Analysis training sessions with abc,Abc
is an enthusiastic candidature in training sessions. He is anextremely capable and dedicated entry-level Data Science Analyst.He is enhancing Analytics skills by his enthusiasm for learning new
      things, and has learnt new tools like R, SPSS, and Pytho

The "See more" links are ignored.
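Because the fragments are joined with an empty string, the lines in that output run together ("He is anextremely"). A small variation, sketched below on a shortened, made-up copy of the markup, strips each fragment and joins with a newline instead, and skips any span that has no direct text node. The sample text and the names fragments and own_text are assumptions for illustration, and 'html.parser' is used only so the sketch runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical copy of the markup from the question
html = '''<div class="ember-view" id="ember760"> <span class="lt-line-clamp__line">first line,</span>
<span class="lt-line-clamp__line">second line</span>
<span class="lt-line-clamp__line lt-line-clamp__line--last">
      last line<span class="lt-line-clamp__ellipsis">... <a href="#">See more</a></span></span></div>'''

soup = BeautifulSoup(html, 'html.parser')
lines = soup.select('div.ember-view > span.lt-line-clamp__line')

fragments = []
for line in lines:
    # First text node that is a direct child of the span (nested tags are skipped)
    own_text = line.find(string=True, recursive=False)
    if own_text is not None:
        fragments.append(own_text.strip())

# One fragment per output line
text = '\n'.join(fragments)
print(text)
```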
