Data scraping using Python and bs4

Time: 2018-05-29 18:10:04

Tags: html python-3.x beautifulsoup

I want to scrape the text of the h2 whose class is "top-sec-title", together with the href of the enclosing a. The example below is what I worked on previously. There the HTML had a class on the h3 tag, which helped me get the text of the child element in this case: <a class="gs-c-promo-heading gs-o-faux-block-link__overlay-link gel-pica-bold nw-o-link-split__anchor" href="/news/world-us-canada-44294366"> <h3 class="gs-c-promo-heading__title gel-pica-bold nw-o-link-split__text"> Hurricane Maria 'killed 4,600 in Puerto Rico' </h3> </a>

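# Match the anchor by its full class string (must equal the attribute exactly)
news = soup.find_all('a', attrs={'class': 'gs-c-promo-heading gs-o-faux-block-link__overlay-link gel-pica-bold nw-o-link-split__anchor'})

for item in news:
    print(item.get('href'))  # the attribute name must be a quoted string
    print(item.text)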

The code above is what I used to extract data from that HTML source.

2 Answers:

Answer 0: (score: 1)

This gives you every element that encloses an h2 with that class. If the enclosing element is an a, you can then read its href.

lst_of_h2 = soup.find_all('h2', {'class': 'top-sec-title'})
for h2 in lst_of_h2:
    h2.parent # enclosing element
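
A minimal sketch of the step described above, assuming the h2 sits directly inside the a as in the question's markup. The sample HTML, its href value, and the `links` name are made up for illustration, and `html.parser` is used only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Sample markup modeled on the question (hypothetical, for illustration only).
html = '''
<a href="/news/example.html">
    <h2 class="top-sec-title">Example headline</h2>
</a>
<div><h2 class="top-sec-title">Headline without a link</h2></div>
'''
soup = BeautifulSoup(html, 'html.parser')

links = []
for h2 in soup.find_all('h2', {'class': 'top-sec-title'}):
    parent = h2.parent  # enclosing element
    # Only an <a> parent carries the href we are after.
    if parent is not None and parent.name == 'a':
        links.append((parent.get('href'), h2.get_text(strip=True)))

print(links)
```

The second h2 is skipped because its parent is a div, not an a.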

Answer 1: (score: 0)

Code:

from bs4 import BeautifulSoup

html = '''
<a href="/news/2018/05/israeli-army-projectiles-fired-israel-gaza-180529051139606.html">
    <h2 class="top-sec-title">
        Israel launches counterattacks in Gaza amid soaring tensions
    </h2>
</a>
'''
soup = BeautifulSoup(html, 'lxml')

a_tags = [h.parent for h in soup.select('.top-sec-title')]

for a in a_tags:
    print(a['href'])
    print(a.get_text(strip=True))
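
If the h2 might not be a direct child of the a, a variation along these lines could be used. This is a sketch under assumptions: `extract_links` and the `sample` markup are hypothetical names introduced here, and `html.parser` stands in for `lxml` so that no extra package is needed.

```python
from bs4 import BeautifulSoup

def extract_links(html):
    """Return (href, title) pairs for each h2.top-sec-title inside a link."""
    soup = BeautifulSoup(html, 'html.parser')
    results = []
    for h2 in soup.select('h2.top-sec-title'):
        # find_parent walks up the tree, so the <a> need not be the direct parent.
        a = h2.find_parent('a')
        if a is not None and a.has_attr('href'):
            results.append((a['href'], h2.get_text(strip=True)))
    return results

# Hypothetical markup with an extra wrapper between the <a> and the <h2>.
sample = '<a href="/news/example.html"><span><h2 class="top-sec-title"> Example headline </h2></span></a>'
print(extract_links(sample))
```

Headings with no enclosing link are silently skipped rather than raising an error.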