Extracting text from an HTML document

Date: 2016-07-21 10:03:22

Tags: python html beautifulsoup

I want to get some information from the Kickstarter website. The information is structured, and the markup for every Kickstarter project looks the same:

<div class="project-card-content">
<h6 class="project-title"><a data-pid="714867756" data-score="null" data-version="null" href="/projects/massoudhassani/mine-kafon-drone?ref=category_recommended" target="">Mine Kafon Drone</a></h6> <p class="project-byline">Massoud Hassani</p>
<p class="project-blurb">
Introducing the Mine Kafon Drone, an airborne demining system  developed to clear all land mines around the world in less than 10 years
</p>
</div>

From <div class="project-card-content"> I need the following three strings. For example:

  1. Mine Kafon Drone
  2. Massoud Hassani
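  3. Introducing the Mine Kafon Drone, an airborne demining system developed to clear all land mines around the world in less than 10 years

For the first result, I used this code in Python: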

    import urllib.request
    from bs4 import BeautifulSoup

    theurl = "https://www.kickstarter.com/discover/advanced?category_id=16&woe_id=23424829&sort=popularity&seed=2448324&page=1"
    thepage = urllib.request.urlopen(theurl)
    soup = BeautifulSoup(thepage, "html.parser")

    # Find the first project card, then collect the <a> tags inside it
    project1 = soup.find('div', {'class': 'project-card-content'}).findChildren('a')
    print(project1)

The result is:

    [<a data-pid="714867756" data-score="null" data-version="null" href="/projects/massoudhassani/mine-kafon-drone?ref=category_recommended" target="">Mine Kafon Drone</a>]
    

But I only want the string "Mine Kafon Drone".

1 answer:

Answer 0 (score: 1)

Just take the text from the first "a" tag you have already found.

text = project1[0].text
print(text)

The result will be:

Mine Kafon Drone

To get the data from every project:

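data = []
# find_all returns every matching card; find would return only the first
for div in soup.find_all('div', class_='project-card-content'):
    # the title is an <h6 class="project-title">, not a <div>
    data.append(div.find('h6', class_='project-title').text)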
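The same loop can be extended to collect all three strings the question asks for. The following is only a minimal sketch: it parses the sample markup from the question (attributes trimmed) instead of fetching the live page, whose structure may have changed, and the function name `extract_projects` and the dictionary keys are hypothetical names chosen for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup based on the question; the live Kickstarter page may differ.
html = """
<div class="project-card-content">
<h6 class="project-title"><a data-pid="714867756" href="/projects/massoudhassani/mine-kafon-drone?ref=category_recommended" target="">Mine Kafon Drone</a></h6> <p class="project-byline">Massoud Hassani</p>
<p class="project-blurb">
Introducing the Mine Kafon Drone, an airborne demining system  developed to clear all land mines around the world in less than 10 years
</p>
</div>
"""


def extract_projects(markup):
    """Return one dict per project card (hypothetical helper)."""
    soup = BeautifulSoup(markup, "html.parser")
    projects = []
    for card in soup.find_all("div", class_="project-card-content"):
        title = card.find("h6", class_="project-title")
        byline = card.find("p", class_="project-byline")
        blurb = card.find("p", class_="project-blurb")
        # Skip cards that lack any of the three expected elements
        if not (title and byline and blurb):
            continue
        projects.append({
            "title": title.get_text(strip=True),
            "creator": byline.get_text(strip=True),
            # Collapse newlines and repeated spaces inside the blurb
            "blurb": " ".join(blurb.get_text().split()),
        })
    return projects


projects = extract_projects(html)
for project in projects:
    print(project["title"])
    print(project["creator"])
    print(project["blurb"])
```

Using `get_text(strip=True)` rather than `.text` trims the surrounding whitespace, which matters for the blurb because its text sits on its own line inside the `<p>` tag.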