Question

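I am very keen on programming and OO programming. That said, I am currently trying to write a very simple spider for web crawling. This is my first approach:

I need to fetch data from this page: http://europa.eu/youth/volunteering/evs-organisation_en

First, I looked at the HTML elements in the page source: view-source:https://europa.eu/youth/volunteering/evs-organisation_en

Note: I need to fetch the data below this line:

EVS accredited organisations search results: 6066

I chose BeautifulSoup for this job because it is very powerful:

I use find_all:

soup.find_all('p')[0].get_text() # text of the first <p> tag on the page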

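Searching for tags by class and id

Note: CSS uses classes and IDs to decide which HTML elements a style applies to. We can use the same classes and IDs when scraping to specify exactly which elements we want to extract.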
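To make the note above concrete, here is a minimal sketch of selecting by class and by id with BeautifulSoup. The HTML string and the id `results` are made up for illustration and are not taken from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this illustration only.
html = """
<div id="results">
  <p class="ey_info">Sending</p>
  <p class="ey_info">Receiving</p>
  <p>PIC no: 123</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# class_ (with a trailing underscore) filters by CSS class
info_texts = [p.get_text(strip=True) for p in soup.find_all("p", class_="ey_info")]

# id= filters by the id attribute; find returns the first match or None
results_div = soup.find(id="results")
all_p_count = len(results_div.find_all("p"))

print(info_texts)
print(all_p_count)
```

The trailing underscore in `class_` is needed because `class` is a reserved word in Python.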

See the class:

                  <div class="col-md-4">
            <div class="vp ey_block block-is-flex">
  <div class="ey_inner_block">
    <h4 class="text-center"><a href="/youth/volunteering/organisation/935175449_en" target="_blank">&quot;People need people&quot; Zaporizhya oblast civic organisation of disabled families</a></h4>
            <p class="ey_info">
    <i class="fa fa-location-arrow fa-lg"></i>
    Zaporizhzhya, <strong>Ukraine</strong>
</p>    <p class="ey_info"><i class="fa fa-hand-o-right fa-lg"></i> Sending</p>
                  <p><strong>PIC no:</strong> 935175449</p>
        <div class="empty-block">
      <a href="/youth/volunteering/organisation/935175449_en" target="_blank" class="ey_btn btn btn-default pull-right">Read more</a>    </div>
  </div>

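So this leads to:

# import libraries (requests is what the code below actually calls, not urllib2)
import requests
from bs4 import BeautifulSoup
# download the page and parse it with the built-in HTML parser
page = requests.get("https://europa.eu/youth/volunteering/evs-organisation_en")
soup = BeautifulSoup(page.content, 'html.parser')
soup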

Now we can use the find_all method to search for items by class or ID. In the example below, we search for every element that has the class col-md-4:

<div class="col-md-4">

So we select:

soup.find_all(class_="col-md-4")

Now I have to combine all of this.
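One way these pieces might be combined is sketched below. It parses the single result card quoted earlier (abridged, with the "Read more" button left out) instead of the live page, so it runs without network access. The helper name `parse_card` and the field order are assumptions made for this sketch, and the live page may differ from the quoted snippet.

```python
from bs4 import BeautifulSoup

# One result card, abridged from the page source quoted in the question.
html = """
<div class="col-md-4">
  <div class="vp ey_block block-is-flex">
    <div class="ey_inner_block">
      <h4 class="text-center"><a href="/youth/volunteering/organisation/935175449_en" target="_blank">&quot;People need people&quot; Zaporizhya oblast civic organisation of disabled families</a></h4>
      <p class="ey_info">
        <i class="fa fa-location-arrow fa-lg"></i>
        Zaporizhzhya, <strong>Ukraine</strong>
      </p>
      <p class="ey_info"><i class="fa fa-hand-o-right fa-lg"></i> Sending</p>
      <p><strong>PIC no:</strong> 935175449</p>
    </div>
  </div>
</div>
"""

def parse_card(card):
    """Return [name, link, location, role, pic] for one col-md-4 block (hypothetical helper)."""
    link = card.find("h4").find("a")
    info = card.find_all("p", class_="ey_info")
    # The PIC paragraph has no class, so locate it by its label text
    pic_p = next(p for p in card.find_all("p") if "PIC no:" in p.get_text())
    return [
        link.get_text(strip=True),
        link["href"],
        " ".join(info[0].get_text().split()),  # collapse line breaks and indentation
        info[1].get_text(strip=True),
        pic_p.get_text().replace("PIC no:", "").strip(),
    ]

soup = BeautifulSoup(html, "html.parser")
rows = [parse_card(card) for card in soup.find_all(class_="col-md-4")]
print(rows)
```

Each card becomes one list, so `rows` is a list of lists that can later be written out row by row.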

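Update: my approach so far:

I used BeautifulSoup4 to extract data that sits inside several HTML tags on the web page. I want to store all the extracted data in a list. More specifically, I want each extracted value to be a separate list element, separated by commas (i.e. CSV formatted).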
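As a rough sketch of the CSV goal described above, the standard-library csv module can turn a list of lists into comma-separated lines and takes care of quoting values that themselves contain commas. The rows and column names here are placeholders, not real data from the page.

```python
import csv
import io

# Hypothetical rows, standing in for the data extracted from the page.
rows = [
    ["Data 1", "Data 2", "Data 3"],
    ["Name, with comma", "Sending", "935175449"],
]

# Write to an in-memory buffer here; open("out.csv", "w", newline="")
# could be used instead to write to a file (the file name is an assumption)
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["name", "role", "pic"])  # hypothetical column names
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

Using csv.writer instead of joining strings with "," by hand avoids broken rows when an organisation name contains a comma.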

Starting from the beginning:

Here is the structure of the HTML content:

 <div class="view-content">
            <div class="row is-flex"></span>
                 <div class="col-md-4"></span>
            <div class </span>
  <div class= >
    <h4 Data 1 </span>
          <div class= Data 2</span>
            <p class=
    <i class=
     <strong>Data 3 </span>
</p>    <p class= Data 4 </span>
          <p class= Data 5 </span>
                  <p><strong>Data 6</span>
        <div class=</span>
      <a href="Data 7</span>
  </div>
</div>

Code to extract:

for data in elem.find_all('span', class_=""):

This should give an output:

data = [ele.text for ele in soup.find_all('span', {'class':'NormalTextrun'})]
print(data)

Output: ['Data 1', 'Data 2', 'Data 3', etc.]

Question: I need help with the extraction part...

Answer 1

Try

# find_all(string=True) returns every text node; keep only the non-blank ones
data = [str(ele) for ele in soup.find_all(string=True) if ele.strip() != '']
print(data)
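A variation on the same idea, shown only as a sketch: BeautifulSoup's stripped_strings generator yields the non-blank text nodes with whitespace already trimmed, and applying it per col-md-4 block gives one list per organisation instead of one flat list for the whole page. The HTML fragment below is hypothetical and only shaped like a result card.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment shaped like one result card on the page.
html = """
<div class="col-md-4">
  <h4><a href="/org/1_en">Example organisation</a></h4>
  <p class="ey_info"> Sending</p>
  <p><strong>PIC no:</strong> 123</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# stripped_strings yields each text node with surrounding whitespace removed,
# skipping nodes that contain only whitespace
data = [list(card.stripped_strings) for card in soup.find_all(class_="col-md-4")]
print(data)
```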
