How do I extract text from span tags using Beautiful Soup 4?

Time: 2016-05-19 06:39:37

Tags: python-2.7 beautifulsoup scraper

How can I scrape the text inside span tags using Beautiful Soup? I am trying to scrape faculty members' information.

from bs4 import BeautifulSoup
import requests
r = requests.get("http://www.uoj.ac.ae/ContentBan.aspx?m=15&p=4&sm=4")
soup = BeautifulSoup(r.content, 'html5lib')
for tag in soup.find_all('table'):
    if tag.has_attr("class"):
        if tag['class'] == 'MsoTableGrid':
            for tag1 in soup.findAll('span'):
                print tag1.text

I want to print the text inside the span tags, but the only output I get is:

 Process finished with exit code 0

2 Answers:

Answer 0: (score: 1)

You can use a CSS selector to find the tr elements of the table with class MsoTableGrid, then get the information you need, such as the faculty member's name and e-mail address, from the columns of each row, for example:

    >>> rows = soup.select("table.MsoTableGrid tr")
    >>> for r in rows:
    ...     faculty_info = r.find_all("td")[1:3]
    ...     if len(faculty_info) == 2:
    ...         print faculty_info[0].text.strip(), faculty_info[1].text.strip()
    ...
    Name E-mail
    Dr. Hassan Ali Dabouq dr.hassandbouk@uoj.ac.ae
    Prof.dr.Magdie Medhat Elnahry magdielnahry@uoj.ac.ae
    Dr. Abd Elwahaab Mohamed Khalil abdelwahab@uoj.ac.ae
    Dr. Ahmed Hassan Fouly Dr.ahmedfoly@uoj.ac.ae
    Dr. Walid Mohamed Abbas walidabas@uoj.ac.ae
    Dr. Wael Mahmoud Fakhry wfakhry@uoj.ac.ae
    Dr. Kamel Abd Elaziz Ali kamelali@uoj.ac.ae
    .
    .
    .
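The row-and-column approach in this answer can be tried offline with a minimal sketch like the one below. The markup in `SAMPLE_HTML`, the helper name `extract_faculty`, and the example names and addresses are all hypothetical stand-ins; the real page may be structured differently. The built-in `html.parser` is used here only so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the faculty page; the real markup may differ.
SAMPLE_HTML = """
<table class="MsoTableGrid">
  <tr><td>No.</td><td><span>Name</span></td><td><span>E-mail</span></td></tr>
  <tr><td>1</td><td><span>Dr. A Example</span></td><td><span>a@example.org</span></td></tr>
  <tr><td>2</td><td><span>Dr. B Example</span></td><td><span>b@example.org</span></td></tr>
  <tr><td colspan="3">footer</td></tr>
</table>
<table class="Other"><tr><td>x</td><td>y</td><td>z</td></tr></table>
"""


def extract_faculty(html):
    """Return (name, email) pairs from the rows of table.MsoTableGrid."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for row in soup.select("table.MsoTableGrid tr"):
        cells = row.find_all("td")[1:3]  # second and third columns only
        if len(cells) == 2:              # skip rows that lack both columns
            pairs.append((cells[0].get_text(strip=True),
                          cells[1].get_text(strip=True)))
    return pairs


for name, email in extract_faculty(SAMPLE_HTML):
    print("%s %s" % (name, email))
```

Returning a list of tuples instead of printing directly keeps the sketch usable on both Python 2.7 and Python 3 and makes the result easy to check.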

Answer 1: (score: 0)

If you want to extract the text of every span tag regardless of class name, try the following:

from bs4 import BeautifulSoup
import requests
r = requests.get("http://www.uoj.ac.ae/ContentBan.aspx?m=15&p=4&sm=4")
soup = BeautifulSoup(r.content, 'lxml')
span_text = soup.findAll('span')
for s in span_text:
    print(s.text)
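If the spans should instead be limited to the MsoTableGrid table, as the question attempts, one likely reason the original loop prints nothing is that Beautiful Soup (with default settings) treats `class` as a multi-valued attribute: `tag['class']` is a list such as `['MsoTableGrid']`, so comparing it with a string is always false. The sketch below tests membership instead and searches inside the matching table rather than the whole document. `SAMPLE_HTML` and the helper `spans_in_table` are hypothetical names used only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
SAMPLE_HTML = """
<p><span>outside</span></p>
<table class="MsoTableGrid">
  <tr><td><span>inside one</span></td><td><span>inside two</span></td></tr>
</table>
"""


def spans_in_table(html, class_name="MsoTableGrid"):
    """Return the text of every span inside tables carrying class_name."""
    soup = BeautifulSoup(html, "html.parser")
    texts = []
    for table in soup.find_all("table"):
        # table.get("class") is a list (or None when the attribute is absent),
        # so test membership rather than equality with a string.
        if class_name in (table.get("class") or []):
            # Search within this table, not the whole document.
            for span in table.find_all("span"):
                texts.append(span.get_text(strip=True))
    return texts


for text in spans_in_table(SAMPLE_HTML):
    print(text)
```

The span outside the table is skipped because `find_all` is called on the table tag, not on `soup`.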