Extract <pre> tag using BeautifulSoup

Date: 2016-10-24 22:01:48

Tags: python-3.x beautifulsoup

I am trying to extract just part of what sits between a pair of tags on a fairly simple webpage.

The page is http://bridge.no/var/ruter/html/0237/2016-10-18.htm and I only want the first "table" inside the <pre> tags. So far I have just this:

import pandas as pd
from bs4 import BeautifulSoup
import urllib.request

adress = 'http://bridge.no/var/ruter/html/0237/2016-10-18.htm'
response = urllib.request.urlopen(adress)
html = response.read()

soup = BeautifulSoup(html)
pre = soup.find_all('pre')

I guess that was the easy part. Even after looking at examples that use BeautifulSoup, I still have no idea how to get just the first table. In the page source I see two possible things to split on: a name='scoretables' attribute, or the long line of dashes, "----------------------------------------------------------------".

After that I want to get that "table" into a pandas DataFrame, but I think I can handle that part. Does anyone have tips on how to use BeautifulSoup to do this?

1 answer:

Answer 0 (score: 0)

To get the text of the first table, you just need to pull the first element from .contents:

soup = BeautifulSoup(html,"lxml")
table = soup.find("pre").contents[0]

Or pull the first text node:

table = soup.find("pre").find(text=True)
print(table)

Both will give you:

13 bord, 25 par, 1 blindpar. Antall spill: 27. Frirunde (*) gir innspilt prosent.

Plass  Par   Poeng       %  Navn                                          MNR      Klubb                        

    1    1    68,0    61,4  Simon Rasmussen - Rolf Normann Hansen     13838  8056  Kolbotn BK - Ski BK          
    2    2    56,0    59,4  Åge Seiersten - Truls Bjerkås             27817 24421  Brandbu BK - Posten BK       
    3   26    52,9 *  58,9  Knut Karlsen - Bjørn Roar Haugen          13153 14791  Kolbotn BK                   
    4   16    48,4 *  58,1  Kåre Bogø - Svein Arild Naas Olsen        29525 14358  Nittedal BK - Kolbotn BK     
    5   23    44,0    57,4  Per Arild Kvist - Raymond Frivåg          18774 11751  Ski BK - Kolbotn BK          

    6   14    34,0    55,7  Øyvind Bronken - Arne Almendingen         18763  4387  Kolbotn BK - Ski BK          
    7    9    32,0    55,4  Karl Johan Bjørn - Trond M. Thorgersen     7964 33905  Kolbotn BK                   
    8   10    26,0    54,4  Arild Basma - Odd Arne Bertheussen        25130 11589  Kirkenes BK - Kolbotn BK     
    9    5    23,0    53,9  Gerd Irene Knutsen - Einar Knutsen        30543 30542  Kolbotn BK                   
   10    3    17,0    52,9  Bjørn Tore Hallén - Sven-Åge Lund         27205 27206  Posten BK                    

         6    17,0    52,9  Hege Johansen - Elisabeth Johansen        32129 32031  Kolbotn BK                   
   12   13    14,0    52,4  Helge Lian - Bård Lian                    38230 31391  Kolbotn BK                   
   13    4    11,0    51,9  Sven Pran - Bjørn Arne Ruud                9704 23063  Kolbotn BK                   
   14   21     2,3 *  50,4  Olav Hjerkinn - Marit Hjerkinn              675 25417  Ski BK - Kolbotn BK          
   15   17    -8,0    48,7  Geir Liabø - Bjarne Erlandsen             32924 11158  Kolbotn BK - Ski BK          

   16   19   -10,0    48,3  Tove Wikerholmen - Per Gunnar E. Frislid  30926 40386  Kolbotn BK                   
   17   20   -13,0    47,8  Arnold Digre - Tim Nørgaard                5683 29591  Kolbotn BK - Bridgekameratene
   18   22   -17,0    47,1  Trond Østlie - Arvid Ek                   33024 32178  Kolbotn BK                   
   19   15   -27,0    45,5  Andreas Jansen - Dag Amund Lie            38503  3636  Kolbotn BK - Ski BK          
   20   11   -39,4 *  43,4  John Sandberg - Berit Hornhammar          18745 25418  Kolbotn BK                   

   21   12   -40,5 *  43,2  Toralf Brandvoll - Else Heldre            35159 30744  Kolbotn BK                   
   22    7   -42,8 *  42,8  Mette Hugin - Liv Kongelf                 41332 37002  Kolbotn BK                   
   23    8   -69,8 *  38,3  Laila K. Siltvedt - Wencke Thorstensen    39025 39026  Kolbotn BK                   
   24   24   -93,4 *  34,3  Harald Molteberg - Terje J. Eriksen       18843 42314  Ski BK - Kolbotn BK          
   25   18  -118,1 *  30,1  Leif Åge Bergseng - Ulf Kopperud          18740 41968  Kolbotn BK - Hjerter Konge  

Getting it into a df will require a regex, since there is no consistent delimiter.
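As a rough sketch of that regex step, the pattern below is one possible way to split each result row into fields. It is written against the layout of the output shown above and assumes that every data row has a pair number, points and percentage with a decimal comma, an optional "*", two member numbers, and a club. The names ROW, to_float, parse_rows and SAMPLE are made up for this illustration, and SAMPLE is just a few rows copied from the output rather than the live page.

```python
import re

# One row of the results listing. The rank ("Plass") is blank when a pair
# ties with the row above, and "*" marks a Frirunde score.
ROW = re.compile(
    r"""^\s*
    (?:(?P<plass>\d+)\s+)?             # rank, missing on tied rows
    (?P<par>\d+)\s+                    # pair number
    (?P<poeng>-?\d+,\d+)\s+            # points, decimal comma
    (?:(?P<frirunde>\*)\s+)?           # optional Frirunde marker
    (?P<prosent>\d+,\d+)\s+            # percentage, decimal comma
    (?P<navn>.+?)\s+                   # player names
    (?P<mnr1>\d+)\s+(?P<mnr2>\d+)\s+   # two member numbers
    (?P<klubb>.+?)\s*$                 # club or clubs
    """,
    re.VERBOSE,
)


def to_float(text):
    """Convert a number written with a decimal comma, e.g. '-118,1'."""
    return float(text.replace(",", "."))


def parse_rows(table_text):
    """Return one dict per data row; lines that do not match are skipped."""
    rows = []
    for line in table_text.splitlines():
        m = ROW.match(line)
        if m is None:
            continue  # summary line, column header, or blank line
        d = m.groupdict()
        rows.append({
            "Plass": int(d["plass"]) if d["plass"] else None,
            "Par": int(d["par"]),
            "Poeng": to_float(d["poeng"]),
            "Frirunde": d["frirunde"] is not None,
            "Prosent": to_float(d["prosent"]),
            "Navn": d["navn"],
            "MNR": (int(d["mnr1"]), int(d["mnr2"])),
            "Klubb": d["klubb"],
        })
    return rows


# A few rows copied from the output above, used only as sample input.
SAMPLE = """\
Plass  Par   Poeng       %  Navn                                          MNR      Klubb

    1    1    68,0    61,4  Simon Rasmussen - Rolf Normann Hansen     13838  8056  Kolbotn BK - Ski BK
    3   26    52,9 *  58,9  Knut Karlsen - Bjørn Roar Haugen          13153 14791  Kolbotn BK
         6    17,0    52,9  Hege Johansen - Elisabeth Johansen        32129 32031  Kolbotn BK
   25   18  -118,1 *  30,1  Leif Åge Bergseng - Ulf Kopperud          18740 41968  Kolbotn BK - Hjerter Konge
"""

rows = parse_rows(SAMPLE)
```

The resulting list of dicts can be passed straight to pd.DataFrame(rows). In real use, table from the snippets above would replace SAMPLE. Rows the pattern does not recognise are silently dropped, so it is worth comparing len(rows) with the number of pairs reported on the page.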