I am trying to extract just part of the content between a pair of tags on a fairly simple webpage.
The page is http://bridge.no/var/ruter/html/0237/2016-10-18.htm, and I just want the first "table" between the <pre> tags. This is all I have done so far:
import pandas as pd
from bs4 import BeautifulSoup
import urllib.request
address = 'http://bridge.no/var/ruter/html/0237/2016-10-18.htm'
response = urllib.request.urlopen(address)
html = response.read()
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly to avoid the bs4 warning
pre = soup.find_all('pre')
I guess that was the easy part. Even after looking at examples using BeautifulSoup, I still have no idea how to go on from here. Looking at the source, I see two possible things to split on: either a name='scoretables' or the long line of dashes, "----------------------------------------------------------------".
After that I want to get that "table" into a pandas DataFrame, but I think I can handle that part myself. Does anyone have some "pro" tips on how to use BeautifulSoup to do what I want?
Answer 0 (score: 0)
To get the text of the first table, you just need to pull the first element from .contents:
soup = BeautifulSoup(html, "lxml")
table = soup.find("pre").contents[0]
Or pull the first text node:
table = soup.find("pre").find(text=True)
print(table)
Both will give you:
13 bord, 25 par, 1 blindpar. Antall spill: 27. Frirunde (*) gir innspilt prosent.
Plass Par Poeng % Navn MNR Klubb
1 1 68,0 61,4 Simon Rasmussen - Rolf Normann Hansen 13838 8056 Kolbotn BK - Ski BK
2 2 56,0 59,4 Åge Seiersten - Truls Bjerkås 27817 24421 Brandbu BK - Posten BK
3 26 52,9 * 58,9 Knut Karlsen - Bjørn Roar Haugen 13153 14791 Kolbotn BK
4 16 48,4 * 58,1 Kåre Bogø - Svein Arild Naas Olsen 29525 14358 Nittedal BK - Kolbotn BK
5 23 44,0 57,4 Per Arild Kvist - Raymond Frivåg 18774 11751 Ski BK - Kolbotn BK
6 14 34,0 55,7 Øyvind Bronken - Arne Almendingen 18763 4387 Kolbotn BK - Ski BK
7 9 32,0 55,4 Karl Johan Bjørn - Trond M. Thorgersen 7964 33905 Kolbotn BK
8 10 26,0 54,4 Arild Basma - Odd Arne Bertheussen 25130 11589 Kirkenes BK - Kolbotn BK
9 5 23,0 53,9 Gerd Irene Knutsen - Einar Knutsen 30543 30542 Kolbotn BK
10 3 17,0 52,9 Bjørn Tore Hallén - Sven-Åge Lund 27205 27206 Posten BK
6 17,0 52,9 Hege Johansen - Elisabeth Johansen 32129 32031 Kolbotn BK
12 13 14,0 52,4 Helge Lian - Bård Lian 38230 31391 Kolbotn BK
13 4 11,0 51,9 Sven Pran - Bjørn Arne Ruud 9704 23063 Kolbotn BK
14 21 2,3 * 50,4 Olav Hjerkinn - Marit Hjerkinn 675 25417 Ski BK - Kolbotn BK
15 17 -8,0 48,7 Geir Liabø - Bjarne Erlandsen 32924 11158 Kolbotn BK - Ski BK
16 19 -10,0 48,3 Tove Wikerholmen - Per Gunnar E. Frislid 30926 40386 Kolbotn BK
17 20 -13,0 47,8 Arnold Digre - Tim Nørgaard 5683 29591 Kolbotn BK - Bridgekameratene
18 22 -17,0 47,1 Trond Østlie - Arvid Ek 33024 32178 Kolbotn BK
19 15 -27,0 45,5 Andreas Jansen - Dag Amund Lie 38503 3636 Kolbotn BK - Ski BK
20 11 -39,4 * 43,4 John Sandberg - Berit Hornhammar 18745 25418 Kolbotn BK
21 12 -40,5 * 43,2 Toralf Brandvoll - Else Heldre 35159 30744 Kolbotn BK
22 7 -42,8 * 42,8 Mette Hugin - Liv Kongelf 41332 37002 Kolbotn BK
23 8 -69,8 * 38,3 Laila K. Siltvedt - Wencke Thorstensen 39025 39026 Kolbotn BK
24 24 -93,4 * 34,3 Harald Molteberg - Terje J. Eriksen 18843 42314 Ski BK - Kolbotn BK
25 18 -118,1 * 30,1 Leif Åge Bergseng - Ulf Kopperud 18740 41968 Kolbotn BK - Hjerter Konge
Getting it into a df will take a regex, since there is no consistent delimiter.
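As a rough sketch of what such a regex could look like, the snippet below parses a few rows copied from the output above into a list of dicts. The names `ROW`, `to_float`, `parse_rows` and the keys `MNR1`, `MNR2` and `Frirunde` are made up for this example. The pattern assumes that player names never contain digits, that every row carries exactly two member numbers, and that the rank column is blank for a pair tied with the row above; it has only been tried against the sample shown here, so treat it as a starting point rather than a finished parser.

```python
import re

# Sample rows copied from the output above. In practice `table` would be
# the string returned by soup.find("pre").contents[0].
table = """\
13 bord, 25 par, 1 blindpar. Antall spill: 27. Frirunde (*) gir innspilt prosent.
Plass Par Poeng % Navn MNR Klubb
1 1 68,0 61,4 Simon Rasmussen - Rolf Normann Hansen 13838 8056 Kolbotn BK - Ski BK
3 26 52,9 * 58,9 Knut Karlsen - Bjørn Roar Haugen 13153 14791 Kolbotn BK
10 3 17,0 52,9 Bjørn Tore Hallén - Sven-Åge Lund 27205 27206 Posten BK
 6 17,0 52,9 Hege Johansen - Elisabeth Johansen 32129 32031 Kolbotn BK
15 17 -8,0 48,7 Geir Liabø - Bjarne Erlandsen 32924 11158 Kolbotn BK - Ski BK
"""

# Hypothetical pattern; \s+ is used between columns so it works whether the
# columns are separated by one space or by runs of spaces.
ROW = re.compile(
    r"""^\s*
    (?:(?P<plass>\d+)\s+)?          # rank, blank when tied with the row above
    (?P<par>\d+)\s+                 # pair number
    (?P<poeng>-?\d+,\d+)\s+         # points, decimal comma, may be negative
    (?P<frirunde>\*\s+)?            # optional asterisk marker
    (?P<prosent>\d+,\d+)\s+         # percentage, decimal comma
    (?P<navn>.+?)\s+                # names, assumed to contain no digits
    (?P<mnr1>\d+)\s+(?P<mnr2>\d+)\s+  # the two member numbers
    (?P<klubb>.+?)\s*$              # club(s), rest of the line
    """,
    re.VERBOSE,
)


def to_float(text):
    """Convert a decimal-comma number such as '-8,0' to a float."""
    return float(text.replace(",", "."))


def parse_rows(text):
    """Return one dict per line that matches ROW; other lines are skipped."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m is None:
            continue  # summary line, header, or anything unexpected
        rows.append({
            "Plass": int(m["plass"]) if m["plass"] else None,
            "Par": int(m["par"]),
            "Poeng": to_float(m["poeng"]),
            "Frirunde": m["frirunde"] is not None,
            "Prosent": to_float(m["prosent"]),
            "Navn": m["navn"],
            "MNR1": int(m["mnr1"]),
            "MNR2": int(m["mnr2"]),
            "Klubb": m["klubb"],
        })
    return rows


rows = parse_rows(table)
for row in rows:
    print(row)
```

Passing `rows` to `pd.DataFrame(rows)` should then give one column per key. Whether to fill the blank `Plass` of tied pairs (for example with a forward fill) is a choice left to you.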