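Extracting website data with BeautifulSoup

Time: 2018-12-23 17:59:39

Tags: python web-scraping beautifulsoup

I am trying to extract timetable data from this site. The content is contained in a div with the class .departures-table. I want to skip the first two rows and store the data in an array, but it is not working. I am clearly making a mistake but cannot find it. Thanks.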

    snav_live_departures_url = "https://www.snav.it/"
    headers = {'user-agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.3'}
    request = urllib.request.Request(snav_live_departures_url,headers=headers)
    html = urllib.request.urlopen(request).read()
    soup = BeautifulSoup(html,'html.parser')
    snav_live_departures = []
    snav_live_departures_table = list(soup.select('.departures-table div')) [2:]
    print(snav_live_departures_table)
    for div in snav_live_departures_table:
        div = div.select('departures-row')
        snav_live_departures.append({
            'TIME':div[4].text,
            'DEPARTURE HARBOUR':div[0].text,
            'ARRIVAL HARBOUR':div[1].text,
            'STATUS':td[3].select('span.tt-text')[0].text,
            'PURCHASE LINK':div[6].select('a')[0].attrs['href']
        })

2 answers:

Answer 0 (score: 2)

There are a few different things going on here:

  1. The HTML does not contain the desired data; it is loaded through a JavaScript callback, as you can see by viewing the page source and by inspecting the network requests in the developer tools.
  2. You are actually "lucky" that there is no data on the page; otherwise this code would have blown up with a NameError, because td is not in scope:

        'DEPARTURE HARBOUR':td[0].text,
  3. It is also unclear what you intended to do with each row, since those child elements are not <td> elements; they are all <div>s.

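I think you will probably be happiest mimicking the API call, stripping the JS callback text from the response, and working with the structured data.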
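A minimal sketch of that approach, using only the standard library. The helper names `strip_jsonp` and `fetch_departures` are hypothetical, the endpoint URL is left to the caller, and the sample payload is invented purely to show the unwrapping step; the real response's field names are not known here.

```python
import json
import re
import urllib.request

# Matches an optional "/*...*/" guard, a callback name, and the parenthesised payload.
_JSONP = re.compile(r'^\s*(?:/\*.*?\*/)?\s*[\w$.]+\s*\((.*)\)\s*;?\s*$', re.DOTALL)


def strip_jsonp(text):
    """Return the JSON payload of a JSONP response as a Python object."""
    match = _JSONP.match(text)
    if match is None:
        raise ValueError("response does not look like JSONP")
    return json.loads(match.group(1))


def fetch_departures(url, user_agent="Mozilla/5.0"):
    """Request the JSONP endpoint at `url` and return the decoded payload."""
    request = urllib.request.Request(url, headers={"user-agent": user_agent})
    with urllib.request.urlopen(request) as response:
        body = response.read().decode("utf-8")
    return strip_jsonp(body)


# Invented sample: the structure of the real payload is an assumption.
sample = '/**/jQuery12345({"departures": [{"time": "08:30"}]});'
parsed = strip_jsonp(sample)
print(parsed["departures"][0]["time"])
```

Matching the wrapper with a pattern, rather than slicing at fixed offsets, keeps the unwrapping independent of the callback name's length. A response in some other wrapper shape would raise ValueError here rather than be parsed incorrectly.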

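Answer 1 (score: 0)

As mentioned above, when dealing with JavaScript-driven pages like this one, you may need to watch the Network tab in your browser's Dev Tools to see how the data is loaded.

This code will produce a nice dictionary that you can parse as needed: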

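    import requests
    import json

    # JSONP endpoint: the response wraps the JSON payload in a callback call
    URL = 'https://booking.snav.it/api/v1/dashboard/nextDepartures?callback=jQuery12345&_=12345'

    r = requests.get(URL)
    s = r.content.decode('utf-8')
    # Drop the first 16 and last 2 characters (the callback wrapper), leaving plain JSON
    data = json.loads(s[16:len(s)-2])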
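Since the shape of the returned data is not shown here, one way to get oriented before parsing it is to print its top-level structure first. The sketch below uses a hypothetical helper, `describe`, and an invented stand-in for `data`; the real keys will differ.

```python
import json


def describe(obj, max_items=3):
    """Summarise a decoded JSON value: key/type pairs for dicts, length and first items for lists."""
    if isinstance(obj, dict):
        return {key: type(value).__name__ for key, value in obj.items()}
    if isinstance(obj, list):
        return {"length": len(obj), "first_items": obj[:max_items]}
    return type(obj).__name__


# Invented stand-in for the `data` variable; the real field names are unknown.
data = {"departures": [{"time": "08:30"}, {"time": "09:15"}], "updated": "12:00"}

print(describe(data))
print(json.dumps(data["departures"][0], indent=2))
```

Once the keys are known, the list of records can be looped over directly instead of indexing into HTML elements.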