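I'm trying to write a script to act as a backend "admin interface" of sorts for myself: it lets me input a URL that contains a team's football schedule, automatically extracts that schedule, and saves it to a file to be accessed later.
So far I can input a URL in the terminal, open that URL, iterate through every line of HTML at that URL, and strip enough HTML tags to display the two separate elements I want (at least the strings that contain them): the list of games and the list of dates for those games. They are kept in two separate lists, which I save as HTML files so I can view them in a browser and confirm the data I received.
Note: these files get their file names by parsing the URL.
Here is a sample URL I'm working with: www.fbschedules.com/ncaa-15/sec/2015-texas-am-aggies-football-schedule.php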
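To illustrate the note about file names, here is a minimal sketch of deriving a name from the URL's path. It assumes Python 3 (`urllib.parse`; in Python 2 the same function lives in `urlparse`), and the helper name `file_stem_from_url` is hypothetical. It takes the last path segment instead of a fixed index, so it does not depend on how deep the path is.

```python
import posixpath
from urllib.parse import urlsplit  # Python 2: from urlparse import urlsplit


def file_stem_from_url(url):
    """Return the last path segment of `url` without its extension (hypothetical helper)."""
    path = urlsplit(url).path
    stem, _ext = posixpath.splitext(posixpath.basename(path))
    return stem


url = "http://www.fbschedules.com/ncaa-15/sec/2015-texas-am-aggies-football-schedule.php"
stem = file_stem_from_url(url)

# One file for the games, one for the dates, mirroring the two lists.
games_file = stem + ".txt"
dates_file = stem + "-dates.txt"
```

The same call also works when the URL is typed without `http://`, because the whole string is then treated as the path and only its last segment is used.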
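The problem I'm facing now is twofold:
1) Removing all HTML from the two lists, so that the only thing left is the strings in their respective indices. I've tried BeautifulSoup, but I've been banging my head against a wall with it for the past day, combing through StackOverflow and trying different methods.
No dice (user error, I'm positive).
2) Then, in the list that contains the dates, combining each set of two indices (i.e. 0 and 1, 2 and 3, 4 and 5, etc.) into a single string at a single index of one list.
From there, I believe I've found a way to combine the two lists into a single list (there's a lesson on this in Learn Python the Hard Way, I believe, as well as plenty on StackOverflow), but these two are the real blockers for me right now.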
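Step 2 and the final merge can be sketched as follows. This is only an illustration under assumptions: the helper name `join_pairs` is hypothetical, and the sample values stand in for the already tag-free strings (the real lists come from the scraped page).

```python
def join_pairs(items, sep=" "):
    """Join items 0 and 1, 2 and 3, and so on (hypothetical helper).

    A trailing unpaired item is dropped, because zip() stops at the
    shorter of its inputs.
    """
    return [first + sep + second for first, second in zip(items[0::2], items[1::2])]


# Illustrative values only; the real lists come from the scraped page.
date_parts = ["Saturday", "Sep. 5", "Saturday", "Sep. 12"]
games = ["Arizona State Sun Devils", "Ball State Cardinals"]

dates = join_pairs(date_parts)
schedule = list(zip(dates, games))  # one (date, opponent) tuple per game
```

Slicing with a step of 2 gives the even-indexed and odd-indexed items, and `zip` walks them side by side. The same `zip` call then pairs each combined date with its game, provided both lists are in the same order.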
Here is the code I've written so far, with comments for each step, including the remaining steps that I don't have working code for yet:
# Import necessary modules
from urllib import urlopen
import sys
import urlparse

# Take user input to get the URL where schedule lives
team_url = raw_input("Insert the full URL of the team's schedule you'd like to parse: ")

# Parse the URL to grab the 'path' segment to whittle down and use as the file name
file_name = urlparse.urlsplit(team_url)

# Parse the URL to make the file name:
name_base = file_name.path
name_before = name_base.split("/")
name_almost = name_before[3]
name_after = name_almost.split(".")
name_final = name_after[0] + ".html"
name_final_s = name_after[0] + "sched" + ".html"

# Create an empty list to hold our HTML data:
team_data = []
schedule_data = []

# Grab the HTML file to then be written & parsed down to just team names:
for line in urlopen(team_url).readlines():
    if "tr" in line:
        if "a href=" in line:
            if "strong" in line:
                team_data.append(line.rstrip())

# Grab the HTML file to then be written & parsed down to just schedules:
for line in urlopen(team_url).readlines():
    if 'td class="cfb1"' in line:
        if "Buy" not in line:
            schedule_data.append(line.rstrip())
            # schedule_data[0::1] = [','.join(schedule_data[0::1])]

# Save team's game list file with contents of HTML:
with open(name_final, 'w') as fout:
    fout.write(str(team_data))

# Save team's schedule file with contents of HTML:
with open(name_final_s, 'w') as fout:
    fout.write(str(schedule_data))

# Remove all HTML tags from the game list file:
# Remove all HTML tags from the schedule list file:
# Combine necessary strings from the schedule list:
# Combine the two lists into a single list:
Any help is greatly appreciated!
Update: 5/27/2015, 9:42 AM PST
I played around with HTMLParser a bit, and I think I'm getting there. Here is the new code (still using this URL: http://www.fbschedules.com/ncaa-15/sec/2015-texas-am-aggies-football-schedule.php):
# Import necessary modules
from HTMLParser import HTMLParser
from urllib import urlopen
import sys
import urlparse
import os

# Take user input to get the URL where schedule lives
team_url = raw_input("Insert the full URL of the team's schedule you'd like to parse: ")

# Parse the URL to grab the 'path' segment to whittle down and use as the file name
file_name = urlparse.urlsplit(team_url)

# Parse the URL to make the file name:
name_base = file_name.path
name_before = name_base.split("/")
name_almost = name_before[3]
name_after = name_almost.split(".")
name_final = name_after[0] + ".txt"
name_final_s = name_after[0] + "-dates" + ".txt"

# Create an empty list to hold our HTML data:
team_data = []
schedule_data = []

# Grab the HTML file to then be written & parsed down to just team names:
for line in urlopen(team_url).readlines():
    if "tr" in line:
        if "a href=" in line:
            if "strong" in line:
                team_data.append(line.rstrip())

# Grab the HTML file to then be written & parsed down to just schedules:
for line in urlopen(team_url).readlines():
    if 'td class="cfb1"' in line:
        if "Buy" not in line:
            schedule_data.append(line.rstrip())
            # schedule_data[0::1] = [','.join(schedule_data[0::1])]

# Save team's game list file with contents of HTML:
with open(name_final, 'w') as fout:
    fout.write(str(team_data))

# Save team's schedule file with contents of HTML:
with open(name_final_s, 'w') as fout:
    fout.write(str(schedule_data))

# Create file name path from pre-determined directory and added string:
game_file = open(os.path.join('/Users/jmatthicks/Documents/' + name_final))
schedule_file = open(os.path.join('/Users/jmatthicks/Documents/' + name_final_s))

# Utilize Python's HTMLParser module via the MyHTMLParser class
class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print "Encountered a start tag:", tag

    def handle_endtag(self, tag):
        print "Encountered an end tag :", tag

    def handle_data(self, data):
        print "Encountered some data :", data

# Create a game instance of HTMLParser:
game_parser = MyHTMLParser()

# Create a schedule instance of HTMLParser:
sched_parser = MyHTMLParser()

# Create functions that open a file and feed each line to a parser:
def open_game():
    run = open(os.path.join('/Users/jmatthicks/Documents/' + name_final)).readlines()
    for x in run:
        game_parser.feed(x)

def open_sched():
    run = open(os.path.join('/Users/jmatthicks/Documents/' + name_final_s)).readlines()
    for x in run:
        sched_parser.feed(x)

open_game()
open_sched()

# Combine necessary strings from the schedule list:
# Combine the two lists into a single list:
# Save again as .txt files
# with open(name_final, 'w') as fout:
#     fout.write(str(team_data))
#
# with open(name_final_s, 'w') as fout:
#     fout.write(str(schedule_data))
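So now that the parsing works, I just need to strip all HTML tags from the strings completely, so that one file is left with only the opponents and the other with only the dates.
I'll keep working on it, and if no solution has been posted in the meantime, I'll post my results here.
Thanks for all the help and insight so far; this rookie really appreciates it.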
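One way to get from "Encountered some data" printouts to tag-free strings is to have the parser collect the text nodes instead of printing them. The sketch below assumes Python 3 (`html.parser`; in Python 2 the class is imported from `HTMLParser`). The names `TextExtractor` and `strip_tags` are hypothetical, and the sample markup is made up for illustration, so the real page may differ.

```python
from html.parser import HTMLParser  # Python 2: from HTMLParser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only text nodes, ignoring every tag (hypothetical helper class)."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.chunks = []

    def handle_data(self, data):
        # Skip whitespace-only text between tags.
        text = data.strip()
        if text:
            self.chunks.append(text)


def strip_tags(fragment):
    """Return the text content of an HTML fragment with all tags removed."""
    parser = TextExtractor()
    parser.feed(fragment)
    parser.close()  # flush any text still buffered by the parser
    return " ".join(parser.chunks)


# Made-up markup for illustration; the real page may differ.
line = '<td class="cfb2"><a href="/x"><strong>Arizona State Sun Devils</strong></a></td>'
opponent = strip_tags(line)
```

Applied to every entry of the two lists, for example `[strip_tags(item) for item in team_data]`, this would leave only the strings.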
Answer 0 (score: 0)
If you're curious how to do this with BeautifulSoup, here is part (1):
First, make sure you have the correct version installed:
$ pip install beautifulsoup4
Then, in your Python shell:
from bs4 import BeautifulSoup
from urllib import urlopen
team_url = "http://www.fbschedules.com/ncaa-15/sec/2015-texas-am-aggies-football-schedule.php"
text = urlopen(team_url).read()
soup = BeautifulSoup(text)
table = soup.find('table', attrs={"class": "cfb-sch"})
data = []
for row in table.find_all('tr'):
    data.append([cell.text.strip() for cell in row.find_all('td')])
print data
# should print out something like:
#[[u'2015 Texas A&M Aggies Football Schedule'],
# [u'Date', u'', u'Opponent', u'Time/TV', u'Tickets'],
# [u'SaturdaySep. 5',
# u'',
# u'Arizona State Sun Devils \r\n NRG Stadium, Houston, TX',
# u'7:00 p.m. CT\r\nESPN network',
# u'Buy\r\nTickets'],
# [u'SaturdaySep. 12',
# u'',
# u'Ball State Cardinals \r\n Kyle Field, College Station, TX',
# u'TBA',
# u'Buy\r\nTickets'],
# ...
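The rows printed above still have the weekday fused to the date (`SaturdaySep. 5`) and the venue attached to the opponent. A rough sketch of tidying them, assuming the cells always look like the sample output; the helper names `split_date` and `split_opponent` are hypothetical and the code is written for Python 3:

```python
import re

WEEKDAYS = "Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday"


def split_date(cell):
    """Split a fused cell such as 'SaturdaySep. 5' into (weekday, date)."""
    match = re.match(r"\s*(%s)\s*(.*)" % WEEKDAYS, cell, re.DOTALL)
    if match is None:
        # Not a date cell (for example a header); return it unchanged.
        return None, cell.strip()
    return match.group(1), match.group(2).strip()


def split_opponent(cell):
    """Split an 'opponent, line break, venue' cell into (opponent, venue)."""
    parts = [part.strip() for part in cell.splitlines() if part.strip()]
    opponent = parts[0] if parts else ""
    venue = " ".join(parts[1:])
    return opponent, venue


# One row copied from the sample output above.
row = ["SaturdaySep. 5", "",
       "Arizona State Sun Devils \r\n NRG Stadium, Houston, TX",
       "7:00 p.m. CT\r\nESPN network", "Buy\r\nTickets"]
weekday, date = split_date(row[0])
opponent, venue = split_opponent(row[2])
```

Header rows such as `['Date', '', 'Opponent', ...]` do not start with a weekday, so `split_date` returns `None` as the weekday for them, which gives a simple way to skip them.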
Answer 1 (score: 0)
Once you've identified the tags you need by looking at the page's HTML, this is fairly straightforward with BeautifulSoup. Here is the code:
import urllib2
from bs4 import BeautifulSoup

def main():
    url = 'http://www.fbschedules.com/ncaa-15/sec/2015-texas-am-aggies-football-schedule.php'
    html = urllib2.urlopen(url).read()
    soup = BeautifulSoup(html)
    table = soup.find("table", {"class": "cfb-sch"})

    # Working on the teams
    teams_td = table.findAll("td", {"class": "cfb2"})
    teams = []
    for t in teams_td:
        # Keep only the opponent name, which precedes the line break and venue
        teams.append(t.text.split('\r\n')[0].strip())

    # Working on the dates
    dates_td = table.findAll("td", {"class": "cfb1"})
    dates = []
    # In the HTML table, only one in every three cfb1 cells is the date
    for i in range(0, len(dates_td), 3):
        dates.append(dates_td[i].text)

    # Print everything
    for s in zip(dates, teams):
        print s

if __name__ == '__main__':
    main()
When you run it, you should get this:
(u'SaturdaySep. 5', u'Arizona State Sun Devils')
(u'SaturdaySep. 12', u'Ball State Cardinals')
(u'SaturdaySep. 19', u'Nevada Wolf Pack')
(u'SaturdaySep. 26', u'at Arkansas Razorbacks')
(u'SaturdayOct. 3', u'Mississippi State Bulldogs')
(u'SaturdayOct. 10', u'Open Date')
(u'SaturdayOct. 17', u'Alabama Crimson Tide')
(u'SaturdayOct. 24', u'at Ole Miss Rebels')
(u'SaturdayOct. 31', u'South Carolina Gamecocks')
(u'SaturdayNov. 7', u'Auburn Tigers')
(u'SaturdayNov. 14', u'Western Carolina Catamounts')
(u'SaturdayNov. 21', u'at Vanderbilt Commodores')
(u'Saturday\r\n Nov. 28', u'at LSU Tigers')
(u'SaturdayDec. 5', u'SEC Championship Game')
I hope this helps.
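Since the original goal was to save the schedule to a file for later access, the (date, opponent) pairs could be written out as CSV. This is only a sketch under assumptions: it targets Python 3, the function names `save_schedule` and `load_schedule` are hypothetical, and the two-column layout is an illustrative choice, not something the question specifies.

```python
import csv


def save_schedule(pairs, path):
    """Write (date, opponent) pairs to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as fout:
        writer = csv.writer(fout)
        writer.writerow(["date", "opponent"])
        for date, opponent in pairs:
            # Collapse stray whitespace such as the line break in 'Saturday\r\n Nov. 28'.
            writer.writerow([" ".join(date.split()), opponent])


def load_schedule(path):
    """Read the pairs back as a list of (date, opponent) tuples."""
    with open(path, newline="", encoding="utf-8") as fin:
        reader = csv.reader(fin)
        next(reader)  # skip the header row
        return [tuple(row) for row in reader]
```

Calling `save_schedule(zip(dates, teams), "schedule.csv")` in place of the print loop would store the schedule, and `load_schedule("schedule.csv")` would read it back later. The file name here is a placeholder.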