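I'm trying to get a simple web scrape up and running. The goal is to dump the dt, gm, tm and ntv classes into a CSV, eventually. For now the output is JSON, for clarity. One step at a time.
Here is the spider: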
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "schedule"
    start_urls = [
        'http://www.nba.com/schedules/national_tv_schedule/',
    ]

    def parse(self, response):
        for game in response.css('td'):
            yield {
                'date': game.css('td.dt::text').extract(),
                'time': game.css('td.tm::text').extract(),
            }
Pretty simple, but it spits out this (truncated for brevity):
[
{"date": ["Sat, Oct 1", " ", "Sun, Oct 2", "Mon, Oct 3", " ", " ", " ", " ", " ", " "], "time": ["7:30 pm", "8:00 pm", "8:00 pm", "2:30 pm", "8:00 pm", "8:00 pm", "8:30 pm", "9:00 pm", "10:00 pm", "10:00 pm", "7:00 pm", "7:00 pm", "8:00 pm", "8:00 pm", "10:00 pm", "10:30 pm", "2:30 pm", "7:00 pm", "10:00 pm", "10:30 pm", "7:00 pm", "7:00 pm", "7:30 pm", "7:30 pm", "8:00 pm", "10:30 pm", "10:00 pm"]},
{"date": [], "time": []},
{"date": [], "time": []},
{"date": [], "time": []},
{"date": [], "time": []},
{"date": [], "time": []},
{"date": ["Sat, Oct 1"], "time": []},
{"date": [], "time": []},
{"date": [], "time": ["7:30 pm"]},
{"date": [], "time": []},
{"date": [" "], "time": []},
{"date": [], "time": []},
{"date": [], "time": ["8:00 pm"]},
{"date": [], "time": []},
{"date": ["Sun, Oct 2"], "time": []},
{"date": [], "time": []},
{"date": [], "time": ["8:00 pm"]},
{"date": [], "time": []},
{"date": ["Mon, Oct 3"], "time": []},
{"date": [], "time": []},
{"date": [], "time": ["2:30 pm"]},
{"date": [], "time": []},
{"date": [" "], "time": []},
{"date": [], "time": []},
{"date": [], "time": ["8:00 pm"]},
{"date": [], "time": []},
{"date": [" "], "time": []},
{"date": [], "time": []},
{"date": [], "time": ["8:00 pm"]},
{"date": [], "time": []},
{"date": [" "], "time": []},
{"date": [], "time": []},
{"date": [], "time": ["8:30 pm"]},
{"date": [], "time": []},
{"date": [" "], "time": []},
{"date": [], "time": []},
{"date": [], "time": ["9:00 pm"]},
{"date": [], "time": []},
{"date": [" "], "time": []},
{"date": [], "time": []},
{"date": [], "time": ["10:00 pm"]},
{"date": [], "time": []},
{"date": [" "], "time": []},
{"date": [], "time": []},
{"date": [], "time": ["10:00 pm"]},
{"date": [], "time": []}
]
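The first dict has the right data in the right order, but it is not grouped by game. The dicts that follow do not line up with the data in the first dict. I tried a while statement to get rid of the line breaks, but with no success.
Any suggestions? I built this from the Scrapy tutorial. I know I will eventually need to fill in the correct dates.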
Answer 0 (score: 0)
You probably want to be more specific about which table and rows you select.
Look at the HTML where the schedule starts:
<div id="scheduleMain" style="margin:0 5px 0 0!important;">
<table border="0" cellpadding="0" cellspacing="0" class="genSchedTable tvindex">
<tr class="header">
<td colspan="4">NATIONAL TV SCHEDULE - 2016-17</td>
</tr>
<tr class="title">
<td class="date">Date</td>
<td class="game">Teams</td>
<td class="time">Time (ET)</td>
<td class="natTV">Network</td>
</tr>
<tr>
<td class="dt">Mon, Oct 3</td>
<td class="gm"><a href="/thunder">Oklahoma City</a> @ <a href="/real_madrid">Real Madrid</a><br>Preseason
</td>
<td class="tm">2:30 pm</td>
<td class="ntv"><img border="0" src="http://i.cdn.turner.com/nba/nba/images/shrinkee_NBATV.gif"><img border="0" src="http://i.cdn.turner.com/nba/nba/images/shrinkee_NBAC.gif"></td>
</tr>
...
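You can see that the table you are after is inside <div id="scheduleMain">.
In the for loop you should select table rows (<tr>) rather than table cells, and in each loop iteration select the cells for the time and the date: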
    def parse(self, response):
        for game in response.css('#scheduleMain > table tr:nth-child(n+3)'):
            yield {
                'date': game.css('td.dt::text').extract(),
                'time': game.css('td.tm::text').extract(),
            }
tr:nth-child(n+3) selects rows 3, 4, 5, ... (the first two rows are headers).
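The question also mentions filling in the correct dates later: rows after the first game of a day carry only a blank date cell. As a rough sketch of one way to handle that, the helper below takes the per-row dicts that the row-based parse yields (lists produced by .extract()) and carries the last seen date forward. The name clean_rows is made up for illustration and is not part of Scrapy; the sample rows are taken from the output shown in the question.

```python
def clean_rows(rows):
    """Yield one flat dict per game, carrying the last seen date forward.

    `rows` is assumed to be an iterable of dicts shaped like the spider
    output: {'date': [...], 'time': [...]}.
    """
    last_date = None
    for row in rows:
        # .extract() returns lists; take the first entry of each, if any
        date = (row.get("date") or [""])[0].strip()
        time = (row.get("time") or [""])[0].strip()
        if not time:
            continue  # no game time means this is not a game row
        if date:
            last_date = date  # a non-blank date starts a new day
        yield {"date": last_date, "time": time}


if __name__ == "__main__":
    sample = [
        {"date": ["Sat, Oct 1"], "time": ["7:30 pm"]},
        {"date": [" "], "time": ["8:00 pm"]},
        {"date": ["Sun, Oct 2"], "time": ["8:00 pm"]},
    ]
    for game in clean_rows(sample):
        print(game)
```

The same carry-forward logic could live directly inside parse, keeping last_date as a local variable in the loop. Once each item is a flat dict, running the spider with scrapy crawl schedule -o schedule.csv should write the CSV the question is aiming for.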