Scrapy: NBA schedule output is not collated correctly

Asked: 2016-10-02 15:40:47

Tags: python scrapy

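I'm trying to get a simple web scrape up and running. The goal is to dump the dt, gm, tm and ntv classes into a CSV, eventually. For now the output is JSON, for clarity. One step at a time.

Here is the spider: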

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "schedule"
    start_urls = [
        'http://www.nba.com/schedules/national_tv_schedule/',
    ]

    def parse(self, response):
        for game in response.css('td'):
            yield {
                'date': game.css('td.dt::text').extract(),
                'time': game.css('td.tm::text').extract(),
            }

Pretty simple, but it spits out the following (truncated for brevity):

[
{"date": ["Sat, Oct 1", " ", "Sun, Oct 2", "Mon, Oct 3", " ", " ", " ", " ", " ", " "], "time": ["7:30 pm", "8:00 pm", "8:00 pm", "2:30 pm", "8:00 pm", "8:00 pm", "8:30 pm", "9:00 pm", "10:00 pm", "10:00 pm", "7:00 pm", "7:00 pm", "8:00 pm", "8:00 pm", "10:00 pm", "10:30 pm", "2:30 pm", "7:00 pm", "10:00 pm", "10:30 pm", "7:00 pm", "7:00 pm", "7:30 pm", "7:30 pm", "8:00 pm", "10:30 pm", "10:00 pm"]},
{"date": [], "time": []},
{"date": [], "time": []},
{"date": [], "time": []},
{"date": [], "time": []},
{"date": [], "time": []},
{"date": ["Sat, Oct 1"], "time": []},
{"date": [], "time": []},
{"date": [], "time": ["7:30 pm"]},
{"date": [], "time": []},
{"date": [" "], "time": []},
{"date": [], "time": []},
{"date": [], "time": ["8:00 pm"]},
{"date": [], "time": []},
{"date": ["Sun, Oct 2"], "time": []},
{"date": [], "time": []},
{"date": [], "time": ["8:00 pm"]},
{"date": [], "time": []},
{"date": ["Mon, Oct 3"], "time": []},
{"date": [], "time": []},
{"date": [], "time": ["2:30 pm"]},
{"date": [], "time": []},
{"date": [" "], "time": []},
{"date": [], "time": []},
{"date": [], "time": ["8:00 pm"]},
{"date": [], "time": []},
{"date": [" "], "time": []},
{"date": [], "time": []},
{"date": [], "time": ["8:00 pm"]},
{"date": [], "time": []},
{"date": [" "], "time": []},
{"date": [], "time": []},
{"date": [], "time": ["8:30 pm"]},
{"date": [], "time": []},
{"date": [" "], "time": []},
{"date": [], "time": []},
{"date": [], "time": ["9:00 pm"]},
{"date": [], "time": []},
{"date": [" "], "time": []},
{"date": [], "time": []},
{"date": [], "time": ["10:00 pm"]},
{"date": [], "time": []},
{"date": [" "], "time": []},
{"date": [], "time": []},
{"date": [], "time": ["10:00 pm"]},
{"date": [], "time": []}
]

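The first dict has the right data in the right order, but the dates and times are not collated into per-game pairs. The dicts that follow do not line up with the data in the first dict either. I spent a while trying a while statement to unwrap the lines, but without success.

Any suggestions? I built this from the Scrapy tutorial. I know I will eventually need to fill in the correct dates.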

1 answer:

Answer 0: (score: 0)

You probably want to be more specific about the table and rows you select.

Look at the HTML where the schedule starts:

   <div id="scheduleMain" style="margin:0 5px 0 0!important;">
   <table border="0" cellpadding="0" cellspacing="0" class="genSchedTable tvindex">
      <tr class="header">
         <td colspan="4">NATIONAL TV SCHEDULE - 2016-17</td>
      </tr>
      <tr class="title">
         <td class="date">Date</td>
         <td class="game">Teams</td>
         <td class="time">Time (ET)</td>
         <td class="natTV">Network</td>
      </tr>
      <tr>
         <td class="dt">Mon, Oct 3</td>
         <td class="gm"><a href="/thunder">Oklahoma City</a> @ <a href="/real_madrid">Real Madrid</a><br>Preseason

         </td>
         <td class="tm">2:30 pm</td>
         <td class="ntv"><img border="0" src="http://i.cdn.turner.com/nba/nba/images/shrinkee_NBATV.gif"><img border="0" src="http://i.cdn.turner.com/nba/nba/images/shrinkee_NBAC.gif"></td>
      </tr>
      ...

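You can see that the table you are after sits inside <div id="scheduleMain">.

In the for loop you should select table rows (<tr>) rather than table cells, and in each loop iteration select the cells for the date and the time: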

def parse(self, response):
    for game in response.css('#scheduleMain > table tr:nth-child(n+3)'):
        yield {
            'date': game.css('td.dt::text').extract(),
            'time': game.css('td.tm::text').extract(),
        }

tr:nth-child(n+3) selects rows 3, 4, 5, ... (the first 2 rows are headers).
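
With one item per row, each field is still a list, and the output in the question suggests that later games on the same day carry a blank date cell (" "). Below is a minimal sketch of one way to flatten those lists and carry the last seen date forward before writing CSV. The helper name collate and the sample rows are assumptions made for illustration (the values are copied from the question's output), and it uses only the standard library, so it runs without Scrapy. Inside the spider itself, extract_first() in place of extract() would return a single string (or None) instead of a list.

```python
import csv
import io


def collate(rows):
    """Flatten the per-row lists and carry the last seen date forward.

    `rows` is assumed to look like the items yielded by the row-based
    spider, e.g. {'date': ['Mon, Oct 3'], 'time': ['2:30 pm']}.
    """
    last_date = None
    for row in rows:
        # Each field is a list of text nodes; join and trim it to one string.
        date = ''.join(row['date']).strip()
        time = ''.join(row['time']).strip()
        if not time:
            # Skip rows without a time cell (e.g. header or spacer rows).
            continue
        if date:
            # A non-blank date starts a new day; blank cells reuse the last one.
            last_date = date
        yield {'date': last_date, 'time': time}


# Hypothetical sample input; the values are taken from the question's output.
rows = [
    {'date': ['Sat, Oct 1'], 'time': ['7:30 pm']},
    {'date': [' '], 'time': ['8:00 pm']},
    {'date': ['Sun, Oct 2'], 'time': ['8:00 pm']},
    {'date': [], 'time': []},
]

games = list(collate(rows))

# Write the collated rows as CSV; io.StringIO stands in for a real file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['date', 'time'])
writer.writeheader()
writer.writerows(games)
csv_text = buffer.getvalue()
```

The same forward-fill could live inside parse() instead, by keeping a last_date variable across loop iterations. Scrapy can also write CSV directly with the -o option of scrapy crawl, so the csv module here is only needed when post-processing outside the spider.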