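python scrapy - scraping from an onclick popup dialog

Time: 2017-10-16 09:54:46

Tags: javascript jquery python scrapy

I am trying to use Scrapy and Python to scrape the links to all the videos and English transcripts from this site.

I got the spider to scrape all the video URLs from every page (NB: I'm useless at programming), but I can't figure out how to scrape the transcripts. The transcript dialog only pops up after a button is clicked, and the link to the transcript is found in that new popup. All the other tutorials I have read deal with POST requests, but this appears to be an AJAX GET request (so I'm completely at a loss). I have also seen posts that mention payloads and form controls, but I don't know what those would be for this site.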
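
A GET request of this kind carries its parameters in the URL's query string, so there is no form payload to reconstruct. A minimal sketch, assuming the endpoint and the vid parameter that appear in the spider code further down (the helper name transcription_url is hypothetical):

```python
from urllib.parse import urlencode

# Endpoint taken from the spider code in the question.
AJAX_ENDPOINT = "http://saltanat.org/ajax_transcription.php"


def transcription_url(video_id):
    """Build the GET URL that requests the transcript dialog for one video."""
    return AJAX_ENDPOINT + "?" + urlencode({"vid": video_id})


print(transcription_url("17394"))
```

The resulting URL can be opened in a browser to inspect what the dialog request returns before writing any spider code.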

Relevant HTML in the page before the button is clicked:



<span class="transcription make-cursor" onclick="showTranscriptionDialog('17394')">
  <img class="video-doclet-icons" src="images/transcript4.png"
       title="Download Transcription, Tercüme'yi indir, تحميل النص"
       alt="Transcription" data-pin-nopin="true"></span>
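
The number passed to showTranscriptionDialog is the only part of the onclick value needed for later requests. As a minimal sketch (the helper name extract_video_id is hypothetical, and it assumes the ID is always a run of digits in single quotes, as in the snippet above), it can be pulled out with a regular expression rather than fixed character offsets:

```python
import re

# Assumption: the onclick value looks like someFunction('12345').
_ONCLICK_ID = re.compile(r"\(\s*'(\d+)'\s*\)")


def extract_video_id(onclick):
    """Return the quoted numeric ID from an onclick value, or None."""
    match = _ONCLICK_ID.search(onclick)
    return match.group(1) if match else None


print(extract_video_id("showTranscriptionDialog('17394')"))  # prints 17394
```

Because the pattern does not depend on the function name or on the length of the ID, the same helper works for other onclick handlers of the same shape.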

After clicking (dialog popup)

Relevant HTML:

<span class="ui-corner-all" id="transcription-language-list17394"
      style="background-color: rgb(245, 243, 229); color: rgb(51, 51, 51);">
  <a class="transcription-language-list" target="_blank"
     href="http://saltanat-transcriptions.s3.amazonaws.com/english/2017-08-08_en_NothingMeansEverything_SB.pdf"
     onmouseover="transcriptionLanguageMouseOver(17394)"
     onmouseout="transcriptionLanguageMouseOut(17394)"
     style="color: rgb(51, 51, 51);"> English </a></span>
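
The id transcription-language-list suggests the dialog may hold one link per language, in which case the English one has to be picked out. A minimal sketch using only Python's standard-library html.parser, useful for checking the selection logic offline; the class and function names are hypothetical, and matching on the visible link text "English" is an assumption based on the single fragment above:

```python
from html.parser import HTMLParser


class TranscriptLinkParser(HTMLParser):
    """Collect (language, href) pairs from transcription-language-list links."""

    def __init__(self):
        super().__init__()
        self.links = []    # finished (language, href) pairs
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("class") == "transcription-language-list":
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def english_transcript(html):
    """Return the href of the link labelled English, or None."""
    parser = TranscriptLinkParser()
    parser.feed(html)
    for language, href in parser.links:
        if language.lower() == "english":
            return href
    return None


fragment = (
    '<span class="ui-corner-all" id="transcription-language-list17394">'
    '<a class="transcription-language-list" target="_blank" '
    'href="http://saltanat-transcriptions.s3.amazonaws.com/english/'
    '2017-08-08_en_NothingMeansEverything_SB.pdf"> English </a></span>'
)
print(english_transcript(fragment))
```

Inside a Scrapy callback the same selection is usually done with an XPath on the response, as the parse_transcript method in the spider below attempts.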

My current spider code (not working)

import scrapy


class SuhbaSpider(scrapy.Spider):
    name = "suhbas"
    start_urls = [
        "http://saltanat.org/videos.php?topic=SheikhBahauddin&gopage={numb}".format(numb=numb)
        for numb in range(1, 23)
    ]

    def parse(self, response):
        # Video download links on the listing page.
        yield {
            'video': response.xpath("//span[@class='download make-cursor']/a/@href").extract(),
        }
        # Attempt to cut the video ID out of the onclick attribute inside XPath.
        videoid = response.xpath("substring(//span[@class='media-info make-cursor']/@onclick, 22, 5)").extract()
        for p in videoid:
            url = "http://saltanat.org/ajax_transcription.php?vid=" + p
            yield scrapy.Request(url, callback=self.parse_transcript)

    def parse_transcript(self, response):
        # Links in the dialog response whose href contains "english".
        yield {
            'transcript': response.xpath("//a[contains(@href, 'english')]/@href").extract(),
        }

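Any help would be greatly appreciated, thanks!

1 answer:

Answer 0 (score: 0)

OK, after playing around with the code I got a working solution. The problem was the "substring" command: it should not be placed in the "response.xpath" line. I used a different syntax to do the same thing (i.e. get the substring), as shown below.

Not working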

videoid = response.xpath("substring(//span[@class='media-info make-cursor']/@onclick, 22, 5)").extract()
for p in videoid:
    url = "http://saltanat.org/ajax_transcription.php?vid=" + p

Replace it with this working part

fullvideoid = response.xpath("//span[@class='media-info make-cursor']/@onclick").extract()

for videoid in fullvideoid:
    # Keep only the ID: skip the first 21 characters and the trailing 2.
    url = "http://saltanat.org/ajax_transcription.php?vid=" + videoid[21:-2]