Problem scraping earnings call transcripts from Seeking Alpha

Asked: 2019-09-18 14:59:15

Tags: web-scraping scrapy

I am trying to collect earnings call transcripts from Seeking Alpha for a research project (I am a PhD student). I found some code online that extracts the transcripts and stores them in a .json file, and I have adapted it to rotate user agents. However, the code only extracts the first page of each transcript, because of these lines:

body = response.css('div#a-body p.p1')
chunks = body.css('p.p1')

The pages are represented by a series of <p> elements, with the classes .p1, .p2, .p3, etc. indicating the page numbers. I have tried many things, such as replacing the code above with:

response.xpath('//div[@id="a-body"]/p')

But I am unable to extract the complete transcript (only the first page). The full code is below:

import scrapy
# This enum lists the stages of each transcript.
from enum import Enum

import random
# SRC: https://developers.whatismybrowser.com/useragents/explore/
user_agent_list = [
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/37.0.2062.94 Chrome/37.0.2062.94 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.85 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.0',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
    # Internet Explorer
    'Mozilla/4.0 (compatible; MSIE 9.0; Windows NT 6.1)',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko',
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)',
    'Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko',
    'Mozilla/5.0 (Windows NT 6.2; WOW64; Trident/7.0; rv:11.0) like Gecko',
    'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko',
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0)',
    'Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko',
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64; Trident/7.0; rv:11.0) like Gecko',
    'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0)',
    'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)',
    'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)'
]

Stage = Enum('Stage', 'preamble execs analysts body')
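# Note: Enum numbers the members from 1, so Stage(1) is 'preamble', Stage(2)
# 'execs', Stage(3) 'analysts' and Stage(4) 'body'. The integer `mode` in
# parse_transcript below steps through these same values.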
# Some transcript preambles are concatenated on a single line. This list is used
# to separate the title and date sections of the string.
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
transcripts = {}

class TranscriptSpider(scrapy.Spider):
    name = 'transcripts'
    custom_settings = {
        'DOWNLOAD_DELAY': 2 # 0.25 == 250 ms of delay, 1 == 1000ms of delay, etc.
    }
    start_urls = ['http://seekingalpha.com/earnings/earnings-call-transcripts/1']

    def parse(self, response):
        # Follows each transcript page's link from the given index page.
        for href in response.css('.dashboard-article-link::attr(href)').extract():
            user_agent = random.choice(user_agent_list)
            yield scrapy.Request(response.urljoin(href), callback=self.parse_transcript,headers={'User-Agent': user_agent})

        # Follows the pagination links at the bottom of given index page.
        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
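            # Note: this request keeps Scrapy's default User-Agent; the
            # user-agent rotation above only applies to the transcript requests.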
            yield scrapy.Request(next_page, callback=self.parse)

    def parse_transcript(self, response):
        i = 0
        transcript = {}
        details = {}
        execs = []
        analysts = []
        script = []
        mode = 1

        # As the pages are represented by a series of `<p>` elements we have to do this the
        # old-fashioned way - breaking it into chunks and iterating over them.
        body = response.css('div#a-body p.p1')
        chunks = body.css('p.p1')
        while i < len(chunks):
            # A chunk without <strong> text is ordinary content for the current
            # section. A heading (bold text) marks the start of the next section
            # and bumps `mode` in the else branch below - except in the body
            # (mode 4), where headings just introduce the next speaker.
            if (len(chunks[i].css('strong::text').extract()) == 0) or (mode == 4):
                currStage = Stage(mode)
                # If we're on the preamble stage, each bit of data is extracted
                # separately as they all have their own key in the JSON.
                if currStage == Stage['preamble']:
                    # If we're on the first line of the preamble, that's the
                    # company name, stock exchange and ticker acronym (or should
                    # be - see below)
                    if i == 0:
                        # Checks to see if the second line is a heading. If not,
                        # everything is fine.
                        if len(chunks[1].css('strong::text').extract()) == 0:
                            details['company'] = chunks[i].css('p::text').extract_first()
                            if " (" in details['company']:
                                details['company'] = details['company'].split(' (')[0]
                            # If a specific stock exchange is not listed, it
                            # defaults to NYSE
                            details['exchange'] = "NYSE"
                            details['ticker'] = chunks.css('a::text').extract_first()
                            if ":" in details['ticker']:
                                ticker = details['ticker'].split(':')
                                details['exchange'] = ticker[0]
                                details['ticker'] = ticker[1]
                        # However, if it is, that means this line contains the
                        # full, concatenated preamble, so everything must be 
                        # extracted here
                        else:
                            details['company'] = chunks[i].css('p::text').extract_first()
                            if " (" in details['company']:
                                details['company'] = details['company'].split(' (')[0]
                            # if a specific stock exchange is not listed, default to NYSE
                            details['exchange'] = "NYSE"
                            details['ticker'] = chunks.css('a::text').extract_first()
                            if ":" in details['ticker']:
                                ticker = details['ticker'].split(':')
                                details['exchange'] = ticker[0]
                                details['ticker'] = ticker[1]
                                titleAndDate = chunks[i].css('p::text').extract()[1]
                                for date in months:
                                    if date in titleAndDate:
                                        splits = titleAndDate.split(date)
                                        details['title'] = splits[0]
                                        details['date'] = date + splits[1]
                    # Otherwise, we're onto the title line.
                    elif i == 1:
                        title = chunks[i].css('p::text').extract_first()
                        # This should never be the case, but just to be careful
                        # I'm leaving it in.
                        if len(title) <= 0:
                            title = "NO TITLE"
                        details['title'] = title
                    # Or the date line.
                    elif i == 2:
                        details['date'] = chunks[i].css('p::text').extract_first()
                # If we're onto the 'Executives' section, we create a list of
                # all of their names, positions and company name (from the 
                # preamble).
                elif currStage == Stage['execs']:                    
                    anExec = chunks[i].css('p::text').extract_first().split(" - ")
                    # This covers the case where the name and position are
                    # separated by an en dash rather than a plain hyphen.
                    if len(anExec) <= 1:
                        anExec = chunks[i].css('p::text').extract_first().split(" – ")
                    name = anExec[0]
                    if len(anExec) > 1:
                        position = anExec[1]
                    # Again, this should never be the case, as an Exec-less
                    # company would find it hard to get much done.
                    else:
                        position = ""
                    execs.append((name,position,details['company']))
                # This does the same, but with the analysts (who never seem
                # to be separated by en dashes, for some reason).
                elif currStage == Stage['analysts']:
                    name = chunks[i].css('p::text').extract_first().split(" - ")[0]
                    company = chunks[i].css('p::text').extract_first().split(" - ")[1]
                    analysts.append((name,company))
                # This strips the transcript body of everything except simple
                # HTML, and stores that.
                elif currStage == Stage['body']:
                    line = chunks[i].css('p::text').extract_first()
                    html = "p>"
                    if line is None:
                        line = chunks[i].css('strong::text').extract_first()
                        html = "h1>"
                    script.append("<"+html+line+"</"+html)
            else:
                mode += 1
            i += 1

        # Adds the various arrays to the dictionary for the transcript
        details['exec'] = execs 
        details['analysts'] = analysts
        details['transcript'] = ''.join(script)

        # Adds this transcript to the dictionary of all scraped
        # transcripts, and yield that for the output
        transcript["entry"] = details
        yield transcript
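
For reference, I run the spider with Scrapy's runspider command, writing the output to JSON (assuming the code above is saved as transcript_spider.py):

scrapy runspider transcript_spider.py -o transcripts.json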

I have been stuck on this for a week (I am still new to Python and web scraping), so it would be great if someone smarter than me could take a look!

1 Answer:

Answer 0 (score: 1):

It seems the transcripts are spread across several pages.

So I think you have to add a section to your parse_transcript method in which you find the link to the next page of the transcript, open it, and feed it back to parse_transcript.

Something like this:

# Follows the pagination links at the bottom of the transcript page.
next_page = response.css(YOUR CSS SELECTOR GOES HERE).extract_first()
if next_page is not None:
   next_page = response.urljoin(next_page)
   yield scrapy.Request(next_page, callback=self.parse_transcript)
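
The li.next a::attr(href) selector you already use for the index pages in parse may be a reasonable starting point, but you will have to check the HTML of an actual transcript page to find the right selector for its pagination links.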

Obviously, you will have to modify your parse_transcript method so that it does not only parse the paragraphs extracted from the first page. You will have to make this part more general:

body = response.css('div#a-body p.p1')
chunks = body.css('p.p1')
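
For instance, one common Scrapy pattern for stitching a multi-page item together is to carry the chunks collected so far through response.meta, and to only build the transcript entry once the last page has been reached. A rough sketch (untested against the live site; the next-page selector is the same placeholder as above, build_entry stands in for your existing parsing loop, and the p[class^="p"] selector assumes the page classes really are p1, p2, p3, ... as described in the question):

def parse_transcript(self, response):
    # Chunks carried over from earlier pages of this transcript, if any.
    chunks = response.meta.get('chunks', [])
    # Match the paragraphs of every page (p.p1, p.p2, p.p3, ...), not only p.p1.
    chunks = chunks + response.css('div#a-body p[class^="p"]')

    next_page = response.css(YOUR CSS SELECTOR GOES HERE).extract_first()
    if next_page is not None:
        # More pages follow: pass the accumulated chunks along with the request.
        yield scrapy.Request(response.urljoin(next_page),
                             callback=self.parse_transcript,
                             meta={'chunks': chunks})
    else:
        # Last page reached: run the existing while-loop over the full list.
        yield from self.build_entry(chunks)  # placeholder for your parsing code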