How do I extract text from a variable using Scrapy?

Time: 2017-07-30 20:38:22

Tags: python scrapy

I'm using Scrapy to scrape a business directory, and I've run into a problem trying to extract data from a variable. Here is the code:

    def parse_page(self, response):
        url = response.meta.get('URL')

        # Parse the locations area of the page
        locations = response.css('address::text').extract()
        # Takes the City and Province, replaces non-breaking spaces and strips
        # whitespace; they are still together though.
        city_province = locations[1].replace(u'\xa0', u' ').strip()
        # List of all social links that the business has
        social = response.css('.entry-content > div:nth-child(2) a::attr(href)').extract()

        add_info = response.css('ul.list-border li').extract()
        year = ""

        for info in add_info:
            if 'Year' in info:
                year = info
            else:
                pass

        yield {
            'title': response.css('h1.entry-title::text').extract_first().strip(),
            'description': response.css('p.mb-double::text').extract_first(),
            'phone_number': response.css('div.mb-double ul li::text').extract_first(default="").strip(),
            'email': response.css('div.mb-double ul li a::text').extract_first(default=""),
            'address': locations[0].strip(),
            'city': city_province.split(' ', 1)[0].replace(',', ''),
            'province': city_province.split(' ', 1)[1].replace(',', '').strip(),
            'zip_code': locations[2].strip(),
            'website': response.css('.entry-content > div:nth-child(2) > ul:nth-child(2) > li:nth-child(1) > a:nth-child(1)::attr(href)').extract_first(default=''),
            'facebook': response.css('.entry-content > div:nth-child(2) > ul:nth-child(2) > li:nth-child(2) > a:nth-child(1)::attr(href)').extract_first(default=''),
            'twitter': response.css('.entry-content > div:nth-child(2) > ul:nth-child(2) > li:nth-child(3) > a:nth-child(1)::attr(href)').extract_first(default=''),
            'linkedin': response.css('.entry-content > div:nth-child(2) > ul:nth-child(2) > li:nth-child(4) > a:nth-child(1)::attr(href)').extract_first(default=''),
            'year': year,
            'employees': response.css('.list-border > li:nth-child(2)::text').extract_first(default="").strip(),
            'key_contact': response.css('.list-border > li:nth-child(3)::text').extract_first(default="").strip(),
            'naics': response.css('.list-border > li:nth-child(4)::text').extract_first(default="").strip(),
            'tags': response.css('ul.biz-tags li a::text').extract(),
        }

The problem I'm running into comes from this part:

        add_info = response.css('ul.list-border li').extract()
        year = ""

        for info in add_info:
            if 'Year' in info:
                year = info
            else:
                pass

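The code checks whether the info is "Year Established". However, it returns HTML, and I want it to give me just the year. add_info = response.css('ul.list-border li::text').extract() will print the year, but how do I do that inside the for loop?

Whenever 'Year' is in info, the output looks like this: <li><span>Year Established:</span> 1998</li>. I want the year, not the HTML.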

1 Answer:

Answer 0 (score: 1)

Add the following function.

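def getYear(yearnum):
    # Skip the first 35 characters, i.e. '<li><span>Year Established:</span> '
    yearnum1 = str(yearnum[35:])
    # Keep the next 4 characters, which hold the year
    yearnum2 = str(yearnum1[:4])
    return yearnum2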

Then replace your for statement with the following.

for info in add_info:
    if 'Year' in info:
        yearanswer = getYear(info)
    else:
        pass

It will then take the 4-digit number out of the long string and put it in the string yearanswer. If you print yearanswer, it should print 1998. It did for me!
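Note that the fixed offset of 35 only lines up when the text before the year is exactly <li><span>Year Established:</span> followed by one space. If the markup might vary a little, a regular expression may be a safer way to pull out the digits. Below is a minimal sketch of that idea, not something tested against the real site: the helper name extract_year is made up for illustration, and the second item in the sample list is invented data.

```python
import re


def extract_year(li_html):
    """Return the first 4-digit number right after a closing </span>, or ''."""
    # \s* tolerates any amount of whitespace between </span> and the digits
    match = re.search(r'</span>\s*(\d{4})', li_html)
    return match.group(1) if match else ''


# Sample strings shaped like the output of
# response.css('ul.list-border li').extract().
# The first comes from the question; the second is hypothetical.
add_info = [
    '<li><span>Year Established:</span> 1998</li>',
    '<li><span>Employees:</span> 12</li>',
]

year = ''
for info in add_info:
    if 'Year' in info:
        year = extract_year(info)

print(year)
```

With the sample data above, year ends up holding the string '1998', and items that have no 4-digit number after the </span> give an empty string instead of a wrong slice.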