String variable only takes the first letter of the string instead of the whole word

Time: 2016-12-18 03:14:06

Tags: python scrapy

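I have written some code to scrape parts of the UK Companies House website. Some fields are sometimes missing, so the code has an if/else statement that checks whether the XPath matches anything and, if it does not, assigns "n/a" to the variable. Without this, my lists get out of step and I start returning the wrong date of birth for each person (in other words, I have to force the dateofbirths variable to take a string to keep everything in order).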

The problem I am having is that the code

dateofbirths = "n/a"

returns only the first letter (that is, in this case I get the string "n" rather than the full "n/a" when I work with it).

Does anyone know why this happens?

The full code is below:

import scrapy
import re

from CompaniesHouse.items import CompanieshouseItem

class CompaniesHouseSpider(scrapy.Spider):
    name = "companieshouse"
    allowed_domains = ["companieshouse.gov.uk"]
    start_urls = ["https://beta.companieshouse.gov.uk/company/OC361003/officers"]

    def parse(self, response):
        for count in range(0,100):
            for sel in response.xpath('//*[@id="content-container"]'):
                string1 = "officer-name-" + str(count)
                names = sel.xpath('//*[@id="%s"]/a/text()' %string1).extract()
                names = [name.strip() for name in names]
                namerefs = sel.xpath('//*[@id="%s"]/a/@href' %string1).re(r'(?<=/officers/).*?(?=/appointments)')
                namerefs = [nameref.strip() for nameref in namerefs]
                string2 = "officer-role-" + str(count)
                roles = sel.xpath('//*[@id="%s"]/text()' %string2).extract()
                roles = [role.strip() for role in roles]
                string3 = "officer-date-of-birth-" + str(count)
                if sel.xpath('//*[@id="%s"]/text()' %string3):
                    dateofbirths = sel.xpath('//*[@id="%s"]/text()' %string3).extract()
                else:
                    dateofbirths = "n/a"
                dateofbirths = [dateofbirth.strip() for dateofbirth in dateofbirths]
                result = zip(names, namerefs, roles, dateofbirths)
                for name, nameref, role, dateofbirth in result:
                   item = CompanieshouseItem()
                   item['name'] = name
                   item['nameref'] = nameref
                   item['role'] = role
                   item['dateofbirth'] = dateofbirth               
                   yield item

        next_page = response.xpath('//*[@class="pager"]/li/a[@class="page"][contains(., "Next")]/@href').extract()
        if next_page:
            next_href = next_page[0]
            next_page_url = "https://beta.companieshouse.gov.uk" + next_href
            request = scrapy.Request(url=next_page_url)
            yield request

1 Answer:

Answer 0 (score: 2)

Because dateofbirths is a string? Iterating over a string yields its individual characters:

>>> dateofbirths = "n/a"
>>> [dateofbirth.strip() for dateofbirth in dateofbirths]
['n', '/', 'a']
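
That would also account for seeing only "n" rather than all three characters: zip stops at its shortest argument, so when the other lists hold one entry each, only the first character gets paired. A minimal sketch, assuming one officer per count; the name, reference, and role values are made-up placeholders, not real scraped data:

```python
# Placeholder values standing in for one officer's scraped fields (hypothetical).
names = ["Jane Doe"]
namerefs = ["abc123"]
roles = ["LLP Member"]
dateofbirths = "n/a"  # a plain string, as in the else branch of the spider

# Iterating over a string yields its characters one at a time.
chars = [d.strip() for d in dateofbirths]

# zip stops at the shortest input, so only the first character is paired.
rows = list(zip(names, namerefs, roles, chars))

print(chars)
print(rows)
```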

Try:

>>> dateofbirths = ["n/a"]
>>> [dateofbirth.strip() for dateofbirth in dateofbirths]
['n/a']
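
In the spider, that means changing the else branch to dateofbirths = ["n/a"]. One way to avoid repeating the check for every field is a small helper. This is only a sketch: values_or_default is a hypothetical name, not part of Scrapy, and it assumes its argument is the list of strings that .extract() returns.

```python
def values_or_default(extracted, default="n/a"):
    """Return the stripped values, or a one-element list holding the default.

    `extracted` is assumed to be the list of strings returned by `.extract()`;
    this helper is hypothetical and not part of Scrapy.
    """
    cleaned = [value.strip() for value in extracted]
    # An empty list means the XPath matched nothing, so fall back to the default.
    return cleaned if cleaned else [default]


# An empty list models an officer with no date-of-birth element on the page.
missing = values_or_default([])
present = values_or_default(["  March 1970 "])

print(missing)
print(present)
```

Because the result is always a list, the later list comprehension and the zip call see one date-of-birth entry per officer, whether or not the field was present on the page.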