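I have written some code to scrape parts of the UK Companies House website. Some fields are occasionally missing, so the code has an if/else statement that checks whether the XPath matches anything and, if it does not, assigns "n/a" to the variable. Without this, my lists fall out of step and I start returning the wrong date of birth for each person (in other words, I have to force the dateofbirths variable to take a string so that everything stays in order).
The problem I have is that the code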
dateofbirths = "n/a"
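returns only the first letter (that is, in this case I end up with the string "n" rather than the full "n/a" when I work with it).
Does anyone know why this happens?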
The full code is below:
import scrapy
import re

from CompaniesHouse.items import CompanieshouseItem


class CompaniesHouseSpider(scrapy.Spider):
    name = "companieshouse"
    allowed_domains = ["companieshouse.gov.uk"]
    start_urls = ["https://beta.companieshouse.gov.uk/company/OC361003/officers",
    ]

    def parse(self, response):
        for count in range(0,100):
            for sel in response.xpath('//*[@id="content-container"]'):
                string1 = "officer-name-" + str(count)
                names = sel.xpath('//*[@id="%s"]/a/text()' %string1).extract()
                names = [name.strip() for name in names]
                namerefs = sel.xpath('//*[@id="%s"]/a/@href' %string1).re(r'(?<=/officers/).*?(?=/appointments)')
                namerefs = [nameref.strip() for nameref in namerefs]
                string2 = "officer-role-" + str(count)
                roles = sel.xpath('//*[@id="%s"]/text()' %string2).extract()
                roles = [role.strip() for role in roles]
                string3 = "officer-date-of-birth-" + str(count)
                if sel.xpath('//*[@id="%s"]/text()' %string3):
                    dateofbirths = sel.xpath('//*[@id="%s"]/text()' %string3).extract()
                else:
                    dateofbirths = "n/a"
                dateofbirths = [dateofbirth.strip() for dateofbirth in dateofbirths]
                result = zip(names, namerefs, roles, dateofbirths)
                for name, nameref, role, dateofbirth in result:
                    item = CompanieshouseItem()
                    item['name'] = name
                    item['nameref'] = nameref
                    item['role'] = role
                    item['dateofbirth'] = dateofbirth
                    yield item
        next_page = response.xpath('//*[@class="pager"]/li/a[@class="page"][contains(., "Next")]/@href').extract()
        if next_page:
            next_href = next_page[0]
            next_page_url = "https://beta.companieshouse.gov.uk" + next_href
            request = scrapy.Request(url=next_page_url)
            yield request
Answer 0 (score: 2)
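Because dateofbirths is a string? The list comprehension iterates over it one character at a time, and zip() then stops at its shortest input, so only the first character, "n", gets paired with the name: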
>>> dateofbirths = "n/a"
>>> [dateofbirth.strip() for dateofbirth in dateofbirths]
['n', '/', 'a']
Try:
>>> dateofbirths = ["n/a"]
>>> [dateofbirth.strip() for dateofbirth in dateofbirths]
['n/a']
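To see how this plays out in the zip() step of the spider, here is a minimal sketch that runs without Scrapy. The sample name and role values are made up for illustration, and texts_or_default is a hypothetical helper, not part of Scrapy or of the original code; it only shows one way to keep the fallback inside a list.

```python
def texts_or_default(values, default="n/a"):
    """Strip each extracted string; fall back to [default] if nothing was extracted."""
    cleaned = [value.strip() for value in values]
    return cleaned if cleaned else [default]


# Made-up sample data standing in for what .extract() would return.
names = ["SMITH, John"]
roles = ["LLP Member"]

# Bug: a bare string is iterated character by character, and zip() stops
# at its shortest input, so only the first character is kept.
broken = [char.strip() for char in "n/a"]
broken_rows = list(zip(names, roles, broken))

# Fix: the fallback is a one-element list, so zip() pairs the whole string.
fixed_rows = list(zip(names, roles, texts_or_default([])))

print(broken_rows)
print(fixed_rows)
```

In the spider, the same idea would replace the if/else branch: pass the result of .extract() through such a helper so that dateofbirths is always a list.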