Getting an empty result when trying to print dl elements from a URL inside a loop

Time: 2017-01-23 15:31:05

Tags: python html web-scraping

I have created a loop that runs through the following results pages: https://beta.companieshouse.gov.uk/search/companies?q=SW181Db&page=1

I now want to open each URL from the results pages in turn and scrape data from it. An example of a page linked from the results: https://beta.companieshouse.gov.uk/company/08569390

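I expected that defining properties_col, by selecting the columns by class as in the code below, would give me the contents of the tags, but all I get is what I believe is an empty result, []. The output in Python is [] x 25.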

My full code is below. Any ideas? Thanks and regards

Sub RoutingCheck()

    Dim I As Long, r1 As Range, r2 As Range

    For I = 2 To 456

        Set r1 = Range("A" & I)

        Set r2 = Range("B" & I)

        If r1.Value = 94 And r2.Value = -99 Then r2.Interior.Color = vbRed

    Next I

    'Error

    End Sub

2 Answers:

Answer 0: (score: 0)

base_url = 'https://beta.companieshouse.gov.uk'
  • Change the base URL

  • Remove the trailing slash

The first output:

https://beta.companieshouse.gov.uk/company/08569390
[<dl class="column-two-thirds">\n            <dt>Company status</dt>\n            <dd class="text data" id="company-status">\n                Dissolved\n            </dd>\n        </dl>, <dl class="column-two-thirds">\n            <dt>Company type</dt>\n            <dd class="text data" id="company-type">\n                Private limited Company\n            </dd>\n        </dl>]
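To illustrate why the trailing slash matters, here is a minimal sketch. The helper name build_url is hypothetical and not part of the original answer, and the base URL with a trailing slash is an assumption about what the question's code used:

```python
# Hypothetical helper (not from the original answer): join a base URL and a
# site-relative href so that exactly one slash separates them.
def build_url(base_url, href):
    return base_url.rstrip('/') + '/' + href.lstrip('/')

# Assumed starting point: a base URL that still has its trailing slash.
base_url = 'https://beta.companieshouse.gov.uk/'
href = '/company/08569390'

naive = base_url + href            # plain concatenation keeps both slashes
fixed = build_url(base_url, href)  # the trailing slash is removed first

print(naive)
print(fixed)
```

With the doubled slash the request may point at a different path than the real company page, which would be consistent with the empty result described in the question.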

Answer 1: (score: 0)

from urllib.parse import urljoin
base_url = 'https://beta.companieshouse.gov.uk/'
href = '/company/08569390'
urljoin(base_url, href)

Out:

'https://beta.companieshouse.gov.uk/company/08569390'

There is an extra / in base_url; use urljoin to avoid this problem.

If you build the URL with + instead, the output is:

'https://beta.companieshouse.gov.uk//company/08569390'
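Inside a loop, the same idea might look like the sketch below. The hrefs list is an assumption standing in for the links scraped from a results page; both paths are taken from URLs that appear in the question:

```python
from urllib.parse import urljoin

base_url = 'https://beta.companieshouse.gov.uk/'

# Assumed stand-in for hrefs collected from a results page.
hrefs = [
    '/company/08569390',
    '/search/companies?q=SW181Db&page=1',
]

# urljoin resolves each site-relative href against the base URL, so the
# slash at the end of base_url does not get doubled.
full_urls = [urljoin(base_url, href) for href in hrefs]

for url in full_urls:
    print(url)
```

Because each href starts with /, urljoin treats it as a path from the site root, so the result is the same whether or not base_url ends with a slash.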