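I have created a loop that goes through the following results pages: https://beta.companieshouse.gov.uk/search/companies?q=SW181Db&page=1
I now want to open the URLs on each results page in order and scrape data from them. Example result page: https://beta.companieshouse.gov.uk/company/08569390
I was hoping that by defining properties_col and selecting the columns by class, as in the code below, it would produce the contents of the tags, but it just gives me what I believe is a blank string, []. The output in Python is that, x 25.
My full code is below. Any ideas? Thanks and regards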
Sub RoutingCheck()
Dim I As Long, r1 As Range, r2 As Range
For I = 2 To 456
Set r1 = Range("A" & I)
Set r2 = Range("B" & I)
If r1.Value = 94 And r2.Value = -99 Then r2.Interior.Color = vbRed
Next I
'Error
End Sub
Answer 0 (score: 0)
Change the base URL by removing the trailing slash:
base_url = 'https://beta.companieshouse.gov.uk'
First output:
https://beta.companieshouse.gov.uk/company/08569390
[<dl class="column-two-thirds">\n <dt>Company status</dt>\n <dd class="text data" id="company-status">\n Dissolved\n </dd>\n </dl>, <dl class="column-two-thirds">\n <dt>Company type</dt>\n <dd class="text data" id="company-type">\n Private limited Company\n </dd>\n </dl>]
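A minimal sketch of the fix in this answer, assuming the company links are built by plain string concatenation of the base URL and each href. The helper name build_company_url is hypothetical and does not come from the question's code.

```python
def build_company_url(base_url, href):
    # Strip any trailing slash so that base_url + href never contains '//'
    # between the host and the path.
    return base_url.rstrip('/') + href


# Works whether or not the base URL was written with a trailing slash.
with_slash = build_company_url('https://beta.companieshouse.gov.uk/', '/company/08569390')
without_slash = build_company_url('https://beta.companieshouse.gov.uk', '/company/08569390')
print(with_slash)
print(without_slash)
```

Normalizing inside one helper means the rest of the loop does not have to care how base_url was typed.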
Answer 1 (score: 0)
from urllib.parse import urljoin
base_url = 'https://beta.companieshouse.gov.uk/'
href = '/company/08569390'
urljoin(base_url, href)
Out:
'https://beta.companieshouse.gov.uk/company/08569390'
There is an extra / in base_url; use urljoin to avoid this problem.
If you use + to join the URLs, the output is:
'https://beta.companieshouse.gov.uk//company/08569390'
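To make the difference concrete, here is a small side-by-side sketch of the two approaches. The hrefs list is an assumed stand-in for the links collected from a results page; only the single company path from the question is used.

```python
from urllib.parse import urljoin

base_url = 'https://beta.companieshouse.gov.uk/'
# Assumed stand-in for the hrefs scraped from a results page.
hrefs = ['/company/08569390']

# Plain concatenation keeps both slashes, producing '//company/...'.
concatenated = [base_url + href for href in hrefs]

# urljoin resolves the href against the base, so the path has a single slash.
joined = [urljoin(base_url, href) for href in hrefs]

print(concatenated[0])
print(joined[0])
```

urljoin gives the same result whether or not base_url ends with a slash, so it is the safer choice when the hrefs come straight from the page.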