Extracting part of the text from a web page

Time: 2017-12-03 18:09:59

Tags: javascript python

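I am trying to build a list of GetFile.do strings for one specific type of report: "Technical" reports. The problem is that the report type is labeled at the end of the string, so the code has to read the whole string and, if the report label checks out, go back and extract the GetFile.do value (the action).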

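Below is an example from the source web page (WWW.SEDAR.COM). The page contains a lot of unwanted material, but this is the part I want: when the code reads "Technical report", I want to extract the action so that I can use it to download the document. The problem is that many of the links on the page are irrelevant.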

<FORM name="form1512323126173" action="/GetFile.do?lang=EN&docClass=24&issuerNo=00021020&issuerType=03&projectNo=02627564&docId=4117642" method="post" target="AcceptTermsOfUse"><P><A HREF="javascript:submitFiling(document.form1512323126173,'AcceptTermsOfUse');" title="&docClass=24&issuerNo=00021020&issuerType=03&projectNo=02627564&docId=4117642" onmouseover="window.status='&docClass=24&issuerNo=00021020&issuerType=03&projectNo=02627564&docId=4117642'; return true;" onmouseout="window.status=''; return true;">Technical report (NI 43-101)

Here is an example that I am not interested in (from the same page):

<FORM name="form1512323126172" action="/GetFile.do?lang=EN&docClass=24&issuerNo=00021020&issuerType=03&projectNo=02627564&docId=4117645" method="post" target="AcceptTermsOfUse"><P><A HREF="javascript:submitFiling(document.form1512323126172,'AcceptTermsOfUse');" title="&docClass=24&issuerNo=00021020&issuerType=03&projectNo=02627564&docId=4117645" onmouseover="window.status='&docClass=24&issuerNo=00021020&issuerType=03&projectNo=02627564&docId=4117645'; return true;" onmouseout="window.status=''; return true;">Consent of qualified person (NI 43-101)

In summary, for the page above, I would like to see the following output:

action="/GetFile.do?lang=EN&docClass=24&issuerNo=00021020&issuerType=03&projectNo=02627564&docId=4117642"

1 answer:

Answer 0 (score: 0)

You can use the requests Python library together with BeautifulSoup.

Install the third-party libraries with the following commands:

pip install beautifulsoup4
pip install requests
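
With BeautifulSoup installed, the filtering step described in the question could look roughly like the sketch below. It is only an illustration under several assumptions: the function name `technical_report_actions`, the `SAMPLE_HTML` snippet (a cleaned-up, well-formed version of the two forms from the question, written with `&amp;` as it would likely appear in the raw page source), and the case-insensitive match on the text "technical report" are all made up for this example and are not taken from the site. The idea is to find each link whose label matches, read the form name out of its `javascript:submitFiling(document.<form name>, ...)` href, and then look up that form's `action`.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample modelled on the two forms quoted in the question.
SAMPLE_HTML = """
<form name="form1512323126173" action="/GetFile.do?lang=EN&amp;docClass=24&amp;issuerNo=00021020&amp;issuerType=03&amp;projectNo=02627564&amp;docId=4117642" method="post" target="AcceptTermsOfUse">
<p><a href="javascript:submitFiling(document.form1512323126173,'AcceptTermsOfUse');">Technical report (NI 43-101)</a></p>
</form>
<form name="form1512323126172" action="/GetFile.do?lang=EN&amp;docClass=24&amp;issuerNo=00021020&amp;issuerType=03&amp;projectNo=02627564&amp;docId=4117645" method="post" target="AcceptTermsOfUse">
<p><a href="javascript:submitFiling(document.form1512323126172,'AcceptTermsOfUse');">Consent of qualified person (NI 43-101)</a></p>
</form>
"""

# Captures the form name from "javascript:submitFiling(document.<name>, ...)".
FORM_REF = re.compile(r"document\.(\w+)")


def technical_report_actions(html, label="technical report"):
    """Return the action of every form whose link text contains `label`."""
    soup = BeautifulSoup(html, "html.parser")
    actions = []
    for link in soup.find_all("a", href=True):
        # Skip links whose visible text is not the report type we want.
        if label not in link.get_text().lower():
            continue
        match = FORM_REF.search(link["href"])
        if match is None:
            continue
        # `attrs` is needed here because `name=` would mean the tag name.
        form = soup.find("form", attrs={"name": match.group(1)})
        if form is not None and form.has_attr("action"):
            actions.append(form["action"])
    return actions


for action in technical_report_actions(SAMPLE_HTML):
    print(action)
```

For a live page, the HTML string could instead come from something like `requests.get(page_url).text`, where `page_url` is a placeholder for the listing page. The returned actions are relative paths, so they would still need to be joined with the site's base URL before downloading.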