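Python Web Scraping: read only the hrefs that contain "http"

Time: 2017-01-14 03:58:26

Tags: python web-scraping

I am trying to scrape a web page, just for learning. The page contains multiple "a" tags. Consider the following markup: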

<a href='\abc\def\jkl'> Something </a>
<a href ='http://www.google.com'> Something</a>

Now I want to read only the href attributes that contain http. My current code is:

for link in soup.find_all("a"):
    print link.get("href")

I want to change it so that it reads only the "http" links.

4 answers:

Answer 0 (score: 2)

You can use a regular expression like this:

import re
from bs4 import BeautifulSoup

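res = r"""<a href="\abc\def\jkl">Something</a>
<a href="http://www.google.com">something</a>"""

# Name the parser explicitly so bs4 does not have to guess one.
soup = BeautifulSoup(res, "html.parser")
# Keep only <a> tags whose href starts with "http:".
print(soup.find_all('a', {'href': re.compile('^http:.*')}))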

Output:

[<a href="http://www.google.com">something</a>]
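
The pattern handed to re.compile decides which links survive. As a rough sketch using only the standard library, the following compares the "^http:" prefix from this answer with a variant that also accepts https. The sample href values and the "^https?://" pattern are assumptions added for illustration, not part of the answer.

```python
import re

# Sample href values (hypothetical, for illustration only).
hrefs = [
    r"\abc\def\jkl",
    "http://www.google.com",
    "https://www.example.com",
    "httpfoo",
]

http_only = re.compile(r"^http:")          # prefix from the answer: plain http only
http_or_https = re.compile(r"^https?://")  # assumed variant: also accepts https

matched_http_only = [h for h in hrefs if http_only.search(h)]
matched_both = [h for h in hrefs if http_or_https.search(h)]

print(matched_http_only)
print(matched_both)
```

If the page mixes http and https links, the second pattern is probably the one you want to pass to find_all.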

Answer 1 (score: 2)

You can also use a "starts with" CSS selector:

print([a["href"] for a in soup.select('a[href^=http]')])

Demo:

In [1]: from bs4 import BeautifulSoup

In [2]: res = """
   ...: <a href="\abc\def\jkl">Something</a>
   ...: <a href="http://www.google.com">something</a>
   ...: """

In [3]: soup = BeautifulSoup(res, "html.parser")

In [4]: print([a["href"] for a in soup.select('a[href^=http]')])
[u'http://www.google.com']
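
The unquoted form a[href^=http] works because http is a plain identifier. A more specific prefix such as http:// contains ":" and "/", so the value has to be quoted. Below is a minimal sketch, assuming a bs4 version whose select() accepts comma-separated selectors; the extra sample links are made up for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup (hypothetical); the raw string keeps the backslashes literal.
html = r"""
<a href="\abc\def\jkl">Something</a>
<a href="http://www.google.com">something</a>
<a href="https://www.example.com">secure</a>
<a>no href at all</a>
"""

soup = BeautifulSoup(html, "html.parser")

# Quote the attribute value so ':' and '/' are legal inside the selector,
# and list both schemes explicitly.
selector = 'a[href^="http://"], a[href^="https://"]'
links = [a["href"] for a in soup.select(selector)]
print(links)
```

Tags without an href never match the attribute selector, so a["href"] is safe inside the list comprehension.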

Answer 2 (score: 1)

Just run this simple test to check whether the link contains the string http. It only takes a small addition to your code:

for link in soup.find_all('a'):
    href = link.get('href')  # None when the tag has no href attribute
    if href and 'http' in href:
        print(href)
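
A substring test also accepts values that merely mention http somewhere, such as /docs/http-basics. If that matters, one option is to check the URL scheme instead. The helper below is a hypothetical sketch built on the standard library's urllib.parse; the name is_http_url and the sample values are not from the answer.

```python
from urllib.parse import urlparse

def is_http_url(href):
    """Return True only if href has an http or https scheme."""
    if not href:  # covers None (tag without href) and the empty string
        return False
    return urlparse(href).scheme in ("http", "https")

# Sample values (hypothetical), including one that only contains "http".
candidates = [None, r"\abc\def\jkl", "/docs/http-basics", "http://www.google.com"]
kept = [h for h in candidates if is_http_url(h)]
print(kept)
```

Inside the loop above, the condition would then read `if is_http_url(link.get('href')):`.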

Answer 3 (score: 0)

Another approach:

Here link['href'] returns the full text of the href attribute.
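
As a minimal sketch of how link['href'] could be used for this task, assuming the same BeautifulSoup setup as in the other answers: the sample HTML and the name http_links are illustrative and not taken from this answer.

```python
from bs4 import BeautifulSoup

# Sample markup (hypothetical).
html = """
<a href="/relative/path">relative</a>
<a href="http://www.google.com">something</a>
<a>no href</a>
"""
soup = BeautifulSoup(html, "html.parser")

http_links = []
# href=True keeps only <a> tags that actually have an href attribute,
# so link["href"] cannot raise KeyError inside the loop.
for link in soup.find_all("a", href=True):
    if link["href"].startswith("http"):
        http_links.append(link["href"])

print(http_links)
```

Unlike link.get('href'), the subscript form raises KeyError when the attribute is missing, which is why the href=True filter is applied first.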