Search for strings with BeautifulSoup and write them to MongoDB

Time: 2014-04-05 18:30:14

Tags: python mongodb python-2.7 beautifulsoup

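I need to resolve 3 problems:

  1. I am trying to write a simple Python program that parses the table with id="dgContract" from a web page.

  2. Store page 1, page 2, page 3 ... page n of this table in MongoDB. I do not know how to work with MongoDB.

  3. Parse the link in the "Details" (详情) column and store it in MongoDB. The "View" (查看) link is relative, so the site prefix http://www.xxx.com/ has to be added in front of it, giving

    http://www.xxx.com/LicConDisp.aspx?CID=xxxxx

    (Open the image in a new window for a clearer view.)

  4. This is a snapshot of the source table.

    My code: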

    # -*- coding: utf-8 -*-
    # Python 2.7 with BeautifulSoup 3; unused imports removed
    from BeautifulSoup import BeautifulSoup  # for processing HTML
    from pymongo import MongoClient          # replaces the deprecated Connection; not used yet

    # MongoDB settings (not used yet, see problem 2)
    host = 'localhost'
    database = 'test'
    collection = 'compinfo'

    # Read the saved page
    with open('copy of out4.html', 'r') as f:
        html = f.read()

    soup = BeautifulSoup(html)
    table = soup.find('table', id="dgContract")
    rows = table.findAll('tr')
    store = []
    for tr in rows:
        cols = tr.findAll('td')
        row = []
        for td in cols:
            # Join every text node in the cell; find(text=True) returns only
            # the first one, which is often just whitespace before an <a> tag
            row.append(''.join(td.findAll(text=True)).strip())
        store.append('|'.join(row))

    print '\n'.join(store)
    

    But the output is as follows (open the image in a new window for a clearer view):

    Result of running the Python code:

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" >
    

    <HEAD>
    
        <title>查询</title>
    
        <meta content="Microsoft Visual Studio .NET 7.1" name="GENERATOR">
    
        <meta content="C#" name="CODE_LANGUAGE">
    
        <meta content="JavaScript" name="vs_defaultClientScript">
    
        <meta content="http://schemas.microsoft.com/intellisense/ie5" name="vs_targetSchema">
    
        <meta http-equiv="X-UA-Compatible" content="IE=EmulateIE7" />
    
    
        <link href="css/user.css" type="text/css" rel="stylesheet">
    
        <style type="text/css">
    
        .STYLE1 {FONT-SIZE: 12px; COLOR: #ffffff}
    
        .STYLE2 {FONT-SIZE: 14px; COLOR: #000000}
    
        .STYLE45 {FONT-SIZE: 12px}
    
        .STYLE51 {FONT-WEIGHT: bold; FONT-SIZE: 12px; FONT-FAMILY: "宋体"}
    
        .STYLE52 {FONT-WEIGHT: bold; FONT-SIZE: 12px; COLOR: #ffffff; FONT-FAMILY: "宋体"}
    
        </style>
    
    </HEAD>
    
    <body background="images/bg.jpg" MS_POSITIONING="GridLayout">
    
        <form name="Form1" method="post" action="ContractSearcher.aspx" id="Form1">
    
            <div align="center">
    
                <table borderColor="#c7c7c7" cellSpacing="0" cellPadding="0" border="1">
    
                    <tr>
    
                        <td class="tdBorder">
    
                            <!-- content -->
    
                            <!--显示用户信息条 -->
                            <!--内容主体:左侧为菜单,右侧为内容显示区 -->
    
                            <table height="350" cellSpacing="0" cellPadding="0" width="760" border="0">
                                <tr>
                                    <!--左侧菜单项 -->
                                    <td width="3">&nbsp;</td>
                                    <!--右侧内容显示区 -->
                                    <td vAlign="top" width="815" bgColor="#ffffff">
                                        <table width="100%">
                                            <tr>
                                                <td class="tdbigmidcenter">
                                                <table class="tablebigContent" cellspacing="0" rules="all" border="1" id="dgContract" width="815">
                                                    <tr bgcolor="PapayaWhip">
                                                    <td>numb</td><td>用户1</td><td>用户2</td>
                                                    <td>作者</td>
                                                    <td align="center">接受时间</td><td align="center">发送</td>
                                                    <td align="center">详情</td>
                                                    <td align="center">状态</td>
                                                    <td>version</td>
                                                   </tr>
                                                    <tr>
                                                    <td width="21%">HOPE-HT-YX-S-140331-120</td><td width="14%">
    
                                                                    A公司
    
                                                                </td><td width="14%">A学校</td><td width="5%">david</td><td align="center" width="10%">
    
                                                                    <a href="#" title="2014-3-31 15:19:45">2014-3-31</a>
    
                                                                </td><td align="center" width="10%">
    
                                                                    未发送
    
    
    
                                                                </td><td align="center" width="7%">
    
                                                                    <a href="LicConDisp.aspx?CID=91e13d7a-e812-428d-a5c2-532778ea4e89" target="_blank">查看</a>
    
                                                                </td><td align="center" width="10%">
    
                                                                    <a title="" href="ConStatusDisp.aspx?CID=91e13d7a-e812-428d-a5c2-532778ea4e89">已结束[<font color=red>通过</font>]</a>
    
                                                                </td><td align="center" width="5%">
    
                                                                    1.0
    
                                                                </td>
    
    </tr><tr>
    
        <td width="21%">HOPE-HT-YX-S-140328-106</td>
        <td width="14%">
    
                                                                    A公司
    
                                                                </td>
        <td width="14%">M公司</td><td width="5%">王明</td><td align="center" width="10%">
    
                                                                    <a href="#" title="2014-3-28 16:16:53">2014-3-28</a>
    
                                                                </td><td align="center" width="10%">
    
                                                                    未发货
    
    
    
                                                                </td><td align="center" width="7%">
    
                                                                    <a href="LicConDisp.aspx?CID=72648278-dbe3-4577-9154-23182e349b33" target="_blank">查看</a>
    
                                                                </td><td align="center" width="10%">
    
                                                                    <a title="" href="ConStatusDisp.aspx?CID=72648278-dbe3-4577-9154-23182e349b33">已结束[<font color=red>HOPECE140328-5 </font>]</a>
    
                                                                </td><td align="center" width="5%">
    
                                                                    1.0
    
                                                                </td>
    
    </tr><tr>
    
        <td width="21%">&nbsp;</td>
        <td width="14%">
    
                                                                    B公司
    
                                                                </td>
        <td width="14%">C中心</td><td width="5%">王明</td><td align="center" width="10%">
    
                                                                    <a href="#" title="2014-3-12 15:07:12">2014-3-12</a>
    
                                                                </td><td align="center" width="10%">
    
                                                                    <a href="#" title="2014-3-28 10:28:37">2014-3-28</a><br>
    
                                                                    [<font color=deeppink><strong>全<strong></font>]
    
                                                                </td><td align="center" width="7%">
    
                                                                    <a href="LicConDisp.aspx?CID=0526a587-85dc-484e-88f4-87967546678f" target="_blank">查看</a>
    
                                                                </td><td align="center" width="10%">
    
                                                                    <a title="9900014479" href="ConStatusDisp.aspx?CID=0526a587-85dc-484e-88f4-87967546678f">已结束[<font color=red>HOPETE140313-1 </font>][<font color=deeppink><strong>全<strong></font>]</a>
    
                                                                </td><td align="center" width="5%">
    
                                                                    1.0
    
                                                                </td>
    
    </tr><tr>
    
        <td width="21%">HOPE-HT-YX-S-140306-001</td>
        <td width="14%">
    
                                                                    A公司
    
                                                                </td>
        <td width="14%">A中心</td>
        <td width="5%">JACK</td><td align="center" width="10%">
    
                                                                    <a href="#" title="2014-3-7 9:48:47">2014-3-7</a>
    
                                                                </td><td align="center" width="10%">
    
                                                                    <a href="#" title="2014-3-28 10:28:46">2014-3-28</a><br>
    
                                                                    [<font color=deeppink><strong>全<strong></font>]
    
                                                                </td><td align="center" width="7%">
    
                                                                    <a href="LicConDisp.aspx?CID=dfec1630-e1d4-478c-9feb-415eedbd6184" target="_blank">查看</a>
    
                                                                </td><td align="center" width="10%">
    
                                                                    <a title="9900014479" href="ConStatusDisp.aspx?CID=dfec1630-e1d4-478c-9feb-415eedbd6184">已结束[<font color=red>HOPETE140317-4 </font>][<font color=deeppink><strong>全<strong></font>]</a>
    
                                                                </td><td align="center" width="5%">
    
                                                                    1.0
    
                                                                </td>
    
    </tr><tr>
    
        <td width="21%">HOPE-HT-YX-S-140228-102</td>
        <td width="14%">
    
                                                                    G公司
    
                                                                </td>
        <td width="14%">F公司</td>
        <td width="5%">david</td><td align="center" width="10%">
    
                                                                    未通过
    
                                                                </td><td align="center" width="10%">
    
                                                                    未发货
    
    
    
                                                                </td><td align="center" width="7%">
    
                                                                    <a href="LicConDisp.aspx?CID=9e19e1c9-7644-4392-9bdd-89e2bac346cd" target="_blank">查看</a>
    
                                                                </td><td align="center" width="10%">
    
                                                                    <a title="" href="ConStatusDisp.aspx?CID=9e19e1c9-7644-4392-9bdd-89e2bac346cd">已作废</a>
    
                                                                </td><td align="center" width="5%">
    
                                                                    1.0
    
                                                                </td>
    
    </tr><tr>
    
        <td width="21%">HOPE-HT-YX-S-140228-005</td>
        <td width="14%">
    
                                                                    T公司
    
                                                                </td>
        <td width="14%">J公司 </td>
        <td width="5%">jack</td><td align="center" width="10%">
    
                                                                    <a href="#" title="2014-2-28 14:54:26">2014-2-28</a>
    
                                                                </td><td align="center" width="10%">
    
                                                                    <a href="#" title="2014-3-28 10:55:11">2014-3-28</a><br>
    
                                                                    [<font color=deeppink><strong>全<strong></font>]
    
                                                                </td><td align="center" width="7%">
    
                                                                    <a href="LicConDisp.aspx?CID=45039bfb-ccb8-49f4-b8fe-27bc8cf59803" target="_blank">查看</a>
    
                                                                </td><td align="center" width="10%">
    
                                                                    <a title="9900014480" href="ConStatusDisp.aspx?CID=45039bfb-ccb8-49f4-b8fe-27bc8cf59803">已结束[<font color=red>HOPECE140228-10</font>][<font color=deeppink><strong>全<strong></font>]</a>
    
                                                                </td><td align="center" width="5%">
    
                                                                    1.0
    
                                                                </td>
    
    </tr><tr>
    
        <td width="21%">HOPE-HT-YX-S-140228-002</td>
        <td width="14%">
    
                                                                    S公司
    
                                                                </td>
        <td width="14%">V公司</td>
        <td width="5%">张军</td><td align="center" width="10%">
    
                                                                    <a href="#" title="2014-2-28 14:13:23">2014-2-28</a>
    
                                                                </td><td align="center" width="10%">
    
                                                                    <a href="#" title="2014-3-28 10:28:54">2014-3-28</a><br>
    
                                                                    [<font color=deeppink><strong>全<strong></font>]
    
                                                                </td><td align="center" width="7%">
    
                                                                    <a href="LicConDisp.aspx?CID=02a8d406-a826-4a5a-b466-f4bca2640307" target="_blank">查看</a>
    
                                                                </td><td align="center" width="10%">
    
                                                                    <a title="9900014479" href="ConStatusDisp.aspx?CID=02a8d406-a826-4a5a-b466-f4bca2640307">已结束[<font color=red>HOPETE140307-4 </font>][<font color=deeppink><strong>全<strong></font>]</a>
    
                                                                </td><td align="center" width="5%">
    
                                                                    1.0
    
                                                                </td>
    
    </tr><tr>
    
        <td width="21%">&nbsp;</td>
        <td width="14%">
    
                                                                    A公司
    
                                                                </td>
        <td width="14%">W公司</td><td width="5%">jack</td><td align="center" width="10%">
    
                                                                    <a href="#" title="2014-2-28 14:13:38">2014-2-28</a>
    
                                                                </td><td align="center" width="10%">
    
                                                                    <a href="#" title="2014-3-28 10:29:03">2014-3-28</a><br>
    
                                                                    [<font color=deeppink><strong>全<strong></font>]
    
                                                                </td><td align="center" width="7%">
    
                                                                    <a href="LicConDisp.aspx?CID=2684c70a-baea-4da4-911b-19cdbe627fd9" target="_blank">查看</a>
    
                                                                </td><td align="center" width="10%">
    
                                                                    <a title="9900014479" href="ConStatusDisp.aspx?CID=2684c70a-baea-4da4-911b-19cdbe627fd9">已结束[<font color=red>HOPETE140307-3 </font>][<font color=deeppink><strong>全<strong></font>]</a>
    
                                                                </td><td align="center" width="5%">
    
                                                                    1.0
    
                                                                </td>
    
    </tr><tr>
    
        <td width="21%">HOPE-HT-YX-S-140228-013</td>
        <td width="14%">
    
                                                                    B公司
    
                                                                </td><td width="14%">V公司</td>
                          <td width="5%">rose</td><td align="center" width="10%">
    
                                                                    <a href="#" title="2014-2-28 14:19:28">2014-2-28</a>
    
                                                                </td><td align="center" width="10%">
    
                                                                    <a href="#" title="2014-3-28 10:28:23">2014-3-28</a><br>
    
                                                                    [<font color=deeppink><strong>全<strong></font>]
    
                                                                </td><td align="center" width="7%">
    
                                                                    <a href="LicConDisp.aspx?CID=1204ad5e-4552-43af-a650-19b93f9d2514" target="_blank">查看</a>
    
                                                                </td><td align="center" width="10%">
    
                                                                    <a title="9900014479" href="ConStatusDisp.aspx?CID=1204ad5e-4552-43af-a650-19b93f9d2514">已结束[<font color=red>HOPETE140307-2 </font>][<font color=deeppink><strong>全<strong></font>]</a>
    
                                                                </td><td align="center" width="5%">
    
                                                                    1.0
    
                                                                </td>
    
    </tr><tr>
    
        <td width="21%">HOPE-HT-YX-S-140226-018</td>
        <td width="14%">
    
                                                                    C公司
    
                                                                </td><td width="14%">A中心</td>
                          <td width="5%">david</td><td align="center" width="10%">
    
                                                                    <a href="#" title="2014-2-26 14:56:14">2014-2-26</a>
    
                                                                </td><td align="center" width="10%">
    
                                                                    <a href="#" title="2014-3-14 15:27:44">2014-3-14</a><br>
    
                                                                    [<font color=deeppink><strong>全<strong></font>]
    
                                                                </td><td align="center" width="7%">
    
                                                                    <a href="LicConDisp.aspx?CID=a04cdcd2-5b5c-4182-a22d-a29399ab6991" target="_blank">查看</a>
    
                                                                </td><td align="center" width="10%">
    
                                                                    <a title="9900014388" href="ConStatusDisp.aspx?CID=a04cdcd2-5b5c-4182-a22d-a29399ab6991">已结束[<font color=red>HOPECE140228-1 </font>][<font color=deeppink><strong>全<strong></font>]</a>
    
                                                                </td><td align="center" width="5%">
    
                                                                    1.0
    
                                                                </td>
    
    </tr><tr align="right">
    
        <td colspan="9"><span>1</span>&nbsp;<a href="javascript:__doPostBack('dgContract$_ctl14$_ctl1','')">2</a>&nbsp;<a href="javascript:__doPostBack('dgContract$_ctl14$_ctl2','')">3</a></td>
    
    </tr>
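
For problems 1 and 3, the following is a minimal sketch rather than a definitive solution. It assumes Python 3 and BeautifulSoup 4 (the `bs4` package) instead of the Python 2.7 / BeautifulSoup 3 setup in the question. `SAMPLE_HTML` is a cut-down copy of the table above, `parse_contract_table` is a hypothetical helper name, and `http://www.xxx.com/` is the placeholder prefix from the question.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup  # assumption: BeautifulSoup 4 is installed

BASE_URL = "http://www.xxx.com/"  # placeholder site prefix from the question

# Cut-down copy of the table above, used only to exercise the sketch.
SAMPLE_HTML = """
<table id="dgContract">
  <tr bgcolor="PapayaWhip"><td>numb</td><td>详情</td><td>状态</td></tr>
  <tr>
    <td>HOPE-HT-YX-S-140331-120</td>
    <td><a href="LicConDisp.aspx?CID=91e13d7a-e812-428d-a5c2-532778ea4e89" target="_blank">查看</a></td>
    <td><a title="" href="ConStatusDisp.aspx?CID=91e13d7a-e812-428d-a5c2-532778ea4e89">已结束[<font color=red>通过</font>]</a></td>
  </tr>
  <tr align="right"><td colspan="9"><span>1</span>&nbsp;<a href="javascript:__doPostBack('dgContract$_ctl14$_ctl1','')">2</a></td></tr>
</table>
"""


def parse_contract_table(html, base_url=BASE_URL):
    """Return one dict per data row of the table with id="dgContract"."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id="dgContract")
    if table is None:
        return []
    records = []
    # The first row is the header, so start from the second one
    for tr in table.find_all("tr")[1:]:
        cells = tr.find_all("td")
        # Skip the pager row, which is a single cell spanning all columns
        if len(cells) == 1 and cells[0].get("colspan"):
            continue
        # get_text() joins every text node in a cell, including nested tags
        texts = [td.get_text(strip=True) for td in cells]
        # The "View" link is the one pointing at LicConDisp.aspx
        link = tr.find("a", href=lambda h: h and h.startswith("LicConDisp.aspx"))
        records.append({
            "cells": texts,
            "detail_url": urljoin(base_url, link["href"]) if link else None,
        })
    return records


if __name__ == "__main__":
    for record in parse_contract_table(SAMPLE_HTML):
        print("|".join(record["cells"]), record["detail_url"])
```

`urljoin` is used here so that the relative `href` is resolved against the site prefix instead of being concatenated by hand.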
    
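
For the paging part of problem 2, the pager links at the bottom of the table are `__doPostBack` calls, not ordinary URLs. In ASP.NET Web Forms such a call submits the form with the hidden fields `__EVENTTARGET` and `__EVENTARGUMENT` set. The sketch below (Python 3, standard library only) shows one way the targets could be pulled out of the pager row. `pager_targets` and `postback_fields` are hypothetical helper names, and the other hidden inputs of `Form1` (such as `__VIEWSTATE`) are an assumption, since they do not appear in the HTML posted above.

```python
import re

# Pager row copied from the end of the table above.
PAGER_HTML = """<td colspan="9"><span>1</span>&nbsp;<a href="javascript:__doPostBack('dgContract$_ctl14$_ctl1','')">2</a>&nbsp;<a href="javascript:__doPostBack('dgContract$_ctl14$_ctl2','')">3</a></td>"""

# Captures the event target, the event argument and the visible page label.
PAGER_LINK = re.compile(
    r"""<a href="javascript:__doPostBack\('([^']*)','([^']*)'\)">([^<]*)</a>"""
)


def pager_targets(html):
    """Map each page label to the (event target, event argument) of its link."""
    return {label: (target, arg) for target, arg, label in PAGER_LINK.findall(html)}


def postback_fields(target, argument, hidden_fields):
    """Build the form data for one postback.

    hidden_fields is assumed to hold the hidden inputs of Form1
    (for example __VIEWSTATE), which are not shown in the question.
    """
    fields = dict(hidden_fields)
    fields["__EVENTTARGET"] = target
    fields["__EVENTARGUMENT"] = argument
    return fields


if __name__ == "__main__":
    for label, (target, arg) in pager_targets(PAGER_HTML).items():
        print(label, target, repr(arg))
```

The resulting fields would then be sent as a POST to `ContractSearcher.aspx`, the form action shown in the page source, once per page.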

0 answers:

No answers