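I need to resolve 3 problems:
I am trying to write a simple Python program that parses the table with id="dgContract" from a web page.
I want to store page 1, page 2, page 3, ... page n of this listing in MongoDB, but I don't know how to work with MongoDB.
I want to parse the content of the "details" column and store its link in MongoDB. The "查看" (view) link in that column is a relative URL, so http://www.xxx.com/ has to be
prepended to it, giving something like http://www.xxx.com/LicConDisp.aspx?CID=xxxxx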
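For the third point, a minimal sketch of turning the relative "view" link into an absolute one with the standard library's `urljoin`. The host `http://www.xxx.com/` is the placeholder from the question, the CID is taken from the first data row of the sample page, and the helper name `absolute_link` is made up for illustration.

```python
try:
    from urllib.parse import urljoin   # Python 3
except ImportError:
    from urlparse import urljoin       # Python 2

BASE_URL = 'http://www.xxx.com/'       # placeholder host, as in the question


def absolute_link(href, base=BASE_URL):
    """Resolve a relative href from the table against the site root."""
    return urljoin(base, href)


detail_url = absolute_link('LicConDisp.aspx?CID=91e13d7a-e812-428d-a5c2-532778ea4e89')
print(detail_url)
```

With BeautifulSoup, the `href` itself can be read from the cell's anchor tag, for example `td.find('a')['href']`, before it is passed to the helper.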
My code:
import urllib2,cookielib,sys
import urllib,string
import cStringIO,Image,re
import BeautifulSoup # For processing HTML
from BeautifulSoup import BeautifulStoneSoup # For processing XML
from BeautifulSoup import BeautifulSoup
import configparser
from pymongo import Connection
import codecs
import sitecustomize
import chardet
host = 'localhost'
database = 'test'
collection = 'compinfo'
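# Read the saved page; the with-block closes the file even on errors.
with open('copy of out4.html', 'r') as f:
    html = f.read()
soup = BeautifulSoup(html)
table = soup.find('table', id="dgContract")
rows = table.findAll('tr')
store = []
for tr in rows:
    cols = tr.findAll('td')
    row = []
    for td in cols:
        # Join every text node in the cell. find(text=True) returns only the
        # first one, which is often just the whitespace before a nested <a>.
        row.append(''.join(td.findAll(text=True)).strip())
    store.append('|'.join(row))
print '\n'.join(store)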
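For the MongoDB part, one possible sketch (written for Python 3) of turning the per-row cell lists built by the loop above into documents and inserting them. The key names in `FIELDS`, the `page` key, and the helper names are hypothetical. `insert_many` is the pymongo 3+ method; `Connection` was replaced by `MongoClient` in newer pymongo releases, so the commented usage assumes pymongo 3+ and a MongoDB server on localhost.

```python
# Hypothetical English key names, one per column of the dgContract header row.
FIELDS = ['numb', 'user1', 'user2', 'author', 'received',
          'sent', 'detail', 'status', 'version']


def rows_to_documents(rows, page):
    """Build one dict per data row; `rows` is a list of cell-text lists."""
    docs = []
    for cells in rows[1:]:              # rows[0] is the header row
        if len(cells) != len(FIELDS):   # e.g. the pager row has one colspan cell
            continue
        doc = dict(zip(FIELDS, cells))
        doc['page'] = page              # remember which result page it came from
        docs.append(doc)
    return docs


def save_documents(collection, docs):
    """Insert the documents; `collection` only needs an insert_many method."""
    if docs:                            # insert_many rejects an empty list
        collection.insert_many(docs)
    return len(docs)


rows = [
    ['numb', '用户1', '用户2', '作者', '接受时间', '发送', '详情', '状态', 'version'],
    ['HOPE-HT-YX-S-140331-120', 'A公司', 'A学校', 'david',
     '2014-3-31', '未发送', '查看', '已结束[通过]', '1.0'],
    ['1 2 3'],                          # pager row, skipped
]
docs = rows_to_documents(rows, page=1)

# With pymongo 3+ installed and a server running (both assumptions):
# from pymongo import MongoClient
# save_documents(MongoClient('localhost')['test']['compinfo'], docs)
```

Storing the absolute detail URL instead of the link text "查看" would only require replacing the `detail` cell before the dict is built.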
But the output is as follows:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" >
<HEAD>
<title>查询</title>
<meta content="Microsoft Visual Studio .NET 7.1" name="GENERATOR">
<meta content="C#" name="CODE_LANGUAGE">
<meta content="JavaScript" name="vs_defaultClientScript">
<meta content="http://schemas.microsoft.com/intellisense/ie5" name="vs_targetSchema">
<meta http-equiv="X-UA-Compatible" content="IE=EmulateIE7" />
<link href="css/user.css" type="text/css" rel="stylesheet">
<style type="text/css">
.STYLE1 {FONT-SIZE: 12px; COLOR: #ffffff}
.STYLE2 {FONT-SIZE: 14px; COLOR: #000000}
.STYLE45 {FONT-SIZE: 12px}
.STYLE51 {FONT-WEIGHT: bold; FONT-SIZE: 12px; FONT-FAMILY: "宋体"}
.STYLE52 {FONT-WEIGHT: bold; FONT-SIZE: 12px; COLOR: #ffffff; FONT-FAMILY: "宋体"}
</style>
</HEAD>
<body background="images/bg.jpg" MS_POSITIONING="GridLayout">
<form name="Form1" method="post" action="ContractSearcher.aspx" id="Form1">
<div align="center">
<table borderColor="#c7c7c7" cellSpacing="0" cellPadding="0" border="1">
<tr>
<td class="tdBorder">
<!-- content -->
<!--显示用户信息条 -->
<!--内容主体:左侧为菜单,右侧为内容显示区 -->
<table height="350" cellSpacing="0" cellPadding="0" width="760" border="0">
<tr>
<!--左侧菜单项 -->
<td width="3"> </td>
<!--右侧内容显示区 -->
<td vAlign="top" width="815" bgColor="#ffffff">
<table width="100%">
<tr>
<td class="tdbigmidcenter">
<table class="tablebigContent" cellspacing="0" rules="all" border="1" id="dgContract" width="815">
<tr bgcolor="PapayaWhip">
<td>numb</td><td>用户1</td><td>用户2</td>
<td>作者</td>
<td align="center">接受时间</td><td align="center">发送</td>
<td align="center">详情</td>
<td align="center">状态</td>
<td>version</td>
</tr>
<tr>
<td width="21%">HOPE-HT-YX-S-140331-120</td><td width="14%">
A公司
</td><td width="14%">A学校</td><td width="5%">david</td><td align="center" width="10%">
<a href="#" title="2014-3-31 15:19:45">2014-3-31</a>
</td><td align="center" width="10%">
未发送
</td><td align="center" width="7%">
<a href="LicConDisp.aspx?CID=91e13d7a-e812-428d-a5c2-532778ea4e89" target="_blank">查看</a>
</td><td align="center" width="10%">
<a title="" href="ConStatusDisp.aspx?CID=91e13d7a-e812-428d-a5c2-532778ea4e89">已结束[<font color=red>通过</font>]</a>
</td><td align="center" width="5%">
1.0
</td>
</tr><tr>
<td width="21%">HOPE-HT-YX-S-140328-106</td>
<td width="14%">
A公司
</td>
<td width="14%">M公司</td><td width="5%">王明</td><td align="center" width="10%">
<a href="#" title="2014-3-28 16:16:53">2014-3-28</a>
</td><td align="center" width="10%">
未发货
</td><td align="center" width="7%">
<a href="LicConDisp.aspx?CID=72648278-dbe3-4577-9154-23182e349b33" target="_blank">查看</a>
</td><td align="center" width="10%">
<a title="" href="ConStatusDisp.aspx?CID=72648278-dbe3-4577-9154-23182e349b33">已结束[<font color=red>HOPECE140328-5 </font>]</a>
</td><td align="center" width="5%">
1.0
</td>
</tr><tr>
<td width="21%"> </td>
<td width="14%">
B公司
</td>
<td width="14%">C中心</td><td width="5%">王明</td><td align="center" width="10%">
<a href="#" title="2014-3-12 15:07:12">2014-3-12</a>
</td><td align="center" width="10%">
<a href="#" title="2014-3-28 10:28:37">2014-3-28</a><br>
[<font color=deeppink><strong>全<strong></font>]
</td><td align="center" width="7%">
<a href="LicConDisp.aspx?CID=0526a587-85dc-484e-88f4-87967546678f" target="_blank">查看</a>
</td><td align="center" width="10%">
<a title="9900014479" href="ConStatusDisp.aspx?CID=0526a587-85dc-484e-88f4-87967546678f">已结束[<font color=red>HOPETE140313-1 </font>][<font color=deeppink><strong>全<strong></font>]</a>
</td><td align="center" width="5%">
1.0
</td>
</tr><tr>
<td width="21%">HOPE-HT-YX-S-140306-001</td>
<td width="14%">
A公司
</td>
<td width="14%">A中心</td>
<td width="5%">JACK</td><td align="center" width="10%">
<a href="#" title="2014-3-7 9:48:47">2014-3-7</a>
</td><td align="center" width="10%">
<a href="#" title="2014-3-28 10:28:46">2014-3-28</a><br>
[<font color=deeppink><strong>全<strong></font>]
</td><td align="center" width="7%">
<a href="LicConDisp.aspx?CID=dfec1630-e1d4-478c-9feb-415eedbd6184" target="_blank">查看</a>
</td><td align="center" width="10%">
<a title="9900014479" href="ConStatusDisp.aspx?CID=dfec1630-e1d4-478c-9feb-415eedbd6184">已结束[<font color=red>HOPETE140317-4 </font>][<font color=deeppink><strong>全<strong></font>]</a>
</td><td align="center" width="5%">
1.0
</td>
</tr><tr>
<td width="21%">HOPE-HT-YX-S-140228-102</td>
<td width="14%">
G公司
</td>
<td width="14%">F公司</td>
<td width="5%">david</td><td align="center" width="10%">
未通过
</td><td align="center" width="10%">
未发货
</td><td align="center" width="7%">
<a href="LicConDisp.aspx?CID=9e19e1c9-7644-4392-9bdd-89e2bac346cd" target="_blank">查看</a>
</td><td align="center" width="10%">
<a title="" href="ConStatusDisp.aspx?CID=9e19e1c9-7644-4392-9bdd-89e2bac346cd">已作废</a>
</td><td align="center" width="5%">
1.0
</td>
</tr><tr>
<td width="21%">HOPE-HT-YX-S-140228-005</td>
<td width="14%">
T公司
</td>
<td width="14%">J公司 </td>
<td width="5%">jack</td><td align="center" width="10%">
<a href="#" title="2014-2-28 14:54:26">2014-2-28</a>
</td><td align="center" width="10%">
<a href="#" title="2014-3-28 10:55:11">2014-3-28</a><br>
[<font color=deeppink><strong>全<strong></font>]
</td><td align="center" width="7%">
<a href="LicConDisp.aspx?CID=45039bfb-ccb8-49f4-b8fe-27bc8cf59803" target="_blank">查看</a>
</td><td align="center" width="10%">
<a title="9900014480" href="ConStatusDisp.aspx?CID=45039bfb-ccb8-49f4-b8fe-27bc8cf59803">已结束[<font color=red>HOPECE140228-10</font>][<font color=deeppink><strong>全<strong></font>]</a>
</td><td align="center" width="5%">
1.0
</td>
</tr><tr>
<td width="21%">HOPE-HT-YX-S-140228-002</td>
<td width="14%">
S公司
</td>
<td width="14%">V公司</td>
<td width="5%">张军</td><td align="center" width="10%">
<a href="#" title="2014-2-28 14:13:23">2014-2-28</a>
</td><td align="center" width="10%">
<a href="#" title="2014-3-28 10:28:54">2014-3-28</a><br>
[<font color=deeppink><strong>全<strong></font>]
</td><td align="center" width="7%">
<a href="LicConDisp.aspx?CID=02a8d406-a826-4a5a-b466-f4bca2640307" target="_blank">查看</a>
</td><td align="center" width="10%">
<a title="9900014479" href="ConStatusDisp.aspx?CID=02a8d406-a826-4a5a-b466-f4bca2640307">已结束[<font color=red>HOPETE140307-4 </font>][<font color=deeppink><strong>全<strong></font>]</a>
</td><td align="center" width="5%">
1.0
</td>
</tr><tr>
<td width="21%"> </td>
<td width="14%">
A公司
</td>
<td width="14%">W公司</td><td width="5%">jack</td><td align="center" width="10%">
<a href="#" title="2014-2-28 14:13:38">2014-2-28</a>
</td><td align="center" width="10%">
<a href="#" title="2014-3-28 10:29:03">2014-3-28</a><br>
[<font color=deeppink><strong>全<strong></font>]
</td><td align="center" width="7%">
<a href="LicConDisp.aspx?CID=2684c70a-baea-4da4-911b-19cdbe627fd9" target="_blank">查看</a>
</td><td align="center" width="10%">
<a title="9900014479" href="ConStatusDisp.aspx?CID=2684c70a-baea-4da4-911b-19cdbe627fd9">已结束[<font color=red>HOPETE140307-3 </font>][<font color=deeppink><strong>全<strong></font>]</a>
</td><td align="center" width="5%">
1.0
</td>
</tr><tr>
<td width="21%">HOPE-HT-YX-S-140228-013</td>
<td width="14%">
B公司
</td><td width="14%">V公司</td>
<td width="5%">rose</td><td align="center" width="10%">
<a href="#" title="2014-2-28 14:19:28">2014-2-28</a>
</td><td align="center" width="10%">
<a href="#" title="2014-3-28 10:28:23">2014-3-28</a><br>
[<font color=deeppink><strong>全<strong></font>]
</td><td align="center" width="7%">
<a href="LicConDisp.aspx?CID=1204ad5e-4552-43af-a650-19b93f9d2514" target="_blank">查看</a>
</td><td align="center" width="10%">
<a title="9900014479" href="ConStatusDisp.aspx?CID=1204ad5e-4552-43af-a650-19b93f9d2514">已结束[<font color=red>HOPETE140307-2 </font>][<font color=deeppink><strong>全<strong></font>]</a>
</td><td align="center" width="5%">
1.0
</td>
</tr><tr>
<td width="21%">HOPE-HT-YX-S-140226-018</td>
<td width="14%">
C公司
</td><td width="14%">A中心</td>
<td width="5%">david</td><td align="center" width="10%">
<a href="#" title="2014-2-26 14:56:14">2014-2-26</a>
</td><td align="center" width="10%">
<a href="#" title="2014-3-14 15:27:44">2014-3-14</a><br>
[<font color=deeppink><strong>全<strong></font>]
</td><td align="center" width="7%">
<a href="LicConDisp.aspx?CID=a04cdcd2-5b5c-4182-a22d-a29399ab6991" target="_blank">查看</a>
</td><td align="center" width="10%">
<a title="9900014388" href="ConStatusDisp.aspx?CID=a04cdcd2-5b5c-4182-a22d-a29399ab6991">已结束[<font color=red>HOPECE140228-1 </font>][<font color=deeppink><strong>全<strong></font>]</a>
</td><td align="center" width="5%">
1.0
</td>
</tr><tr align="right">
<td colspan="9"><span>1</span> <a href="javascript:__doPostBack('dgContract$_ctl14$_ctl1','')">2</a> <a href="javascript:__doPostBack('dgContract$_ctl14$_ctl2','')">3</a></td>
</tr>
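For the paging part (page 1, page 2, ... page n), the last table row above holds the pager links, each of which calls `__doPostBack` with an event target. A rough sketch of pulling those targets out with a regular expression follows; the names `PAGER_LINK` and `pager_targets` are made up for illustration. How the targets are sent back to the server is not shown in the page above. On typical ASP.NET WebForms pages they are posted as the `__EVENTTARGET` and `__EVENTARGUMENT` form fields together with the hidden `__VIEWSTATE` value, but that is an assumption here and would need checking against the real form.

```python
import re

# Matches pager anchors such as
# <a href="javascript:__doPostBack('dgContract$_ctl14$_ctl1','')">2</a>
PAGER_LINK = re.compile(
    r"""<a\s+href="javascript:__doPostBack\('([^']*)','([^']*)'\)">\s*(\d+)\s*</a>"""
)


def pager_targets(html):
    """Map page number -> (event target, event argument) for each pager link."""
    return {int(label): (target, arg)
            for target, arg, label in PAGER_LINK.findall(html)}


pager_row = (
    """<td colspan="9"><span>1</span> """
    """<a href="javascript:__doPostBack('dgContract$_ctl14$_ctl1','')">2</a> """
    """<a href="javascript:__doPostBack('dgContract$_ctl14$_ctl2','')">3</a></td>"""
)
targets = pager_targets(pager_row)
for page in sorted(targets):
    print(page, targets[page])
```

The current page appears as a plain `<span>` rather than a link, so it has no entry in the result.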