Saving BeautifulSoup output to MongoDB and loading it again

Date: 2014-03-22 08:54:44

Tags: python mongodb beautifulsoup

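I have a scraper that fetches certain web pages for my application. I want to separate concerns: the crawler should be "dumb", just fetching the page, turning the BeautifulSoup output into JSON, and saving that to MongoDB.

Other workers should then read the MongoDB documents and extract the relevant information into a relational model.

The question is how to safely convert a BeautifulSoup object to JSON (a MongoDB document) and back again without errors.

Edit: an illustration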

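 # Python 2 (urllib2); the output below comes from this version
 import urllib2
 import json
 from bs4 import BeautifulSoup
 req = urllib2.Request('http://www.google.com')
 res = urllib2.urlopen(req)
 soup = BeautifulSoup(res.read())
 # collect every text node in the page; tags and attributes are dropped
 content = soup.findAll(text=True)
 # serialize the list of text nodes to a JSON string
 soup_json = json.dumps(content)
 soup_json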

Output:

'["doctype html", "Google", "(function(){\\nwindow.google={kEI:\\"LGktU9bfHqHk4wT1poGoAg\\",getEI:function(a){for(var b;a&&(!a.getAttribute||!(b=a.getAttribute(\\"eid\\")));)a=a.parentNode;return b||google.kEI},https:function(){return\\"https:\\"==window.location.protocol},kEXPI:\\"4006,17259,4000116,4007661,4007830,4008067,4008133,4008142,4009033,4009352,4009565,4009641,4010297,4010806,4010858,4010899,4011228,4011258,4011679,4011959,4012373,4012504,4012507,4013338,4013374,4013414,4013416,4013591,4013723,4013747,4013787,4013823,4013967,4013979,4014016,4014431,4014515,4014636,4014649,4014671,4014792,4014804,4014813,4014991,4015119,4015155,4015195,4015234,4015260,4015320,4015444,4015497,4015514,4015582,4015589,4015637,4015638,4015640,4015690,4015772,4015853,4015904,4015991,4015995,4016007,4016047,4016062,4016139,4016167,4016193,4016304,4016311,4016407,8300007,8300015,8300018,8500149,8500157,10200002,10200012,10200029,10200030,10200040,10200045,10200048,10200053,10200055,10200066,10200083,10200103,10200120,10200134,10200157\\",kCSI:{e:\\"4006,17259,4000116,4007661,4007830,4008067,4008133,4008142,4009033,4009352,4009565,4009641,4010297,4010806,4010858,4010899,4011228,4011258,4011679,4011959,4012373,4012504,4012507,4013338,4013374,4013414,4013416,4013591,4013723,4013747,4013787,4013823,4013967,4013979,4014016,4014431,4014515,4014636,4014649,4014671,4014792,4014804,4014813,4014991,4015119,4015155,4015195,4015234,4015260,4015320,4015444,4015497,4015514,4015582,4015589,4015637,4015638,4015640,4015690,4015772,4015853,4015904,4015991,4015995,4016007,4016047,4016062,4016139,4016167,4016193,4016304,4016311,4016407,8300007,8300015,8300018,8500149,8500157,10200002,10200012,10200029,10200030,10200040,10200045,10200048,10200053,10200055,10200066,10200083,10200103,10200120,10200134,10200157\\",ei:\\"LGktU9bfHqHk4wT1poGoAg\\"},authuser:0,ml:function(){},kHL:\\"iw\\",time:function(){return(new Date).getTime()},log:function(a,b,c,h,k){var d=\\nnew 
Image,f=google.lc,e=google.li,g=\\"\\";d.onerror=d.onload=d.onabort=function(){delete f[e]};f[e]=d;c||-1!=b.search(\\"&ei=\\")||(g=\\"&ei=\\"+google.getEI(h));c=c||\\"/\\"+(k||\\"gen_204\\")+\\"?atyp=i&ct=\\"+a+\\"&cad=\\"+b+g+\\"&zx=\\"+google.time();a=/^http:/i;a.test(c)&&google.https()?(google.ml(Error(\\"GLMM\\"),!1,{src:c}),delete f[e]):(d.src=c,google.li=e+1)},lc:[],li:0,y:{},x:function(a,b){google.y[a.id]=[a,b];return!1},load:function(a,b,c){google.x({id:a+l++},function(){google.load(a,b,c)})}};var l=0;})();\\n(function(){google.sn=\\"webhp\\";google.timers={};google.startTick=function(a,b){google.timers[a]={t:{start:google.time()},bfr:!!b}};google.tick=function(a,b,g){google.timers[a]||google.startTick(a);google.timers[a].t[b]=g||google.time()};google.startTick(\\"load\\",!0);\\ntry{}catch(d){}})();\\nvar _gjwl=location;function _gjuc(){var a=_gjwl.href.indexOf(\\"#\\");if(0<=a&&(a=_gjwl.href.substring(a),0<a.indexOf(\\"&q=\\")||0<=a.indexOf(\\"#q=\\"))&&(a=a.substring(1),-1==a.indexOf(\\"#\\"))){for(var d=0;d<a.length;){var b=d;\\"&\\"==a.charAt(b)&&++b;var c=a.indexOf(\\"&\\",b);-1==c&&(c=a.length);b=a.substring(b,c);if(0==b.indexOf(\\"fp=\\"))a=a.substring(0,d)+a.substring(c,a.length),c=d;else if(\\"cad=h\\"==b)return 0;d=c}_gjwl.href=\\"/search?\\"+a+\\"&cad=h\\";return 1}return 0}\\nfunction _gjh(){!_gjuc()&&window.google&&google.x&&google.x({id:\\"GJH\\"},function(){google.nav&&google.nav.gjh&&google.nav.gjh()})};\\nwindow._gjh&&_gjh();", "#gbar,#guser{font-size:13px;padding-top:1px !important;}#gbar{height:22px}#guser{padding-bottom:7px !important;text-align:left}.gbh,.gbd{border-top:1px solid #c9d7f1;font-size:1px}.gbh{height:0;position:absolute;top:24px;width:100%}@media all{.gb1{height:22px;margin-left:.5em;vertical-align:top}#gbar{float:right}}a.gb1,a.gb4{text-decoration:underline !important}a.gb1,a.gb4{color:#00c !important}.gbi .gb4{color:#dd8e27 !important}.gbf .gb4{color:#900 !important}", 
"body,td,a,p,.h{font-family:arial,sans-serif}body{margin:0;overflow-y:scroll}#gog{padding:3px 8px 0}td{line-height:.8em}.gac_m td{line-height:17px}form{margin-bottom:20px}.h{color:#36c}.q{color:#00c}.ts td{padding:0}.ts{border-collapse:collapse}em{font-weight:bold;font-style:normal}.lst{height:25px;width:496px}.gsfi,.lst{font:18px arial,sans-serif}.gsfs{font:17px arial,sans-serif}.ds{display:inline-box;display:inline-block;margin:3px 0 4px;margin-right:4px}input{font-family:inherit}a.gb1,a.gb2,a.gb3,a.gb4{color:#11c !important}body{background:#fff;color:black}a{color:#11c;text-decoration:none}a:hover,a:active{text-decoration:underline}.fl a{color:#36c}a:visited{color:#551a8b}a.gb1,a.gb4{text-decoration:underline}a.gb3:hover{text-decoration:none}#ghead a.gb2:hover{color:#fff !important}.sblc{padding-top:5px}.sblc a{display:block;margin:2px 0;margin-right:13px;font-size:11px}.lsbb{background:#eee;border:solid 1px;border-color:#ccc #ccc #999 #999;height:30px}.lsbb{display:block}.ftl,#fll a{display:inline-block;margin:0 12px}.lsb{background:url(/images/srpr/nav_logo80.png) 0 -258px repeat-x;border:none;color:#000;cursor:pointer;height:30px;margin:0;outline:0;font:15px arial,sans-serif;vertical-align:top}.lsb:active{background:#ccc}.lst:focus{outline:none}#addlang a{padding:0 3px}.tiah{width:458px}", "(function(){var src=\'/images/nav_logo176.png\';var iesg=false;document.body.onload = function(){window.n && window.n();if (document.images){new Image().src=src;}\\nif (!iesg){document.f&&document.f.q.focus();document.gbqf&&document.gbqf.q.focus();}\\n}\\n})();", " ", "\\u00e7\\u00e9\\u00f4\\u00e5\\u00f9", " ", "\\u00fa\\u00ee\\u00e5\\u00f0\\u00e5\\u00fa", " ", "\\u00ee\\u00f4\\u00e5\\u00fa", " ", "YouTube", " ", "\\u00e7\\u00e3\\u00f9\\u00e5\\u00fa", " ", "Gmail", " ", "Drive", " ", "\\u00e9\\u00e5\\u00ee\\u00ef", " ", "\\u00f2\\u00e5\\u00e3", " \\u00bb", "\\u00e4\\u00e9\\u00f1\\u00e8\\u00e5\\u00f8\\u00e9\\u00e9\\u00fa \\u00e0\\u00fa\\u00f8\\u00e9\\u00ed", " | ", 
"\\u00e4\\u00e2\\u00e3\\u00f8\\u00e5\\u00fa", " | ", "\\u00e4\\u00e9\\u00eb\\u00f0\\u00f1", " ", "\\u00e9\\u00f9\\u00f8\\u00e0\\u00ec", "\\u00a0", "\\u00e7\\u00e9\\u00f4\\u00e5\\u00f9 \\u00ee\\u00fa\\u00f7\\u00e3\\u00ed", "\\u00eb\\u00ec\\u00e9 \\u00f9\\u00f4\\u00e4", "Google.co.il \\u00e2\\u00ed \\u00e1: ", "\\u0627\\u0644\\u0639\\u0631\\u0628\\u064a\\u0629", " ", "English", " \\u00f4\\u00f8\\u00f1\\u00e5\\u00ed \\u00e1-Google", "\\u00f4\\u00fa\\u00f8\\u00e5\\u00f0\\u00e5\\u00fa \\u00f2\\u00f1\\u00f7\\u00e9\\u00e9\\u00ed", "\\u00e4\\u00eb\\u00ec \\u00e0\\u00e5\\u00e3\\u00e5\\u00fa Google", "Google.com", "\\u00a9 2013 - ", "\\u00f4\\u00f8\\u00e8\\u00e9\\u00e5\\u00fa \\u00e5\\u00fa\\u00f0\\u00e0\\u00e9\\u00ed", "if(google.y)google.y.first=[];(function(){function b(a){window.setTimeout(function(){var c=document.createElement(\\"script\\");c.src=a;document.getElementById(\\"xjsd\\").appendChild(c)},0)}google.dljp=function(a){google.xjsu=a;b(a)};google.dlj=b;})();\\nif(!google.xjs){window._=window._||{};window._._DumpException=function(e){throw e};if(google.timers&&google.timers.load.t){google.timers.load.t.xjsls=new Date().getTime();}google.dljp(\'/xjs/_/js/k\\\\x3dxjs.hp.en_US.X67G-1Nbjpc.O/m\\\\x3dsb_he,pcc/rt\\\\x3dj/d\\\\x3d1/sv\\\\x3d1/rs\\\\x3dAItRSTO_vkVhEK6twEUdYclvmSrFcRL-Zw\');google.xjs=1;}google.pmc={\\"sb_he\\":{\\"agen\\":true,\\"cgen\\":true,\\"client\\":\\"heirloom-hp\\",\\"dh\\":true,\\"ds\\":\\"\\",\\"eqch\\":true,\\"fl\\":true,\\"host\\":\\"google.co.il\\",\\"jsonp\\":true,\\"msgs\\":{\\"dym\\":\\"\\u00e4\\u00e0\\u00ed \\u00e4\\u00fa\\u00eb\\u00e5\\u00e5\\u00f0\\u00fa \\u00ec:\\",\\"lcky\\":\\"\\u00e9\\u00e5\\u00fa\\u00f8 \\u00ee\\u00e6\\u00ec \\u00ee\\u00f9\\u00eb\\u00ec\\",\\"lml\\":\\"\\u00ec\\u00ee\\u00e9\\u00e3\\u00f2 \\u00f0\\u00e5\\u00f1\\u00f3\\",\\"oskt\\":\\"\\u00eb\\u00ec\\u00e9 \\u00e4\\u00e6\\u00f0\\u00e4\\",\\"psrc\\":\\"\\u00e7\\u00e9\\u00f4\\u00e5\\u00f9 \\u00e6\\u00e4 \\u00e4\\u00e5\\u00f1\\u00f8 \\u00ee\\\\u003Ca 
href=\\\\\\"/history\\\\\\"\\\\u003E\\u00e4\\u00e9\\u00f1\\u00e8\\u00e5\\u00f8\\u00e9\\u00e9\\u00fa \\u00e4\\u00e0\\u00e9\\u00f0\\u00e8\\u00f8\\u00f0\\u00e8\\\\u003C/a\\\\u003E \\u00f9\\u00ec\\u00ea\\",\\"psrl\\":\\"\\u00e4\\u00f1\\u00f8\\",\\"sbit\\":\\"\\u00e7\\u00f4\\u00f9 \\u00ec\\u00f4\\u00e9 \\u00fa\\u00ee\\u00e5\\u00f0\\u00e4\\",\\"srch\\":\\"\\u00e7\\u00e9\\u00f4\\u00e5\\u00f9 \\u00e1-Google\\"},\\"ovr\\":{},\\"pq\\":\\"\\",\\"qcpw\\":false,\\"scd\\":10,\\"sce\\":5,\\"stok\\":\\"AVgtYJUWkObPx6V5QqvD7hitdNE\\"},\\"pcc\\":{}};google.y.first.push(function(){if(google.med){google.med(\'init\');google.initHistory();google.med(\'history\');}});if(google.j&&google.j.en&&google.j.xi){window.setTimeout(google.j.xi,0);}", "(function(){if(google.timers&&google.timers.load.t){var b,c,d,e,g=function(a,f){a.removeEventListener?(a.removeEventListener(\\"load\\",f,!1),a.removeEventListener(\\"error\\",f,!1)):(a.detachEvent(\\"onload\\",f),a.detachEvent(\\"onerror\\",f))},h=function(a){e=(new Date).getTime();++c;a=a||window.event;a=a.target||a.srcElement;g(a,h)},k=document.getElementsByTagName(\\"img\\");b=k.length;for(var l=c=0,m;l<b;++l)m=k[l],m.complete||\\"string\\"!=typeof m.src||!m.src?++c:m.addEventListener?(m.addEventListener(\\"load\\",h,!1),m.addEventListener(\\"error\\",\\nh,!1)):(m.attachEvent(\\"onload\\",h),m.attachEvent(\\"onerror\\",h));d=b-c;var n=function(){if(google.timers.load.t){google.timers.load.t.ol=(new Date).getTime();google.timers.load.t.iml=e;google.kCSI.imc=c;google.kCSI.imn=b;google.kCSI.imp=d;void 0!==google.stt&&(google.kCSI.stt=google.stt);google.csiReport&&google.csiReport()}};window.addEventListener?window.addEventListener(\\"load\\",n,!1):window.attachEvent&&\\nwindow.attachEvent(\\"onload\\",n);google.timers.load.t.prt=e=(new Date).getTime()};})();\\n"]'

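This JSON should be stored in MongoDB in a way that lets me rebuild a BeautifulSoup object from it later.

1 Answer:

Answer 0 (score: 3)

FYI, you really don't need to build the soup before storing the page in Mongo (or any other database).

Here is my reasoning: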

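(1) Once you turn the page into a soup, it is an instance of the 'bs4.BeautifulSoup' class, and when you store it in Mongo it has to be text, whether JSON or some other format. The next time you pull it from the database you have to call BeautifulSoup again to rebuild the soup from that string/JSON, so the page ends up being BeautifulSoup-ed twice.

(2) A soup is basically an XML-like tree built from the HTML page. BeautifulSoup parses the tree, sometimes repairs broken or missing tags, and does other "smart stuff" you may not want, so it can slightly modify the HTML page. For example, you can get different results depending on which parser you use ("lxml", "html5lib", ...). Running BeautifulSoup before storing the data can therefore mess things up.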
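To make the information-loss point concrete, here is a minimal Python 3 sketch. It assumes the standard library's html.parser as a stand-in for soup.findAll(text=True) so that it runs without bs4 installed; the TextNodes class and the sample page are made up for illustration.

```python
import json
from html.parser import HTMLParser


class TextNodes(HTMLParser):
    """Collects text nodes only (hypothetical stand-in for soup.findAll(text=True))."""

    def __init__(self):
        super().__init__()
        self.nodes = []

    def handle_data(self, data):
        # Called once per run of text between tags
        self.nodes.append(data)


raw_html = '<html><body><p class="intro">Hello <b>world</b></p></body></html>'

collector = TextNodes()
collector.feed(raw_html)
collector.close()

# Round-trip 1: the list of text nodes survives JSON, but the markup is gone,
# so the original tree cannot be rebuilt from it.
text_nodes = json.loads(json.dumps(collector.nodes))

# Round-trip 2: the raw page stored as one string comes back unchanged,
# ready to be handed to a parser later.
restored_html = json.loads(json.dumps({"html": raw_html}))["html"]
```

Only the second round-trip leaves enough information to parse the page again, which is the case for storing the raw string.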

In short: I recommend storing the raw HTML content without doing any work on it. The simplest way is to build a document in the following format:

{"url":"www.xxx.com/..", "html":"<!DOCTYPE html>...."}

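In this case you are basically mirroring/indexing the site on your local machine, and no information is lost.

Here is some code that should help you store and retrieve HTML with Mongo: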
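One detail worth sketching for this document shape: in Python 3 a response body arrives as bytes, while the "html" field reads most naturally as text. The helper below is a hypothetical illustration (make_page_document and its UTF-8 default are assumptions, not part of any library); it decodes the body and returns a plain dict that serializes cleanly.

```python
import json


def make_page_document(url, body, encoding="utf-8"):
    """Build a {"url": ..., "html": ...} document from a fetched response body.

    `encoding` is an assumed default; a real crawler would normally take it
    from the response headers.
    """
    if isinstance(body, bytes):
        # errors="replace" keeps the crawler from crashing on a bad byte,
        # at the cost of no longer being byte-exact for that page
        html = body.decode(encoding, errors="replace")
    else:
        html = body
    return {"url": url, "html": html}


doc = make_page_document(
    "http://www.example.com/",
    b"<!DOCTYPE html><html><body>caf\xc3\xa9</body></html>",
)

# The document is plain text and survives a JSON round-trip unchanged
round_tripped = json.loads(json.dumps(doc))
```

If byte-exact mirroring matters more than convenience, storing the undecoded bytes alongside the declared charset is an alternative design.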

>>> from pymongo import MongoClient
>>> client = MongoClient('localhost', 27017)
>>> db = client.oleg
>>> 
>>> # get the raw html 
... url = "http://www.crummy.com/software/BeautifulSoup/bs4/doc/#"
>>> import urllib2
>>> html = urllib2.urlopen(url).read()
>>> html[:100]
'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"\n  "http://www.w3.org/TR/xhtml1/DTD/xh'
>>> 
>>>
>>> # store the <key:value> -> <url:html> into mongo for later use
... db.tikhonov.insert({"url":url, "html":html})
ObjectId('532e6904866cd3431a90c618')
>>>
>>> # retrieve the stored html by search the url
... record = db.tikhonov.find_one({"url":url})
>>> record['url']
u'http://www.crummy.com/software/BeautifulSoup/bs4/doc/#'
>>> 
>>> # turn html txt into soup and start parsing
... from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(record['html'])
>>> soup.find("h1").text
u'Beautiful Soup Documentation\xb6' 
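The transcript above is Python 2 with the older PyMongo insert() call, which was deprecated in PyMongo 3 and removed in PyMongo 4. Below is a rough Python 3 sketch of the same store/retrieve flow. It assumes only that the collection object offers insert_one() and find_one(), which PyMongo collections do; InMemoryCollection is a made-up stand-in so the sketch runs without a MongoDB server, and fetch_html, store_page and load_page are hypothetical helper names.

```python
from urllib.request import urlopen


class InMemoryCollection:
    """Hypothetical stand-in exposing the two collection methods used below."""

    def __init__(self):
        self._docs = []

    def insert_one(self, document):
        self._docs.append(dict(document))

    def find_one(self, query):
        # Return the first document whose fields equal every key in the query
        for doc in self._docs:
            if all(doc.get(key) == value for key, value in query.items()):
                return dict(doc)
        return None


def fetch_html(url, timeout=10):
    """Download a page as text (assumes UTF-8 when the server names no charset)."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def store_page(collection, url, html):
    """Store the raw page under its URL, untouched by any parser."""
    collection.insert_one({"url": url, "html": html})


def load_page(collection, url):
    """Return the stored HTML for a URL, or None if it was never stored."""
    record = collection.find_one({"url": url})
    return None if record is None else record["html"]


# Offline demo: a literal page instead of a network fetch
pages = InMemoryCollection()
store_page(pages, "http://www.example.com/", "<html><body><h1>Hi</h1></body></html>")
html = load_page(pages, "http://www.example.com/")
```

With a running server, passing MongoClient('localhost', 27017).oleg.tikhonov instead of InMemoryCollection() should work the same way, and the returned string can go straight into BeautifulSoup(html, "html.parser").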

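PS: Separating the "fetch the HTML" step from the "parse" step is a great idea. Since the HTTP requests usually take the most time, you can start collecting raw HTML pages right away, without any parsing, while you write and test your parser.

Before scraping or storing intellectual property locally, be sure to double check the terms of service.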
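As a rough illustration of the "parse" side of that split, here is a sketch of a worker that reads stored {"url", "html"} documents and writes one extracted field into a relational table. Everything in it is an assumption for the sake of the example: the pages table and its columns, the choice of the page title as the field to extract, an in-memory SQLite database as the relational store, and the standard library's html.parser in place of BeautifulSoup so the sketch runs with no extra packages.

```python
import sqlite3
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Captures the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        if tag == "title" and self.title is None:
            self._in_title = True
            self.title = ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title(html):
    """Return the page title, or None when the page has no <title>."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title


def parse_stored_pages(documents, connection):
    """Parse each stored document and upsert one row per URL."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, title TEXT)"
    )
    for doc in documents:
        connection.execute(
            "INSERT OR REPLACE INTO pages (url, title) VALUES (?, ?)",
            (doc["url"], extract_title(doc["html"])),
        )
    connection.commit()


# Offline demo: literal documents in the shape the crawler would have stored
stored = [
    {"url": "http://a.example/", "html": "<html><head><title>Page A</title></head></html>"},
    {"url": "http://b.example/", "html": "<html><body>no title here</body></html>"},
]
conn = sqlite3.connect(":memory:")
parse_stored_pages(stored, conn)
rows = conn.execute("SELECT url, title FROM pages ORDER BY url").fetchall()
```

Because the raw HTML stays in the document store, the parser can be rewritten and rerun over the same pages without fetching anything again; documents can be any iterable of dicts, such as the results of a find() query.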