Extracting plain text from a URL in Python with BeautifulSoup, but the result is still not clean

Time: 2016-03-21 10:31:05

Tags: python beautifulsoup

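I am trying to extract the plain text of a given URL. From my searching, the most relevant tool seems to be BeautifulSoup, so I wrote a simple program to test it. However, it still does not do what I need: the result contains a lot of content that is not plain text.

You can run the following Python code to see the result.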

import urllib
url = "http://www.amfastech.com/2015/07/lenovo-k3-note-brutally-honest-review-specifications-pros-cons.html"
html = urllib.urlopen(url).read().decode('utf8')

from bs4 import BeautifulSoup
raw = BeautifulSoup(html).get_text()

When you look at raw, the result contains code like the following:

 (function() { (function(){function
 c(a){this.t={};this.tick=function(a,c,b){var d=void 0!=b?b:(new
 Date).getTime();this.t[a]=[d,c];if(void
 0==b)try{window.console.timeStamp("CSI/"+a)}catch(e){}};this.tick("start",null,a)}var
 a;window.performance&&(a=window.performance.timing);var h=a?new
 c(a.responseStart):new c;window.jstiming={Timer:c,load:h};if(a){var
 b=a.navigationStart,e=a.responseStart;0<b&&e>=b&&(window.jstiming.srt=e-b)}if(a){var
 d=window.jstiming.load;0<b&&e>=b&&(d.tick("_wtsrt",void
 0,b),d.tick("wtsrt_",
 "_wtsrt",e),d.tick("tbsd_","wtsrt_"))}try{a=null,window.chrome&&window.chrome.csi&&(a=Math.floor(window.chrome.csi().pageT),d&&0<b&&(d.tick("_tbnd",void
 0,window.chrome.csi().startE),d.tick("tbnd_","_tbnd",b))),null==a&&window.gtbExternal&&(a=window.gtbExternal.pageT()),null==a&&window.external&&(a=window.external.pageT,d&&0<b&&(d.tick("_tbnd",void
 0,window.external.startE),d.tick("tbnd_","_tbnd",b))),a&&(window.jstiming.pt=a)}catch(k){}})();window.tickAboveFold=function(c){var
 a=0;if(c.offsetParent){do
 a+=c.offsetTop;while(c=c.offsetParent)}c=a;750>=c&&window.jstiming.load.tick("aft")};var
 f=!1;function
 g(){f||(f=!0,window.jstiming.load.tick("firstScrollTime"))}window.addEventListener?window.addEventListener("scroll",g,!1):window.attachEvent("onscroll",g);
 })();

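So my question is: how can I really get clean plain text out of HTML with Python? I see many web tools that support a so-called reading view mode, where in most cases you see only the main article, so I assumed that extracting clean plain text should not be a problem. Thanks!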

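2 answers:

Answer 0 (score: 1)

Well, you are using BeautifulSoup the wrong way. To extract your text, you should not simply grab the raw text of the whole page. BS is not a magic wand that guesses which part of the page you need; it has to be told what to do. So you should look for the classes and ids of the elements you want to extract: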

>>> bs.find_all('h1')[0].getText()
u'\nLenovo K3 Note Brutally Honest Review: Specifications, Pros and Cons\n'
>>> bs.find_all(attrs={'class': 'post-body', 'class': 'entry-content'})[0].getText()
u'\n\n\n\n\n\n(adsbygoogle = window.adsbygoogle || []).push({});\n\n\nIt seems like Lenovo has finally caught the pulse of smartphone market in countries like India. After the successful launch of A6000, 6000+ and A7000, the company has come up with something big, both psychically and performance wise, with a name k3 note.The term \u2018Note\u2019 itself reminds us of the large phones which was actually been started mentioning by Samsung for its phablets. Like all other smartphone manufacturer companies, Lenovo also took up the term for its new boy.In this review, I\u2019ll be discussing the specifications of the K3 Note phablet in the price point of view and will be discussing the pros and cons of this device honestly brutally honestly.Let\u2019s begin! In the boxAlong with the handset, you will get a screen guard (non-tamper proof), 2-pin wall mounted charger, USB cable and removable battery in the box. K3 Note will not be accompanied by the headset in the box. That\u2019s somewhat upsetting to see A7000 coming with one and K3 Note with none. DesignNo actual changes were made to the physical design of Lenovo K3 Note compared to its predecessor, A7000. In fact, you will not see the difference between the two devices physically when kept side-by-side. \xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0 \xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0 \xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0 The screen size, body, camera, flash and speaker, buttons and slots are in the same position as A7000. K3 Note\u2019s physical design looks as good as A7000 but not build that tough. The body has low build quality and it can easily be broken under the appliance of little \u2018more\u2019 pressure. DisplayLenovo K3 Note comes with 5.5 inch Full HD IPS display that can render 401 pixels per inch (PPI) on 1080P resolution display.The screen contributes 72% to the body ratio thus making it a large screen-less body device. 
The best viewing angles of the screen has specified to be 178 degrees and it has 5-point touch sensor that can recognize 5-touch points simultaneously. Processor & RAMLenovo K3 Note comes with 1.7 GHz MediaTek Cortex A53 64-bit processor which is 0.2GHz faster than Lenovo A7000. The 2 GB RAM supports the processor at its best in multi-tasking.The combo is supported with ARM Mali-T760 MP2 GPU which is not so different to A7000\u2019s. You can experience good 3D gaming with this GPU configuration in parallel with the processor and RAM. MemoryK3 Note comes with 16 GB built-in ROM and allows users to expand the memory up to 32 GB through microSD card. This is an upgraded feature when compared to Lenovo A7000\u2019s 8 GB ROM.  Operating SystemK3 Note runs on Android Lollipop v5.0 which is not even 5.0.2. It is sad to see Lenovo\u2019s next product, after A7000 coming with v5.0. It is expected to get Android Lollipop v5.1 in future. CameraLenovo has upgraded the rear camera for K3 Note from 8MP to 13MP. The dual tone LED flash helps to take best shots in both lighting conditions. The camera is added with some new shooting modes compared to A7000. It can record full HD\xa01080P resolution videos with 30 frames per second rate.The front camera can take 5MP sharp photos and it is good enough to take best selfies.K3 Note\u2019s camera specifications are satisfying for its price range. ConnectivityIt supports 4G LTE networks in both the slots and have the same Wi-Fi, Bluetooth and OTG support specifications that A7000 came up with. BatteryLenovo K3 Note has got 2900mAh powered battery which can hold the charging on moderate usage for 24 hours at most. The 1080P screen absorbs the juice quickly and so it cannot last as long as A7000. 
Pros  A bit more fast processor  Upgraded camera  More internal memory  Full HD screen  Full HD recording  Removable battery Cons  Low built quality body  Same design as A7000  No Lollipop v5.0.2 at least  No Gorilla Glass 3 protection  High SAR values 1.590W/KG for head and 0.688W/KG for body Update: Unboxing photos (shared by a fan exclusively for Amfas Tech) \xa0  For more photos: Check out Lenovo K3 Note album on our Facebook page. \xa0 Final VerdictLenovo K3 Note has got some improvements like 16 GB internal storage, 1080P screen and video recording, little faster processor. The rest of the phone is a quite replica of Lenovo A7000. It could have been named as \u2018Lenovo A7000 Plus\u2019 instead of \u2018K3 Note\u2019.After looking at the specifications and advancements, Lenovo K3 Note for such a low price of 9,999 INR is a great deal. If you are planning to buy A7000, dare 1,000 bucks more for K3 Note and you will get a damn good phone for that price (statement made keeping price in mind).Note: If you talk more on phone, think a while choosing this phone as its SAR values are very highly specified.\n\n\n\n\n\n(adsbygoogle = window.adsbygoogle || []).push({});\n\n\n\n\n\n(adsbygoogle = window.adsbygoogle || []).push({});\n\nPlease share this article if you like it! Bless me or curse me in comments! Thank you for reading anyway!\n\n\n\n\n'

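There is still some cleaning to do (mostly because of the ad JS inside the text), but it is mostly there. You need to decide which tags/classes/IDs you want to keep in the body.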
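As a rough illustration of that advice, the sketch below selects the article container by class and drops any script or style tags inside it before reading the text. The sample HTML is a hypothetical stand-in for the real page, and `extract_article` is a name made up for this example; only the class names `post-body` and `entry-content` come from the transcript above. Note that a Python dict literal with a repeated `'class'` key keeps only the last value, so the `attrs` call in the transcript effectively matches on `entry-content` alone; a CSS selector is one way to require both classes.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; only the class names
# 'post-body' and 'entry-content' are taken from the answer.
SAMPLE_HTML = """
<html><body>
  <div class="sidebar">About Us</div>
  <h1>Lenovo K3 Note Review</h1>
  <div class="post-body entry-content">
    <script>(adsbygoogle = window.adsbygoogle || []).push({});</script>
    <p>It seems like Lenovo has finally caught the pulse.</p>
    <p>Let's begin!</p>
  </div>
</body></html>
"""


def extract_article(html):
    """Return the text of the post body with inline scripts removed."""
    soup = BeautifulSoup(html, "html.parser")
    # CSS selector: a single div that carries both classes.
    body = soup.select_one("div.post-body.entry-content")
    if body is None:
        return ""
    # Remove ad snippets and styles that sit inside the article container.
    for tag in body.find_all(["script", "style"]):
        tag.decompose()
    # Join text nodes with a space so paragraphs do not run together.
    return body.get_text(separator=" ", strip=True)


article_text = extract_article(SAMPLE_HTML)
print(article_text)
```

The selector has to be adapted to each site's markup, which is the point the answer makes: the library needs to be told where the article lives.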

  

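Quoting the question: "So my question is: how can I really get clean plain text out of HTML with Python? I see many web tools that support a so-called reading view mode, where in most cases you see only the main article, so I assumed that extracting clean plain text should not be a problem."

That is unrelated. The "plain" text view is just a different CSS styling that shows only the text; it does not make the page's source any simpler.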

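Answer 1 (score: 1)

You need to find the style and script tags and destroy them, together with their contents, using the .decompose method. From there, just use get_text to get the text.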

from urllib.request import urlopen # import urllib in Python 2.x
from bs4 import BeautifulSoup


url = "http://www.amfastech.com/2015/07/lenovo-k3-note-brutally-honest-review-specifications-pros-cons.html"
html = urlopen(url).read()  
soup = BeautifulSoup(html, 'lxml') 
for tag in soup.find_all(['script', 'style']):
    tag.decompose()   
soup.get_text(strip=True)

Which yields:

  

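"Lenovo K3 Note Brutally Honest Review: Specifications, Pros and ConsHAbout UsBlog IndexServicesNewsVisitorsContact UsYou are here:Home»Smartphone Reviews»Lenovo K3 Note Brutally Honest Review: Specifications, Pros and ConsSasidhar Kareti10:40:00 AMLenovo K3 Note Brutally Honest Review: Specifications, Pros and ConsIt seems like Lenovo has finally caught the pulse of smartphone market in countries like India. After the successful launch of A6000, 6000+ and A7000, the company has come up with something big, both psychically and performance wise, with a name k3 note.The term 'Note' itself.........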
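The output above runs navigation labels, headings and sentences together, because `get_text(strip=True)` joins the stripped text nodes with an empty separator. As a minimal sketch, assuming a tiny hypothetical page instead of the real URL, the same decompose approach can be combined with the `separator` argument of `get_text` to keep fragments apart. The helper name `visible_text` is invented for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page; the real page from the question is not fetched.
SAMPLE_HTML = """
<html><head>
  <style>body { color: black; }</style>
  <script>window.jstiming = {};</script>
</head><body>
  <ul><li>Home</li><li>About Us</li></ul>
  <h1>Review</h1><p>Pros and cons.</p>
</body></html>
"""


def visible_text(html, separator=" "):
    """Strip script/style tags, then join the remaining text nodes."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(["script", "style"]):
        tag.decompose()  # removes the tag and everything inside it
    return soup.get_text(separator=separator, strip=True)


glued = visible_text(SAMPLE_HTML, separator="")  # same joining as get_text(strip=True)
spaced = visible_text(SAMPLE_HTML)               # fragments separated by one space

print(glued)   # HomeAbout UsReviewPros and cons.
print(spaced)  # Home About Us Review Pros and cons.
```

This still returns menu and breadcrumb text along with the article; removing those requires targeting the article container, as the other answer suggests.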