确定字符串的类型

时间:2013-03-10 04:06:05

标签: python beautifulsoup boilerpipe

我正在寻找某种方法来确定任何文章网站(例如this one)中字符串的类型。类型将是标题,作者,日期,文章本身。我使用BeautifulSoup和Boilerpipe来抓取相关内容:

from boilerpipe.extract import Extractor
from bs4 import BeautifulSoup
import urllib

urlvar = 'http://goarticles.com/article/Finding-Out-What-Makes-a-UK-VPN-Tick/7454423/'

result = urllib.urlopen(urlvar)
soup = BeautifulSoup(result.read())

extractor = Extractor(extractor='ArticleExtractor', html=soup.prettify())

extracted_html = extractor.getHTML()
print(extracted_html.encode("utf-8"))

现在我有一些看起来像这样的输出:

<HTML prefix="og: http://ogp.me/ns#">
<style type="text/css">
A:before { content:' '; } 
A:after { content:' '; } 
SPAN:before { content:' '; } 
SPAN:after { content:' '; } 
</style>
<BODY><DIV id="content"><DIV class="wrapper"><DIV id="article-content"><DIV id="main-col"><DIV align="left" class="article"><H1 class="art_head">
        Finding Out What Makes a UK VPN Tick
        <EM>
         by John H. Hickman
        </EM></H1><H2 class="art_cat"><B>
         in
         <A href="/category/technology/">
          Technology
         </A></B>
        (submitted 2013-03-09)
       </H2><P>
        How a UK VPN Works
       </P><P>
        A VPN creates a virtual tunnel through the Internet, through which encrypted data passes through. No matter how a user connects to the web, a VPN secures the connection with high grade encryption. This is important for public hotspots and other unsecure networks that put users at risk for digital eavesdropping. The benefits don't stop there; below is a description of the benefits of using a British VPN.
       </P><P>
        Can You Access UK Websites with a UK VPN?
       </P><P>
        Many websites are geo-restricted or blocked from users who don't have a UK IP address. A British VPN provider bypasses these restrictions so that users can visit any UK website.
       </P><P>
        Quality VPN providers will also offer other server connections that appear to be in several different countries, including the US, Europe and Asia. Each British VPN provider will offer a different number of servers in different locations. The more servers that the service provides, the more options users have.
       </P><P>
        Is Security That Important?
       </P><P>
        Many think that they have nothing to hide, but information of any kind is valuable to hackers and phishers on the Internet. Each time a person transfers data through the Internet, that information is open to interception. But a United Kingdom VPN provides an encrypted, secure connection so that data stays private. This is how a secure connection works:
       </P><P>
        ΓÇó The United Kingdom VPN provider provides an application that connects to the server. With a few clicks, the connection is configured and ready to go.

        ΓÇó The VPN replaces the actual IP address with one that isn't tied to a physical location.

        ΓÇó The United Kingdom VPN encrypts all information that the user sends and receives online.

        ΓÇó The VPN allows the user to bypass location based IP blocking.

        Many different British VPNs will feature 64bit, 128bit and 256bit encryption; the higher the bit, the better the encryption.

        Only the server and the client can decrypt the secure connection. If a person uses a Wi-Fi hotspot at work, school, the airport or in a restaurant, an unsecure connection can leave them vulnerable to phishing and packet sniffing. A UK VPN will keep this information safe and secure wherever a user accesses the Internet. It will keep passwords, the websites history and online activity secure. A good UK VPN provider will secure the user's Internet connection so that they can browse websites in the UK and other countries safe from hackers and enjoy the web without any worries.
       </P><H3 class="about_author">
        About the Author
       </H3><P>
        Thomas R. Hickman believes the internet should be free and open. The
        <A href="https://www.goldenfrog.com/vyprvpn/uk-vpn" target="_new">
         UK VPN Service
        </A>
        he mentions above is one tool you can use to increase your online freedom. Thomas strives to inform others about the tools available for avoiding restrictions on the web. The fastest UK VPN Provider can be found at
        <A href="https://www.goldenfrog.com/vyprvpn/uk-vpn" target="_new">
         https://www.goldenfrog.com/vyprvpn/uk-vpn
        </A></P><DIV id="a_disclaimer">
        Use and distribution of this article is subject to our
        <A href="/publisher.html" target="_blank">
         Publisher Guidelines
        </A>
        whereby the original author's information and copyright must be included.
       </DIV></DIV></DIV><DIV id="left-col"><P><B>
        John H. Hickman
       </B></P></DIV></DIV></DIV></DIV></BODY></HTML>

或者这个:

<HTML lang="en-US" xmlns:FB="http://www.facebook.com/2008/fbml" xmlns:OG="http://ogp.me/ns#">
<style type="text/css">
A:before { content:' '; } 
A:after { content:' '; } 
SPAN:before { content:' '; } 
SPAN:after { content:' '; } 
</style>
<BODY class="no-js yog-ltr yog-blog" dir="ltr"><DIV class="yog-bd" id="yog-bd" role="main"><DIV class="yog-wrap yog-grid yog-24u"><DIV class="yog-col yog-16u yom-primary"><DIV class="yog-wrap yom-art-bd"><DIV class="yog-col yog-11u"><DIV class="yom-mod yom-art-content "><DIV class="bd"><P class="first"><SPAN class="yom-figure yom-fig-right" style="width:630px;"><SPAN class="legend">
             A clock hangs in Plantation, Fla. (Joe Raedle/Getty Images)
            </SPAN></SPAN></P><P>
           Here's everything you wanted to know about the time change this weekend.
          </P><P><STRONG>
            When is it?
           </STRONG>
           The time change begins on Sunday, March 10, at 2 a.m., when clocks are moved forward by one hour.
          </P><P><STRONG>
            Why 2 a.m.?
           </STRONG>
           The time change is set for 2 a.m. because it was decided to be the
           <A href="http://www.lifeslittlemysteries.com/81-why-does-daylight-saving-time-begin-at-2-am.html">
            least disruptive
           </A>
           time of day. Moving time forward or back an hour at that time doesnΓÇÖt change the date, which avoids confusion, and most people are asleep, or if people do work on a Sunday, itΓÇÖs usually later than 2 a.m.
          </P><P><STRONG>
            Do all states observe daylight saving time?
           </STRONG>
           Hawaii and most of Arizona donΓÇÖt
           <A href="http://www.timeanddate.com/time/us/arizona-no-dst.html">
            observe the time change
           </A>
           . U.S. territories that donΓÇÖt go on daylight saving time include American Samoa, Guam, Puerto Rico and the Virgin Islands.
          </P><P><STRONG>
            Why do we have it?
           </STRONG>
           The idea is to save electricity because there are more hours of natural light.
           <A href="http://news.nationalgeographic.com/news/2008/03/080307-daylight-saving.html">
            Studies have shown
           </A>
           the savings to be fairly nominalΓÇöthe time change leading people to switch on the lights earlier in the morning instead or cranking up the air conditioning, for example.
          </P><P><STRONG>
            What is the history of daylight saving time?
           </STRONG>
           Fun fact: The idea was first floated in 1784 by one Benjamin Franklin. While minister of France, he
           <A href="http://www.livescience.com/11069-started-daylight-saving-time.html">
            wrote the essay
           </A>
           &quot;An Economical Project for Diminishing the Cost of Light.&quot;
          </P><P>
           The idea failed to see the light of day until 1883, when the U.S. railroads instituted a standardized time for their train schedules. That time change was imposed nationally during the First World War to conserve energy, but it was repealed after the war. It became the national time again during World War II.
          </P><P>
           After that, it was up to the states to decide if they wanted it, and when it would start and end. Congress finally enacted the Uniform Time Act in 1966, which standardized the beginning and end of daylight time for the states that observed it. In 1974 and 1975, the energy crisis moved Congress to enact earlier daylight start times, which were reversed when the crisis was over.
          </P><P><A href="http://aa.usno.navy.mil/faq/docs/daylight_time.php">
            Daylight saving time
           </A>
           since then had always been in AprilΓÇöuntil the Energy Policy Act of 2005 ordered the earlier start time to begin in March 2007.
          </P></DIV></DIV></DIV></DIV><DIV class="yom-mod yom-bcarousel ymg-carousel" id="MediaInfiniteBrowse"><DIV class="hd"><DIV class="yui-carousel-nav"><DIV class="heading"><H3>
           Explore Related Content
          </H3></DIV></DIV></DIV></DIV></DIV></DIV></DIV></BODY></HTML>

我正在寻找任何方法来确定哪个标签包含标题字符串,作者字符串,发布日期和文章字符串(如果有)。
我在使用scrapy时无所事事,但它不包含从不同站点获取此信息的任何通用算法。
我开始怀疑这样的事情甚至是可能的,除了一些疯狂的评估字符串的长度,并希望标题标签总是比作者标签有更多的字符,但比文章标签少。但这对于思考和输出可能非常不准确是非常幼稚的 有关如何做的任何指示?

0 个答案:

没有答案