Extract text from multiple links and store each one separately

Time: 2018-02-21 09:59:40

Tags: python html python-2.7 web-scraping beautifulsoup

I am a data analyst and I know ML and DL, but I am not good at web scraping.

I am scraping data. What I want to do is the following:

  1. Using the DuckDuckGo API, I extract all the links for a query such as "what is bitcoin".

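  2. Then, once I have the list of links, I want to scrape them one by one and store the text of each separately, so that I can use it for my NLP work and so on.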

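  3. However, my problem is that I cannot get the best text out of these links, and sometimes, for a few links, I cannot read the HTML at all, which raises a getaddress error.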

    Here is my code.

    import re, urllib
    import pandas as pd
    from bs4 import BeautifulSoup
    from urllib import urlopen
    
    
    query = "litecoin"
    site = urlopen("http://duckduckgo.com/html/?q="+query)
    data = site.read()
    soup = BeautifulSoup(data, "html5lib")
    
    
    my_list = soup.find("div", {"id": "links"}).find_all("div", {'class': re.compile('.*web-result*.')})[0:50]
    
    print len(my_list)
    
    (result__snippet, result_url) = ([] for i in range(2))
    
    for i in my_list:         
        try:
            result__snippet.append(i.find("a", {"class": "result__snippet"}).get_text().strip("\n").strip())
        except:
            result__snippet.append(None)
        try:
            result_url.append(i.find("a", {"class": "result__url"}).get_text().strip("\n").strip())
        except:
            result_url.append(None)
    
    print(result_url)
    [u'litecoin.org',
     u'litecoin.com',
     u'en.wikipedia.org/wiki/Litecoin',
     u'coinmarketcap.com/currencies/litecoin/',
     u'profitconfidential.com/category/cryptocurrency/litecoin/',
     u'fortune.com/2017/12/12/litecoin-bitcoin-price-2018/',
     u'finance.yahoo.com/news/litecoin-everything-need-know-184858...',
     u'cointelegraph.com/tags/litecoin',
     u'worldcoinindex.com/coin/LiteCoin',
     u'litecoin.com/services',
     u'forbes.com/sites/madhvimavadiya/2017/12/12/what-is-l...',
     u'thecollegeinvestor.com/19673/how-to-invest-in-litecoin/',
     u'cnbc.com/2017/12/12/litecoin-price-hits-record-hig...',
     u'markets.businessinsider.com/currencies/ltc-usd',
     u'gdax.com/trade/LTC-USD',
     u'forbes.com/sites/jessedamiani/2017/12/13/5-reasons-w...',
     u'twitter.com/litecoin',
     u'fortune.com/2018/02/14/litecoin-price-cryptocurrency/',
     u'coindesk.com/information/comparing-litecoin-bitcoin/',
     u'fool.com/investing/2017/12/24/5-reasons-litecoin-i...',
     u'profitconfidential.com/cryptocurrency/litecoin/what-is-litecoin/',
     u'litecoin.miningpoolhub.com',
     u'kitco.com/litecoin-price-charts-usd/',
     u'cryptocompare.com/coins/ltc/',
     u'lifewire.com/what-is-litecoin-4151693',
     u'ibtimes.com/litecoin-price-predictions-2018-experts-f...',
     u'livebitcoinnews.com/news/litecoin-news/',
     u'money.cnn.com/2017/12/12/investing/litecoin-price-coinb...',
     u'live.blockcypher.com/ltc/',
     u'reddit.com/r/litecoin/']
    
    
    
    from bs4 import BeautifulSoup
    from bs4.element import Comment
    
    from urllib import urlopen
    
    # Now, I start my scraping.
    def tag_visible(element):
        if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
            return False
        if isinstance(element, Comment):
            return False
        return True
    
    
    def text_from_html(body):
        soup = BeautifulSoup(body, 'html5lib')
        texts = soup.findAll(text=True)
        visible_texts = filter(tag_visible, texts)  
        return u" ".join(t.strip() for t in visible_texts)
    
    try:
        try:
            html = urlopen('www.' + result_url[5]).read()
            print(text_from_html(html))
        except:
            html = urlopen(result_url[1]).read()
            print(text_from_html(html))
    except:
        html = urlopen('https://www.' + result_url[5]).read()
        print(text_from_html(html))
    

    So there are two problems: for some links it raises an error, and for the links that do work, the extracted text is not very meaningful.

    Please help! Please correct me if I have explained something wrongly; I am new to scraping.

    TIA

1 answer:

Answer 0: (score: 0)

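Split the search query and the titles of the search results into keywords. Start from the tag whose text has the most keyword matches. Then walk down the hierarchy from the body tag for as long as the number of matches in the parent equals the number of matches in one child, and move to that child. Continue until the parent has more matches than each of its children. Then extract the text of that element. Try ranking by keywords and see how well it works. If you are satisfied with information from a single source, you can use the Wikipedia API. Good luck!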
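
A minimal sketch of one possible reading of the keyword-matching descent described above. It is written for Python 3 using only the standard library (the question itself uses Python 2.7 with BeautifulSoup), and `Node`, `TreeBuilder`, `keywords`, `count_matches`, and `best_element` are hypothetical names introduced here for illustration. It starts the descent at the document root and ignores words of one or two letters; both are assumptions, not something the answer specifies.

```python
import re
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not become "current".
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Elements whose text is not visible page content.
SKIPPED_TAGS = {"script", "style", "head", "title"}


class Node:
    """Hypothetical minimal element: a tag name plus ordered text/child parts."""

    def __init__(self, tag, parent=None):
        self.tag, self.parent, self.parts = tag, parent, []

    def children(self):
        return [p for p in self.parts if isinstance(p, Node)]

    def text(self):
        """Visible text below this node, in document order."""
        if self.tag in SKIPPED_TAGS:
            return ""
        pieces = (p.text() if isinstance(p, Node) else p.strip()
                  for p in self.parts)
        return " ".join(piece for piece in pieces if piece)


class TreeBuilder(HTMLParser):
    """Builds a simple Node tree; tolerant of unclosed and stray tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("[document]")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.parts.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.parts.append(data)


def keywords(*phrases):
    """Lower-cased words from the query and result title (assumption: len > 2)."""
    words = re.findall(r"\w+", " ".join(phrases).lower())
    return {w for w in words if len(w) > 2}


def count_matches(node, words):
    """Number of word tokens in the node's visible text that are keywords."""
    return sum(1 for t in re.findall(r"\w+", node.text().lower()) if t in words)


def best_element(html, query, title=""):
    """Descend while a single child still holds every match of its parent."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    words = keywords(query, title)
    node = builder.root
    total = count_matches(node, words)
    while total > 0:
        heir = next((c for c in node.children()
                     if count_matches(c, words) == total), None)
        if heir is None:
            break  # matches are spread over several children: stop here
        node = heir
    return node
```

With a page already fetched into a string, `best_element(page, "what is litecoin").text()` would return the text of the chosen element. The descent is only a heuristic: if every match happens to sit inside one small inline element such as a link, it will narrow down to that element, so some minimum-length check may be worth adding.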