I'm a data analyst; I know ML and DL, but my web scraping is weak.
I'm collecting data. What I want to do is the following:
Using the DuckDuckGo API, I extract all the links for a query such as "what is bitcoin".
Then, after getting the list of links, I want to scrape them one by one and store the text of each separately, so I can use it for my NLP work, etc.
But here is my problem: I can't extract good text from these links, and for a few of them I can't even read the HTML; the request raises a getaddrinfo error.
Here is my code.
import re
from urllib.parse import quote_plus
from urllib.request import urlopen

from bs4 import BeautifulSoup

query = "litecoin"
# Fetch the DuckDuckGo HTML results page for the query.
site = urlopen("https://duckduckgo.com/html/?q=" + quote_plus(query))
data = site.read()
soup = BeautifulSoup(data, "html5lib")
# Take the first 50 result blocks. (The original pattern '.*web-result*.'
# happens to match, but '.*web-result.*' expresses the intent correctly.)
my_list = soup.find("div", {"id": "links"}).find_all(
    "div", {"class": re.compile(r".*web-result.*")})[:50]
print(len(my_list))
# Collect the snippet text and the displayed URL from each result block.
result__snippet, result_url = [], []
for i in my_list:
    try:
        result__snippet.append(
            i.find("a", {"class": "result__snippet"}).get_text().strip())
    except AttributeError:
        result__snippet.append(None)
    try:
        result_url.append(
            i.find("a", {"class": "result__url"}).get_text().strip())
    except AttributeError:
        result_url.append(None)
print(result_url)
['litecoin.org',
 'litecoin.com',
 'en.wikipedia.org/wiki/Litecoin',
 'coinmarketcap.com/currencies/litecoin/',
 'profitconfidential.com/category/cryptocurrency/litecoin/',
 'fortune.com/2017/12/12/litecoin-bitcoin-price-2018/',
 'finance.yahoo.com/news/litecoin-everything-need-know-184858...',
 'cointelegraph.com/tags/litecoin',
 'worldcoinindex.com/coin/LiteCoin',
 'litecoin.com/services',
 'forbes.com/sites/madhvimavadiya/2017/12/12/what-is-l...',
 'thecollegeinvestor.com/19673/how-to-invest-in-litecoin/',
 'cnbc.com/2017/12/12/litecoin-price-hits-record-hig...',
 'markets.businessinsider.com/currencies/ltc-usd',
 'gdax.com/trade/LTC-USD',
 'forbes.com/sites/jessedamiani/2017/12/13/5-reasons-w...',
 'twitter.com/litecoin',
 'fortune.com/2018/02/14/litecoin-price-cryptocurrency/',
 'coindesk.com/information/comparing-litecoin-bitcoin/',
 'fool.com/investing/2017/12/24/5-reasons-litecoin-i...',
 'profitconfidential.com/cryptocurrency/litecoin/what-is-litecoin/',
 'litecoin.miningpoolhub.com',
 'kitco.com/litecoin-price-charts-usd/',
 'cryptocompare.com/coins/ltc/',
 'lifewire.com/what-is-litecoin-4151693',
 'ibtimes.com/litecoin-price-predictions-2018-experts-f...',
 'livebitcoinnews.com/news/litecoin-news/',
 'money.cnn.com/2017/12/12/investing/litecoin-price-coinb...',
 'live.blockcypher.com/ltc/',
 'reddit.com/r/litecoin/']
from bs4 import BeautifulSoup
from bs4.element import Comment
from urllib.request import urlopen

# Now, I start my scraping.
def tag_visible(element):
    # Drop text that lives inside tags the user never sees, and comments.
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True

def text_from_html(body):
    soup = BeautifulSoup(body, 'html5lib')
    texts = soup.find_all(string=True)
    visible_texts = filter(tag_visible, texts)
    return " ".join(t.strip() for t in visible_texts)
try:
    # The original call passed a scheme-less host ('www.' + url), which is
    # what raises the getaddrinfo error; urlopen needs http:// or https://.
    html = urlopen('https://' + result_url[5]).read()
    print(text_from_html(html))
except Exception:
    # Fall back to plain HTTP for sites that reject HTTPS.
    html = urlopen('http://' + result_url[5]).read()
    print(text_from_html(html))
So, there are two problems: for some links it raises an error, and for the links that do work, the extracted text carries little meaning (mostly navigation and boilerplate rather than the article body).
Please help! And please correct me if I've gone wrong somewhere in my explanation.
TIA
Answer 0 (score: 0)
Split the search query and the title of each search result into keywords. Start from the tag with the maximum keyword match, then walk down the hierarchy from the body tag: as long as a child matches as many keywords as its parent, descend into that child; stop when the parent's match count exceeds every child's, and extract the text from that element. Try ranking by keywords and see how it performs. If you are happy with a single source of information, you could use the Wikipedia API instead. Good luck!
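The core idea in this answer, as I read it, is to score each candidate block of text by how many query keywords it contains and keep the best match. A minimal stdlib sketch of that keyword-overlap ranking; the function names and sample blocks are my own illustration, not from the answer:

```python
import re

def keywords(text):
    """Lowercase word tokens, for crude keyword matching."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def score(query, candidate):
    """Count how many query keywords appear in the candidate text."""
    return len(keywords(query) & keywords(candidate))

def best_block(query, blocks):
    """Pick the text block sharing the most keywords with the query."""
    return max(blocks, key=lambda b: score(query, b))

# Hypothetical visible-text blocks extracted from one scraped page:
blocks = [
    "Subscribe to our newsletter for daily market updates",
    "Litecoin is a peer-to-peer cryptocurrency released under the MIT license",
    "Follow us on Twitter",
]
print(best_block("what is litecoin", blocks))
# → "Litecoin is a peer-to-peer cryptocurrency released under the MIT license"
```

The same scoring could be applied per-element while walking the parsed tree, as the answer suggests, instead of over flat text blocks.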