Scraping metadata from 10,000 websites is too slow (Python)

Posted: 2017-12-09 18:13:51

Tags: python web-scraping python-requests metadata

Hi everyone,

I'm trying to parse the metadata of 10,000 websites into a Pandas dataframe for an SEO/analytics application, but the code is taking forever. I've been trying it on 1,000 websites and the code has now been running for 3 hours (it works without a problem on 10-50 websites).

Here's the sample data:

index   site    
0       http://www.google.com
1       http://www.youtube.com
2       http://www.facebook.com
3       http://www.cnn.com
...     ...
10000   http://www.sony.com

Here's my Python (2.7) code:

# Importing dependencies
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
import metadata_parser

# Loading the Pandas dataframe
df = pd.read_csv('final_urls')

# Utility functions
def meta(website, metadata):
    full_url = website
    parser = metadata_parser.MetadataParser(url=full_url)
    if metadata == 'all':
        return parser.metadata
    else:
        return parser.metadata[metadata]

def meta_all(website):
    try:
        result = meta(website, 'all')
    except BaseException:
        result = 'Exception'
    return result

# Main
df['site'].apply(meta_all)

I'd like the code to run faster. I've been using the metadata_parser library (https://github.com/jvanasco/metadata_parser), which relies heavily on requests and BeautifulSoup.

  • I know I could change the parser to lxml to make the code faster. It's already installed on my machine, so BeautifulSoup should pick it up as the preferred parser (see the short sketch after this list).
  • Do you have any suggestions for getting this code to run faster?
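For reference, here is a minimal sketch of what picking lxml explicitly looks like when calling BeautifulSoup directly. metadata_parser drives BeautifulSoup internally, so this only illustrates the parser choice, not that library's API; the URL is just an example:

# Minimal sketch: request the lxml parser explicitly instead of relying on
# BeautifulSoup's default choice of html.parser.
import requests
from bs4 import BeautifulSoup

html = requests.get('http://www.google.com', timeout=10).text
soup = BeautifulSoup(html, 'lxml')  # explicit lxml parser
description = soup.find('meta', attrs={'name': 'description'})
print(description['content'] if description else None)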

Thanks!

1 answer:

Answer 0 (score: 1)

You can use Python Twisted (Twisted is an event-driven networking engine written in Python). You will need to install a few packages with pip, probably twisted, pyopenssl and service_identity. This code works with Python 2.7, which you say you are using.

from twisted.internet import defer, reactor
from twisted.web.client import getPage
import metadata_parser
import pandas as pd
import numpy as np
from multiprocessing import Process

def pageCallback(result, url):
    # Pair the downloaded page body with the URL it came from
    data = {
        'content': result,
        'url': url,
    }
    return data

def getPageData(url):
    # getPage returns a Deferred that fires with the raw page body
    d = getPage(url)
    d.addCallback(pageCallback, url)
    return d

def listCallback(result):
    # The DeferredList fires with a list of (success, value) pairs, one per URL
    for isSuccess, data in result:
        if isSuccess:
            print("Call to %s succeeded " % (data['url']))
            parser = metadata_parser.MetadataParser(html=data['content'], search_head_only=False)
            print(parser.metadata)  # do something with it here

def finish(ign):
    # Stop the reactor once the whole chunk has been processed
    reactor.stop()

def start(urls):
    # Kick off all requests for this chunk and gather them in one DeferredList
    data = []
    for url in urls:
        data.append(getPageData(url))
    dl = defer.DeferredList(data)
    dl.addCallback(listCallback)
    dl.addCallback(finish)

def processStart(chunk):
    # Each chunk runs in its own process, so a fresh reactor can be started every time
    start(chunk)
    reactor.run()

df = pd.read_csv('final_urls')
urls = df['site'].values.tolist()
chunkCounter = 0
chunkLength = 1000
for chunk in np.array_split(urls,len(urls)/chunkLength):
    p = Process(target=processStart, args=(chunk,))
    p.start()
    p.join()
    chunkCounter += 1
    print("Finished chunk %s of %s URLs" % (str(chunkCounter), str(chunkLength)))

I've run this on 10,000 URLs and it took less than 16 minutes.

Update: Normally you would process the data you generate where I added the comment "# do something with it here". If you want the generated data returned for processing, you can do something like this (I have also updated it to use treq):

from twisted.internet import defer, reactor
import treq
import metadata_parser
import pandas as pd
import numpy as np
import multiprocessing
from twisted.python import log
import sys

# log.startLogging(sys.stdout)

results = []

def pageCallback(result, url):
    # result is a treq response; .content() returns a Deferred that fires with
    # the body bytes, so chain another callback to pair the body with its URL
    d = result.content()
    d.addCallback(bodyCallback, url)
    return d

def bodyCallback(body, url):
    data = {
        'content': body,
        'url': url,
    }
    return data

def getPageData(url):
    d = treq.get(url, timeout=60, headers={'User-Agent': ["Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:57.0) Gecko/20100101 Firefox/57.0"]})
    d.addCallback(pageCallback, url)
    return d

def listCallback(result):
    global results
    for isSuccess, data in result:
        if isSuccess:
            print("Call to %s succeeded " % (data['url']))
            parser = metadata_parser.MetadataParser(html=str(data['content']), search_head_only=False)
            # print(parser.metadata)  # do something with it here
            results.append((data['url'], parser.metadata))

def finish(ign):
    reactor.stop()

def start(urls):
    data = []
    for url in urls:
        data.append(getPageData(url))
    dl = defer.DeferredList(data)
    dl.addCallback(listCallback)
    dl.addCallback(finish)

def processStart(chunk, returnList):
    start(chunk)
    reactor.run()
    returnList.extend(results)

df = pd.read_csv('final_urls')
urls = df['site'].values.tolist()
chunkCounter = 0
chunkLength = 1000

manager = multiprocessing.Manager()
returnList = manager.list()
for chunk in np.array_split(urls,len(urls)/chunkLength):
    p = multiprocessing.Process(target=processStart, args=(chunk,returnList))
    p.start()
    p.join()
    chunkCounter += 1
    print("Finished chunk %s of %s URLs" % (str(chunkCounter), str(chunkLength)))

for res in returnList:
    print (res)

print (len(returnList))

You may also want to add some error handling; to help with that, uncomment the line that reads "log.startLogging(sys.stdout)", but that is too much detail for one answer. If some of the URLs fail, I would normally retry them by running the code again, if necessary with just the failed URLs.
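As one possible way to do that (a sketch, not part of the original answer): an errback on each treq request can record failed URLs so a second pass can retry only those. The names `failed` and `errorCallback` are introduced here purely for illustration, and `pageCallback` is the one defined in the treq example above:

import treq

failed = []  # URLs whose request failed, collected for a later retry pass

def errorCallback(failure, url):
    # Log and record the failure; returning it keeps the DeferredList entry
    # marked as unsuccessful, matching the behaviour of the answer's code
    print("Call to %s failed: %s" % (url, failure.getErrorMessage()))
    failed.append(url)
    return failure

def getPageData(url):
    # Same request as in the treq example above, plus an errback for failures
    d = treq.get(url, timeout=60)
    d.addCallback(pageCallback, url)
    d.addErrback(errorCallback, url)
    return d

# After a full run, the URLs in `failed` can be fed back through the same
# chunked loop to retry only the ones that did not succeed.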