PySpark job on a Google Cloud Dataproc cluster has poor CPU utilization

Time: 2019-05-27 20:12:19

Tags: python apache-spark pyspark

I am running a PySpark job on a Google Cloud cluster. The problem is that the CPU does not seem to be used properly: there is plenty of CPU capacity left unused.

I have spent several days searching without figuring out what the problem could be, so I am now asking for help here.

The program crawls the web to find pages that link to the desired target site.

I have been searching for the best way to submit the job to the cloud, I have tried changing Spark's configuration variables, and I have looked into whether the problem lies in how the cluster is created.
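
For reference, this is the kind of configuration change I mean; the property names and values below are only placeholders to show the pattern, not the exact settings I tried:

```
from pyspark import SparkConf

# Placeholder properties/values: only meant to show how I have been
# adjusting the Spark configuration, not the exact settings I tried.
conf = (SparkConf()
        .setAppName("Crawler")
        .set("spark.executor.instances", "3")
        .set("spark.executor.cores", "2")
        .set("spark.default.parallelism", "12"))
```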


```
import sys
from operator import add

import requests
import bs4
from pyspark import SparkConf, SparkContext


def crawl(iterator):
        s = iterator[0]
        lista = []
        try:

            if 'http' not in s[0:4]:
                s = 'http://' + s

            res = requests.get(s)
            soup = bs4.BeautifulSoup(res.text, 'html.parser')
            soups = soup.find_all('a', href=True)

            for element in soups:
                i = ""

                if 'http' in element['href'][0:4]:
                    i = element['href']
                elif '//' in element['href'][0:2]:
                    i = str("http:" + element['href'])
                elif '/' in element['href'][0]:
                    if s[len(s)-1] == '/':
                        s = s[0:len(s)-1]
                    i = str(s + element['href'])

                if i != "":
                    if s != i:
                        lista.append((s,[i]))

            return lista
        except Exception:  # request or parsing failed for this URL
            #print("Error while crawling on site: ", s)
            lista.append(("error", ["error"]))
            return lista

def computeContribs(urls, rank):
        size = len(urls)
        for url in urls:
            yield (url, rank/size)

def process_input(target):
        linked = []
        if ', ' in target:
            for x in target.split(', '):
                linked.append(x)
        else:
            linked.append(target)
        return linked

target = str(sys.argv[1])
depth = int(sys.argv[2])
target = process_input(target)
target2 = target

conf = SparkConf().setAppName("Crawler").setMaster('local[*]')
sc = SparkContext(conf = conf)

links = sc.parallelize([target])
link_filter = sc.parallelize([])
lista = []

ranks = links.flatMap(lambda x: [(v, 1.0) for v in x])
links = links.flatMap(lambda x: [(v, "target") for v in x])

for l in range(0, depth):
    links = links.flatMap(crawl).filter(lambda x: x[0] != x[1])
    rank_links = links.reduceByKey(add) #.groupByKey() worse on big data
    links = links.map(lambda x: (x[1][0], x[0])).reduceByKey(lambda x,y: x)

    RDD1 = rank_links.join(ranks)
    RDD_Contrib = RDD1.flatMap(lambda x: computeContribs(x[1][0], x[1][1]))
    ranks = RDD_Contrib.reduceByKey(add).mapValues(lambda rank:rank*0.85+0.15)

    # find_link is not included in the code shown here
    link_filter += rank_links.flatMap(lambda x: find_link(x[0], x[1], target2))
    link_filter = link_filter.filter(lambda x: x != None)

    links.saveAsTextFile('gs://bucket/crawl/text1')
    rank_links.saveAsTextFile('gs://bucket/crawl/text2')
```
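
To show what the rank step is doing, here is a tiny standalone illustration of computeContribs and the damping update with made-up values (this is not part of the job itself):

```
# Standalone illustration of computeContribs and the 0.85/0.15 damping
# used in the job above, with made-up data.
def computeContribs(urls, rank):
    size = len(urls)
    for url in urls:
        yield (url, rank / size)

# A page with rank 1.0 linking to two URLs contributes 0.5 to each of them.
contribs = list(computeContribs(['http://a.example', 'http://b.example'], 1.0))
print(contribs)           # [('http://a.example', 0.5), ('http://b.example', 0.5)]

# Each summed contribution is then damped: rank * 0.85 + 0.15
print(0.5 * 0.85 + 0.15)  # 0.575
```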


USED IN POWERSHELL (VISUAL STUDIO CODE):

gcloud dataproc jobs submit pyspark test.py --cluster="cluster-crawl"

USED IN GOOGLE CLOUD SHELL:

gcloud beta dataproc clusters create cluster-crawl --enable-component-gateway --subnet default --zone us-east1-d --master-machine-type n1-standard-2 --master-boot-disk-size 500 --num-workers 3 --worker-machine-type n1-standard-2 --worker-boot-disk-size 500 --image-version 1.3-deb9 --project laboration1-236309 --metadata 'PIP_PACKAGES=bs4' --initialization-actions gs://dataproc-initialization-actions/python/pip-install.sh


These are the resulting run times:

depth = 1 gives around 30 seconds on both cloud and laptop.
depth = 2 gives around 1 min on both cloud and laptop.
depth = 3 gives around 20 min on laptop and 1 hour on cloud.

I have not tried larger depths, but this escalation in time between the laptop and the cluster seems unrealistic.
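
In case it is relevant, a quick sanity check of how much parallelism the job actually gets could look like the sketch below (it reuses the sc and links variables from the script above and is not something I have included in the job):

```
# Sketch only: quick checks of the parallelism actually available,
# assuming the sc and links variables from the script above.
print("defaultParallelism:", sc.defaultParallelism)
print("links partitions:", links.getNumPartitions())
```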

Below is an image of the CPU utilization:

[CPU utilization][1]

  [1]: https://i.stack.imgur.com/hQ8cL.png

0 Answers:

There are no answers yet