I am running a PySpark job on a Google Cloud (Dataproc) cluster. The problem is that the CPU utilization does not look right: there is plenty of CPU capacity that is never used.
I have spent several days trying to find out what the problem might be, and I am now asking for help here.
The program crawls the web looking for pages that link to a desired target site.
I have looked into how best to submit the job to the cloud, tried changing Spark's configuration variables, and checked whether the problem lies in how the cluster was created.
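As a concrete example of what I mean by configuration variables, this is a minimal sketch of the kind of settings that can be set on the SparkConf when submitting; the executor and parallelism values below are only placeholders for illustration, not values I am claiming are correct:

```
from pyspark import SparkConf, SparkContext

# Sketch only: placeholder values for the kind of Spark settings I experimented with.
conf = (SparkConf()
        .setAppName("Crawler")
        .set("spark.executor.instances", "3")     # e.g. one executor per worker node
        .set("spark.executor.cores", "2")         # cores per executor
        .set("spark.executor.memory", "4g")       # memory per executor
        .set("spark.default.parallelism", "12"))  # partitions used for shuffles
sc = SparkContext(conf=conf)
```

The full script I am running is below.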
```
import sys
from operator import add

import requests
import bs4
from pyspark import SparkConf, SparkContext


def crawl(iterator):
    # iterator is a (url, value) pair; fetch the url and emit (url, [outgoing link]) pairs
    s = iterator[0]
    lista = []
    try:
        if 'http' not in s[0:4]:
            s = 'http://' + s
        res = requests.get(s)
        soup = bs4.BeautifulSoup(res.text, 'html.parser')
        soups = soup.find_all('a', href=True)
        for element in soups:
            i = ""
            if 'http' in element['href'][0:4]:
                # absolute link
                i = element['href']
            elif '//' in element['href'][0:2]:
                # protocol-relative link
                i = str("http:" + element['href'])
            elif '/' in element['href'][0]:
                # site-relative link
                if s[len(s) - 1] == '/':
                    s = s[0:len(s) - 1]
                i = str(s + element['href'])
            if i != "":
                if s != i:
                    lista.append((s, [i]))
        return lista
    except:
        # print("Error while crawling on site: ", s)
        lista.append(("error", ["error"]))
        return lista


def computeContribs(urls, rank):
    # split a page's rank evenly over its outgoing links
    size = len(urls)
    for url in urls:
        yield (url, rank / size)


def process_input(target):
    # accept a single URL or a comma-separated list of URLs
    linked = []
    if ', ' in target:
        for x in target.split(', '):
            linked.append(x)
    else:
        linked.append(target)
    return linked


target = str(sys.argv[1])
depth = int(sys.argv[2])
target = process_input(target)
target2 = target

conf = SparkConf().setAppName("Crawler").setMaster('local[*]')
sc = SparkContext(conf=conf)

links = sc.parallelize([target])
link_filter = sc.parallelize([])
lista = []

ranks = links.flatMap(lambda x: [(v, 1.0) for v in x])
links = links.flatMap(lambda x: [(v, "target") for v in x])

for l in range(0, depth):
    links = links.flatMap(crawl).filter(lambda x: x[0] != x[1])
    rank_links = links.reduceByKey(add)  # .groupByKey() worse on big data
    links = links.map(lambda x: (x[1][0], x[0])).reduceByKey(lambda x, y: x)

    RDD1 = rank_links.join(ranks)
    RDD_Contrib = RDD1.flatMap(lambda x: computeContribs(x[1][0], x[1][1]))
    ranks = RDD_Contrib.reduceByKey(add).mapValues(lambda rank: rank * 0.85 + 0.15)

    link_filter += rank_links.flatMap(lambda x: find_link(x[0], x[1], target2))
    link_filter = link_filter.filter(lambda x: x != None)

links.saveAsTextFile('gs://bucket/crawl/text1')
rank_links.saveAsTextFile('gs://bucket/crawl/text2')
```
Used in PowerShell in Visual Studio Code to submit the job:

```
gcloud dataproc jobs submit pyspark test.py --cluster="cluster-crawl"
```

Used in Google Cloud Shell to create the cluster:

```
gcloud beta dataproc clusters create cluster-crawl --enable-component-gateway --subnet default --zone us-east1-d --master-machine-type n1-standard-2 --master-boot-disk-size 500 --num-workers 3 --worker-machine-type n1-standard-2 --worker-boot-disk-size 500 --image-version 1.3-deb9 --project laboration1-236309 --metadata 'PIP_PACKAGES=bs4' --initialization-actions gs://dataproc-initialization-actions/python/pip-install.sh
```
These are the resulting run times:
depth = 1 takes around 30 seconds on both the cloud and my laptop.
depth = 2 takes around 1 minute on both the cloud and my laptop.
depth = 3 takes around 20 minutes on my laptop and about 1 hour on the cloud.
I have not tried larger depths, but this gap between the laptop and the cluster seems unrealistic.
Below is an image of the CPU utilization:

[CPU utilization][1]

[1]: https://i.stack.imgur.com/hQ8cL.png