All dictionary or list values are updated with the same value in Python

Time: 2018-03-28 06:21:33

Tags: python directory web-crawler updates

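Background

Using Python, I crawl a set of websites whose URLs are stored in a list by iterating over that list. Each URL is taken from the list and crawled by a function. The function returns a response, and the crawled data is added to a dictionary.

Problem

Every time the crawl function returns a new response and that response is added to the dictionary, all values in the dictionary are updated with the latest value. I also tried appending the responses to a list, and all values in the list are likewise updated with the latest response.

Debugging attempted

I printed each individual response in every iteration, both before and after adding it to the dictionary or list. The response is identical before and after being added, and it differs from one iteration to the next, so the individual responses are distinct as expected. Nevertheless, the entire dictionary or list ends up updated with the latest value.

Code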

for jobListingPage in jobListingPages:
    try:
        r = urllib.urlopen(jobListingPage).read()
        soup = BeautifulSoup(r, "html.parser")
        jobsSummaryMarkup = soup.find_all("h2", class_=["g-col10"])
        i = 0
        for jobSummaryMarkup in jobsSummaryMarkup:
            jobDetailsURL = base_url_sof+str(jobSummaryMarkup.a["href"])
            jobDetailsFindRes = find_job_details(jobDetailsURL)
            if(jobDetailsFindRes[0] == 0):
                #print("******crawled response before adding")
                #print(jobDetailsFindRes[1])
                i=i+1
                all_jobs_data["job "+str(i)] = jobDetailsFindRes[1]
                #print("******crawled response after adding")
                #print(jobDetailsFindRes[1])
                #print("******cumulative dictionary")
                #print(all_jobs_data)
                #print("###########################################")
        return([0, all_jobs_data])
    except Exception as e:
        return([-1, e])

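Output of the above code

With the print statements uncommented, I get the following output. It is shown after three iterations, i.e. after crawling three job pages from the list.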

******crawled response before adding
{'location_name': 'Bengaluru', 'tags': ['user-interface', 'html5', 'javascript', 'angularjs', 'reactjs'], 'job_url': 'http://www.stackoverflow.com/jobs/170630/ui-front-end-developer-citrix', 'Experience level': ['Mid-Level', ' Senior', ' Lead'], 'Job type': ['Permanent'], 'Role': ['Frontend Developer'], 'company_name': 'Citrix', 'job_name': 'UI /Front-End Developer'}
******crawled response after adding
{'location_name': 'Bengaluru', 'tags': ['user-interface', 'html5', 'javascript', 'angularjs', 'reactjs'], 'job_url': 'http://www.stackoverflow.com/jobs/170630/ui-front-end-developer-citrix', 'Experience level': ['Mid-Level', ' Senior', ' Lead'], 'Job type': ['Permanent'], 'Role': ['Frontend Developer'], 'company_name': 'Citrix', 'job_name': 'UI /Front-End Developer'}
******cumulative dictionary
{'job 1': {'location_name': 'Bengaluru', 'tags': ['user-interface', 'html5', 'javascript', 'angularjs', 'reactjs'], 'job_url': 'http://www.stackoverflow.com/jobs/170630/ui-front-end-developer-citrix', 'Experience level': ['Mid-Level', ' Senior', ' Lead'], 'Job type': ['Permanent'], 'Role': ['Frontend Developer'], 'company_name': 'Citrix', 'job_name': 'UI /Front-End Developer'}}
#########################################
******crawled response before adding
{'location_name': 'Bengaluru', 'tags': ['python', 'django', 'java'], 'job_url': 'http://www.stackoverflow.com/jobs/171885/full-stack-developer-mishipay', 'Industry': ['Mobile Payments', ' POS', ' Retail'], 'Experience level': ['Mid-Level'], 'Job type': ['Permanent'], 'Role': ['Full Stack Developer'], 'company_name': 'MishiPay', 'job_name': 'Full Stack Developer'}
******crawled response after adding
{'location_name': 'Bengaluru', 'tags': ['python', 'django', 'java'], 'job_url': 'http://www.stackoverflow.com/jobs/171885/full-stack-developer-mishipay', 'Industry': ['Mobile Payments', ' POS', ' Retail'], 'Experience level': ['Mid-Level'], 'Job type': ['Permanent'], 'Role': ['Full Stack Developer'], 'company_name': 'MishiPay', 'job_name': 'Full Stack Developer'}
******cumulative dictionary
{'job 1': {'location_name': 'Bengaluru', 'tags': ['python', 'django', 'java'], 'job_url': 'http://www.stackoverflow.com/jobs/171885/full-stack-developer-mishipay', 'Industry': ['Mobile Payments', ' POS', ' Retail'], 'Experience level': ['Mid-Level'], 'Job type': ['Permanent'], 'Role': ['Full Stack Developer'], 'company_name': 'MishiPay', 'job_name': 'Full Stack Developer'}, 'job 2': {'location_name': 'Bengaluru', 'tags': ['python', 'django', 'java'], 'job_url': 'http://www.stackoverflow.com/jobs/171885/full-stack-developer-mishipay', 'Industry': ['Mobile Payments', ' POS', ' Retail'], 'Experience level': ['Mid-Level'], 'Job type': ['Permanent'], 'Role': ['Full Stack Developer'], 'company_name': 'MishiPay', 'job_name': 'Full Stack Developer'}}
#########################################
******crawled response before adding
{'location_name': 'Hyderabad', 'tags': ['architecture', 'web-services', 'togaf', 'websecurity', 'bigdata'], 'job_url': 'http://www.stackoverflow.com/jobs/168402/web-security-architect-in-fintech-big-data-paysafe', 'Industry': ['Financial Services', ' Financial Technology', ' Information Technology'], 'Experience level': ['Mid-Level', ' Senior'], 'Job type': ['Permanent'], 'Role': ['System Administrator'], 'company_name': 'Paysafe', 'job_name': 'Web Security Architect  in Fintech & Big Data'}
******crawled response after adding
{'location_name': 'Hyderabad', 'tags': ['architecture', 'web-services', 'togaf', 'websecurity', 'bigdata'], 'job_url': 'http://www.stackoverflow.com/jobs/168402/web-security-architect-in-fintech-big-data-paysafe', 'Industry': ['Financial Services', ' Financial Technology', ' Information Technology'], 'Experience level': ['Mid-Level', ' Senior'], 'Job type': ['Permanent'], 'Role': ['System Administrator'], 'company_name': 'Paysafe', 'job_name': 'Web Security Architect  in Fintech & Big Data'}
******cumulative dictionary
{'job 1': {'location_name': 'Hyderabad', 'tags': ['architecture', 'web-services', 'togaf', 'websecurity', 'bigdata'], 'job_url': 'http://www.stackoverflow.com/jobs/168402/web-security-architect-in-fintech-big-data-paysafe', 'Industry': ['Financial Services', ' Financial Technology', ' Information Technology'], 'Experience level': ['Mid-Level', ' Senior'], 'Job type': ['Permanent'], 'Role': ['System Administrator'], 'company_name': 'Paysafe', 'job_name': 'Web Security Architect  in Fintech & Big Data'}, 'job 2': {'location_name': 'Hyderabad', 'tags': ['architecture', 'web-services', 'togaf', 'websecurity', 'bigdata'], 'job_url': 'http://www.stackoverflow.com/jobs/168402/web-security-architect-in-fintech-big-data-paysafe', 'Industry': ['Financial Services', ' Financial Technology', ' Information Technology'], 'Experience level': ['Mid-Level', ' Senior'], 'Job type': ['Permanent'], 'Role': ['System Administrator'], 'company_name': 'Paysafe', 'job_name': 'Web Security Architect  in Fintech & Big Data'}, 'job 3': {'location_name': 'Hyderabad', 'tags': ['architecture', 'web-services', 'togaf', 'websecurity', 'bigdata'], 'job_url': 'http://www.stackoverflow.com/jobs/168402/web-security-architect-in-fintech-big-data-paysafe', 'Industry': ['Financial Services', ' Financial Technology', ' Information Technology'], 'Experience level': ['Mid-Level', ' Senior'], 'Job type': ['Permanent'], 'Role': ['System Administrator'], 'company_name': 'Paysafe', 'job_name': 'Web Security Architect  in Fintech & Big Data'}}
#########################################

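The last item is propagated through the entire dictionary and overwrites every entry. The same happens if I append the last item to a list: the whole list is updated with the last item.

How can I add distinct items to the dictionary instead of having the whole dictionary updated with the same last item?

Edit: added a version of the code that appends the response to a list instead of adding it to a dictionary.

Code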

for jobListingPage in jobListingPages:
    try:
        r = urllib.urlopen(jobListingPage).read()
        soup = BeautifulSoup(r, "html.parser")
        jobsSummaryMarkup = soup.find_all("h2", class_=["g-col10"])
        for jobSummaryMarkup in jobsSummaryMarkup:
            jobDetailsURL = base_url_sof+str(jobSummaryMarkup.a["href"])
            jobDetailsFindRes = find_job_details(jobDetailsURL)
            if(jobDetailsFindRes[0] == 0):
                #print("******crawled response before adding")
                #print(jobDetailsFindRes[1])
                all_jobs_data_list.append(jobDetailsFindRes[1])
                #print("******crawled response after adding")
                #print(jobDetailsFindRes[1])
                #print("******cumulative list")
                #print(all_jobs_data_list)
                #print("###########################################")
        return([0, all_jobs_data_list])
    except Exception as e:
        return([-1, e])

Output of the above code:

******crawled response before adding
{'location_name': 'Bengaluru', 'tags': ['user-interface', 'html5', 'javascript', 'angularjs', 'reactjs'], 'job_url': 'http://www.stackoverflow.com/jobs/170630/ui-front-end-developer-citrix', 'Experience level': ['Mid-Level', ' Senior', ' Lead'], 'Job type': ['Permanent'], 'Role': ['Frontend Developer'], 'company_name': 'Citrix', 'job_name': 'UI /Front-End Developer'}
******crawled response after adding
{'location_name': 'Bengaluru', 'tags': ['user-interface', 'html5', 'javascript', 'angularjs', 'reactjs'], 'job_url': 'http://www.stackoverflow.com/jobs/170630/ui-front-end-developer-citrix', 'Experience level': ['Mid-Level', ' Senior', ' Lead'], 'Job type': ['Permanent'], 'Role': ['Frontend Developer'], 'company_name': 'Citrix', 'job_name': 'UI /Front-End Developer'}
******cumulative dictionary
[{'location_name': 'Bengaluru', 'tags': ['user-interface', 'html5', 'javascript', 'angularjs', 'reactjs'], 'job_url': 'http://www.stackoverflow.com/jobs/170630/ui-front-end-developer-citrix', 'Experience level': ['Mid-Level', ' Senior', ' Lead'], 'Job type': ['Permanent'], 'Role': ['Frontend Developer'], 'company_name': 'Citrix', 'job_name': 'UI /Front-End Developer'}]
#########################################
******crawled response before adding
{'location_name': 'Bengaluru', 'tags': ['python', 'django', 'java'], 'job_url': 'http://www.stackoverflow.com/jobs/171885/full-stack-developer-mishipay', 'Industry': ['Mobile Payments', ' POS', ' Retail'], 'Experience level': ['Mid-Level'], 'Job type': ['Permanent'], 'Role': ['Full Stack Developer'], 'company_name': 'MishiPay', 'job_name': 'Full Stack Developer'}
******crawled response after adding
{'location_name': 'Bengaluru', 'tags': ['python', 'django', 'java'], 'job_url': 'http://www.stackoverflow.com/jobs/171885/full-stack-developer-mishipay', 'Industry': ['Mobile Payments', ' POS', ' Retail'], 'Experience level': ['Mid-Level'], 'Job type': ['Permanent'], 'Role': ['Full Stack Developer'], 'company_name': 'MishiPay', 'job_name': 'Full Stack Developer'}
******cumulative dictionary
[{'location_name': 'Bengaluru', 'tags': ['python', 'django', 'java'], 'job_url': 'http://www.stackoverflow.com/jobs/171885/full-stack-developer-mishipay', 'Industry': ['Mobile Payments', ' POS', ' Retail'], 'Experience level': ['Mid-Level'], 'Job type': ['Permanent'], 'Role': ['Full Stack Developer'], 'company_name': 'MishiPay', 'job_name': 'Full Stack Developer'}, {'location_name': 'Bengaluru', 'tags': ['python', 'django', 'java'], 'job_url': 'http://www.stackoverflow.com/jobs/171885/full-stack-developer-mishipay', 'Industry': ['Mobile Payments', ' POS', ' Retail'], 'Experience level': ['Mid-Level'], 'Job type': ['Permanent'], 'Role': ['Full Stack Developer'], 'company_name': 'MishiPay', 'job_name': 'Full Stack Developer'}]
#########################################
******crawled response before adding
{'location_name': 'Hyderabad', 'tags': ['architecture', 'web-services', 'togaf', 'websecurity', 'bigdata'], 'job_url': 'http://www.stackoverflow.com/jobs/168402/web-security-architect-in-fintech-big-data-paysafe', 'Industry': ['Financial Services', ' Financial Technology', ' Information Technology'], 'Experience level': ['Mid-Level', ' Senior'], 'Job type': ['Permanent'], 'Role': ['System Administrator'], 'company_name': 'Paysafe', 'job_name': 'Web Security Architect  in Fintech & Big Data'}
******crawled response after adding
{'location_name': 'Hyderabad', 'tags': ['architecture', 'web-services', 'togaf', 'websecurity', 'bigdata'], 'job_url': 'http://www.stackoverflow.com/jobs/168402/web-security-architect-in-fintech-big-data-paysafe', 'Industry': ['Financial Services', ' Financial Technology', ' Information Technology'], 'Experience level': ['Mid-Level', ' Senior'], 'Job type': ['Permanent'], 'Role': ['System Administrator'], 'company_name': 'Paysafe', 'job_name': 'Web Security Architect  in Fintech & Big Data'}
******cumulative dictionary
[{'location_name': 'Hyderabad', 'tags': ['architecture', 'web-services', 'togaf', 'websecurity', 'bigdata'], 'job_url': 'http://www.stackoverflow.com/jobs/168402/web-security-architect-in-fintech-big-data-paysafe', 'Industry': ['Financial Services', ' Financial Technology', ' Information Technology'], 'Experience level': ['Mid-Level', ' Senior'], 'Job type': ['Permanent'], 'Role': ['System Administrator'], 'company_name': 'Paysafe', 'job_name': 'Web Security Architect  in Fintech & Big Data'}, {'location_name': 'Hyderabad', 'tags': ['architecture', 'web-services', 'togaf', 'websecurity', 'bigdata'], 'job_url': 'http://www.stackoverflow.com/jobs/168402/web-security-architect-in-fintech-big-data-paysafe', 'Industry': ['Financial Services', ' Financial Technology', ' Information Technology'], 'Experience level': ['Mid-Level', ' Senior'], 'Job type': ['Permanent'], 'Role': ['System Administrator'], 'company_name': 'Paysafe', 'job_name': 'Web Security Architect  in Fintech & Big Data'}, {'location_name': 'Hyderabad', 'tags': ['architecture', 'web-services', 'togaf', 'websecurity', 'bigdata'], 'job_url': 'http://www.stackoverflow.com/jobs/168402/web-security-architect-in-fintech-big-data-paysafe', 'Industry': ['Financial Services', ' Financial Technology', ' Information Technology'], 'Experience level': ['Mid-Level', ' Senior'], 'Job type': ['Permanent'], 'Role': ['System Administrator'], 'company_name': 'Paysafe', 'job_name': 'Web Security Architect  in Fintech & Big Data'}]
#########################################

Sample data for jobListingPages

['https://stackoverflow.com/jobs?sort=p&l=India&d=100&u=Km', 'https://stackoverflow.com/jobs?l=India&d=100&u=Km&sort=i&pg=2']

Sample data for jobDetailsURL

http://www.stackoverflow.com/jobs/170630/ui-front-end-developer-citrix
http://www.stackoverflow.com/jobs/171885/full-stack-developer-mishipay
http://www.stackoverflow.com/jobs/168402/web-security-architect-in-fintech-big-data-paysafe

2 Answers:

Answer 0 (score: 0)

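I believe i = 0 is the culprit. Please move it outside the outer for loop and try again. As written, the job counter is reset for every URL in the list, so existing values under the same key (for example, "job 1") are overwritten.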
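
The effect described in this answer can be sketched as follows. The `pages` data and the `collect` helper are hypothetical stand-ins, not part of the question's code. Note that a counter reset only overwrites keys across pages; on its own it would not make "job 1", "job 2" and "job 3" from the same page hold identical values, as seen in the question's output.

```python
# Hypothetical stand-in data: two listing pages, each with two job titles.
pages = [["job A", "job B"], ["job C", "job D"]]


def collect(reset_per_page):
    """Collect job titles into a dict keyed "job <n>"."""
    jobs = {}
    i = 0
    for page in pages:
        if reset_per_page:
            i = 0  # counter restarts on every page, as in the question's code
        for title in page:
            i += 1
            jobs["job " + str(i)] = title
    return jobs


overwritten = collect(reset_per_page=True)
kept = collect(reset_per_page=False)

print(overwritten)  # {'job 1': 'job C', 'job 2': 'job D'}
print(len(kept))    # 4
```

With the counter initialised once before the outer loop, every job gets its own key.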

Answer 1 (score: 0)

Solved it.

I don't know why it works, but using all_jobs_data_list.append(str(jobDetailsFindRes[1])) instead of all_jobs_data_list.append(jobDetailsFindRes[1]) did the job for me.

Similarly, with all_jobs_data["job "+str(i)] = str(jobDetailsFindRes[1]) in place of all_jobs_data["job "+str(i)] = jobDetailsFindRes[1], the entries are different.

If anyone can explain this, it would be much appreciated :)
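
A likely explanation, offered as an assumption because find_job_details is not shown: the function may fill and return the same dictionary object on every call. Storing that object stores a reference, so every entry in the list or dictionary points at one shared object that always shows the latest data. Calling str() creates a separate, immutable text snapshot each time, which is why the entries then differ. The sketch below uses a hypothetical find_job_details_shared function and made-up fields to reproduce the symptom, and shows copy.deepcopy as a way to keep real dictionaries instead of strings.

```python
import copy

# Hypothetical stand-in for find_job_details: it reuses ONE module-level dict
# on every call (an assumption; the real function is not shown in the question).
_shared_details = {}


def find_job_details_shared(company):
    _shared_details.clear()
    _shared_details["company_name"] = company
    _shared_details["tags"] = [company.lower()]
    return [0, _shared_details]


companies = ["Citrix", "MishiPay", "Paysafe"]

aliased = []     # references to the shared dict
as_strings = []  # text snapshots, as in the str() workaround
copied = []      # independent dicts

for company in companies:
    result = find_job_details_shared(company)
    aliased.append(result[1])
    as_strings.append(str(result[1]))
    copied.append(copy.deepcopy(result[1]))

print([d["company_name"] for d in aliased])  # ['Paysafe', 'Paysafe', 'Paysafe']
print(aliased[0] is aliased[1])              # True: one object stored three times
print([d["company_name"] for d in copied])   # ['Citrix', 'MishiPay', 'Paysafe']
```

If this is the cause, the cleaner fix is usually inside the crawl function: create a fresh dictionary at the start of each call rather than reusing one defined at module level or as a mutable default argument. Checking `a is b` or comparing `id()` values of two stored entries is a quick way to confirm whether they are the same object.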