Web scraping with Multiprocessing and BeautifulSoup: MaybeEncodingError and RecursionError

Time: 2018-09-01 01:57:29

Tags: python selenium beautifulsoup multiprocessing

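I am working on a web scraping project. The first dropdown menu has about 800 options and the second has more than 20 values, so the process is very slow. I therefore tried multiprocessing, hoping to speed things up a little. However, I get an error message that I cannot resolve.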

My code is:

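import re
import time
from multiprocessing import Pool

import pandas as pd
from bs4 import BeautifulSoup
from selenium.webdriver.support.ui import Select

# driver, cities1 and years are defined elsewhere (not shown in the question)

def create_df(city_var, year_var):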
    city = Select(driver.find_element_by_css_selector("select[id*='Main_csCity_ddlEntity1']"))
    city.select_by_visible_text(city_var) 
    year = Select(driver.find_element_by_css_selector("select[id*='Main_csCity_ddlYear1']"))
    year.select_by_visible_text(year_var) 
    try:
        driver.find_element_by_xpath('//input[@type="submit"]').click()
    except Exception as e:
        time.sleep(1)
        driver.find_element_by_xpath('//input[@type="submit"]').click()
        print('something wrong:'+city_var+year_var)
    html = driver.page_source
    soup = BeautifulSoup(html, "lxml")
    try:
        small_header = soup.find_all("div",{"class":"ResultsHeader"})
        ret_list = []
        for idx, span in enumerate(small_header[0].find_all("span")):
            if idx in [1,3,5,7]:
                ret_list.append(span.contents[0])
    except:
        print(city_var+year_var)
    try:
        second_header = soup.find_all("tr",{"class":re.compile('Detail.*')})
        ret_list2 = []
        for idx, content in enumerate(second_header):
            if len(content.contents) == 3:
                ret_list2.append([content.contents[1].contents[0], '', '', ''])
            elif len(content.contents) == 7:
                sublist = []
                for idx2 in range(5):
                    if idx2 == 1:
                        continue
                    sublist.append(content.contents[idx2+1].contents[0])
                ret_list2.append(sublist)
            else:
                print('WRONG')
    except:
        print(city_var+year_var)
    ret_list3 = ret_list2[1:]
    ret_list4 = [ret_list+sub for sub in ret_list3]
    return pd.DataFrame(ret_list4)

list_of_city_year = [[x,y] for x in cities1 for y in years]

def return_df(list1):
    df = pd.DataFrame()
    c = list1[0]
    y = list1[1]
    df = df.append(create_df(c, y))
    return df

with Pool(5) as p:
    records = p.map(return_df, list_of_city_year[:100])

The error message is long. It also prints the earlier results, so I have only included the error part here:

  

MaybeEncodingError                        Traceback (most recent call last)
 in ()
      1 with Pool(5) as p:
----> 2     records = p.map(return_df, list_of_city_year[:100])

~/anaconda3/lib/python3.6/multiprocessing/pool.py in map(self, func, iterable, chunksize)
    264         in a list that is returned.
    265         '''
--> 266         return self._map_async(func, iterable, mapstar, chunksize).get()
    267
    268     def starmap(self, func, iterable, chunksize=None):

~/anaconda3/lib/python3.6/multiprocessing/pool.py in get(self, timeout)
    642             return self._value
    643         else:
--> 644             raise self._value
    645
    646     def _set(self, i, obj):

MaybeEncodingError: Error sending result: '[0 1 2 3 4 5 \ ..... ..... ....]'. Reason: 'RecursionError('maximum recursion depth exceeded while calling a Python object',)'

If you have any suggestions on how to improve the code to make it more efficient, please post them below.

1 Answer:

Answer 0 (score: 0)

The MaybeEncodingError occurs mainly because you forgot to use if __name__ == '__main__':. When I tried to create a working version on Windows, I got a strange, repeated error message:

Attempt to start a new process before the current process
has finished its bootstrapping phase.

This probably means that you are on Windows and you have
forgotten to use the proper idiom in the main module:

if __name__ == '__main__':

Quoting my comment verbatim, since it is self-explanatory:

  

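"The error on Windows occurs because every process spawns a new Python process, which interprets the Python file and so on, so everything outside the 'if main' block is executed again."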

For portability, you must use if __name__ == '__main__' when running this module:

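from multiprocessing import Pool, freeze_support


def main():
    # main() was not shown in the answer; presumably it wraps the
    # Pool code from the question so it no longer runs at module level.
    with Pool(5) as p:
        records = p.map(return_df, list_of_city_year[:100])


if __name__ == '__main__':  # you need this on Windows
    freeze_support()
    main()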
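
As a minimal, self-contained sketch of this layout (not the asker's scraper), the script below runs without a browser. The names build_tasks and fetch_row and the sample cities and years are hypothetical, made up for illustration only; fetch_row is a stand-in that does no real scraping. It returns plain str values, which is an assumption on my part that goes beyond the answer: Pool.map pickles each result to send it back to the parent process, and simple built-in types are the least likely to cause trouble there.

```python
from multiprocessing import Pool, freeze_support


def build_tasks(cities, years):
    """Return every (city, year) pair, mirroring list_of_city_year."""
    return [(city, year) for city in cities for year in years]


def fetch_row(task):
    """Hypothetical stand-in for the scraping worker.

    A real worker would drive the browser and parse the page here.
    Values are converted to plain str before being returned, because
    Pool.map pickles each result to send it back to the parent process.
    """
    city, year = task
    return [str(city), str(year)]


def main():
    # Made-up sample data, used only to keep the sketch runnable.
    tasks = build_tasks(["CityA", "CityB"], ["2016", "2017"])
    # The Pool is created inside main(), so nothing at module level
    # starts new processes when a child process re-imports this file.
    with Pool(2) as pool:
        records = pool.map(fetch_row, tasks)
    print(records)


if __name__ == '__main__':  # required on Windows, harmless elsewhere
    freeze_support()
    main()
```

Run it as a script (for example python sketch.py, where the file name is arbitrary). The worker functions are defined at module level so that child processes can import them, while the code that actually starts the pool sits behind the guard.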