我正在编写一个脚本来抓取一个网站并下载当前的mp3文件,它使用BeautifulSoup
和Python 3.6.3
从各个艺术家页面中抓取所有链接,但是一些抓取到的网址中包含不止一个"."。当我执行
request.get(url, headers=<random header using fake_useragent>)
时,它不会下载该文件,我该如何更正?
例如:
URL > abc.com/mp3/artist/songs/song.com.mp3 (not downloading)
URL > abc.com/mp3/artist/songs/song.mp3 (downloading)
代码:
def download_mp3(url_list_file, download_dir):
    """Download every MP3 listed in *url_list_file* (one URL per line)
    into a per-artist subdirectory of *download_dir*.

    Expected URL shape: scheme://host/mp3/<artist>/songs/<file>.mp3
    (artist is path segment 4, the file name is segment 6).

    Parameters:
        url_list_file: path to a UTF-8 text file of MP3 URLs, one per line.
        download_dir:  root directory for the downloaded files.
    """
    # Build the UserAgent database once, not on every loop iteration.
    ua = fake_useragent.UserAgent(verify_ssl=False)
    with open(url_list_file, 'r', encoding='utf-8') as urls:
        for url in urls:
            # BUG FIX: the line keeps its trailing newline; strip it before
            # splitting and before handing the URL to requests.get.
            url = url.strip()
            if not url:
                continue  # skip blank lines
            parts = url.split('/')
            artist_dir = parts[4]
            song_name = parts[6]
            # BUG FIX: the original else-branch lowercased the full joined
            # path (download_to) instead of the song name, yielding a wrong
            # nested path when joined with download_dir again. Always clean
            # only the file name: drop "www." prefixes and bracket noise.
            clean_name = (song_name.lower()
                          .replace("www.", "")
                          .replace("[", "")
                          .replace("]", ""))
            save_path = os.path.join(download_dir, artist_dir)
            # Replaces the racy isdir()-then-mkdir() pair; also removes the
            # duplicated download code that lived in both branches.
            os.makedirs(save_path, exist_ok=True)
            mp3_file_name = os.path.join(save_path, clean_name)
            header = {'User-Agent': str(ua.random)}
            print(url)
            response = requests.get(url, headers=header)
            # BUG FIX: file handles were opened and never closed; the bare
            # `urls.closed` at the end was a no-op attribute access.
            with open(mp3_file_name, "wb") as mp3_file:
                mp3_file.write(response.content)
            # Polite randomized delay between downloads.
            time.sleep(random.randint(5, 10))
答案 0 :(得分:0)
纠正了它:将 URL 中除最后一个 ".mp3" 之外的所有 "." 用百分号编码(%2E)替换:
string.replace('.', '%2E', string.count('.') - 1)