How to rename scraped files to match an existing list using Python

Time: 2019-03-29 20:17:30

Tags: python regex python-3.x nlp rename

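I am scraping files from a website and want to rename them based on the names of directories that already exist on my computer (or, more simply, on a list containing those directory names). The goal is to keep a consistent naming convention.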

For example, I already have directories named:

Barone Capital Management, Gabagool Alternative Investments, Aprile Asset Management, Webistics Investments

The scraped data contains some exact matches, some "fuzzy" matches, and some new values:

Barone, Gabagool LLC, Aprile Asset Management, New Name, Webistics Investments

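I want the scraped files to adopt the naming convention of the existing directories. For example, Barone would become Barone Capital Management, and Gabagool LLC would be renamed to Gabagool Alternative Investments.

What is the best way to achieve this? I have looked at Fuzzywuzzy and a few other libraries, but I am not sure which is the right approach.

Here is my existing code, which simply names each file after its anchor text: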

import requests
import urllib.request
from bs4 import BeautifulSoup

url = 'https://old.reddit.com/r/test/comments/b71ug1/testpostr23432432/'
headers = {'User-Agent': 'Mozilla/5.0'}
page = requests.get(url, headers=headers)

soup = BeautifulSoup(page.text, 'html.parser')
table = soup.find_all('table')[0]

# Download every linked letter, naming each file after its anchor text
for anchor in table.find_all('a'):
    try:
        if not anchor:
            continue
        fund_name = anchor.text
        letter_link = anchor['href']
        urllib.request.urlretrieve(letter_link, '2018 Q4 ' + fund_name + '.pdf')
    except Exception:
        # Skip anchors without an href and downloads that fail
        pass

Note that the directories have already been created and look like this:

 - /Users/user/Dropbox/Letters/Barone Capital Management
 - /Users/user/Dropbox/Letters/Aprile Asset Management
 - /Users/user/Dropbox/Letters/Webistics Investments
 - /Users/user/Dropbox/Letters/Gabagool Alternative Investments
 - /Users/user/Dropbox/Letters/Ro Capital
 - /Users/user/Dropbox/Letters/Vitoon Capital

2 answers:

Answer 0: (score: 1)

Following "Python: find closest string (from a list) to another string":

You can use difflib.get_close_matches (https://docs.python.org/3/library/difflib.html#difflib.get_close_matches) to find the most similar string in a list. Your list would be the folders whose absolute paths you already have:

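from difflib import get_close_matches

# candidates: the list of existing directory names to match against
best_options = get_close_matches(fund_name, candidates, n=1)

if best_options:
    directory = best_options[0]
else:
    # No close match: treat the scraped name as a new one
    directory = 'New Name'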
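
As a minimal sketch of the lookup above, the following compares the scraped name against folder names only and then maps the winner back to its full path, since including the shared `/Users/user/Dropbox/Letters/` prefix in every candidate would lower every score. The helper name `closest_directory` and the default `cutoff=0.35` are assumptions for illustration, not something from the question. The lower cutoff is there because difflib scores a pair as 2*M/T (matched characters over total characters in both strings), so `Barone` against `Barone Capital Management` scores only 2*6/31, about 0.39, which is below the default cutoff of 0.6.

```python
import os
from difflib import get_close_matches

# Existing directories, copied from the question.
directories = [
    "/Users/user/Dropbox/Letters/Barone Capital Management",
    "/Users/user/Dropbox/Letters/Aprile Asset Management",
    "/Users/user/Dropbox/Letters/Webistics Investments",
    "/Users/user/Dropbox/Letters/Gabagool Alternative Investments",
    "/Users/user/Dropbox/Letters/Ro Capital",
    "/Users/user/Dropbox/Letters/Vitoon Capital",
]


def closest_directory(fund_name, directories, cutoff=0.35):
    """Return the directory whose folder name is closest to fund_name, or None.

    closest_directory is a hypothetical helper; cutoff=0.35 is an assumed
    value that happens to suit the sample names, not a recommended default.
    """
    # Compare against folder names only, then map the winner back to its path.
    by_name = {os.path.basename(path): path for path in directories}
    best = get_close_matches(fund_name, list(by_name), n=1, cutoff=cutoff)
    return by_name[best[0]] if best else None


for scraped in ["Barone", "Gabagool LLC", "Aprile Asset Management", "New Name"]:
    print(scraped, "->", closest_directory(scraped, directories))
```

A cutoff this low is a trade-off: it accepts short names such as `Barone`, but on a larger list it may also accept unrelated names, so it is worth checking the matches by eye before renaming anything.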

Answer 1: (score: 0)

Got it working:

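import os
from difflib import get_close_matches

# downloads_folder, period, fund_name and candidates are defined elsewhere (not shown)
best_options = get_close_matches(fund_name, candidates, n=1, cutoff=.5)

try:
    if best_options:
        fund_name = (downloads_folder + period + " " + fund_name + ".pdf")
        os.rename(fund_name, downloads_folder + period + " " + best_options[0] + ".pdf")
except OSError:
    # Ignore files that are missing or cannot be renamed
    pass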
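
A slightly fuller sketch of the same rename step, wrapped in a function so the matching and the renaming can be checked separately. The function name `rename_to_match` is hypothetical. `downloads_folder` and `period` are not defined in the snippet above, so here they are assumed to be a folder path and a filename prefix such as `2018 Q4`, following the question's download code. The sketch catches only `OSError`, so a failed rename is reported rather than silently skipped.

```python
import os
from difflib import get_close_matches


def rename_to_match(downloads_folder, period, fund_name, candidates, cutoff=0.5):
    """Rename '<period> <fund_name>.pdf' after the closest candidate name.

    Returns the new path, or None when nothing matched or the rename failed.
    Hypothetical helper: the argument meanings are assumptions (see above).
    """
    best_options = get_close_matches(fund_name, candidates, n=1, cutoff=cutoff)
    if not best_options:
        return None  # no close match: leave the file under its scraped name

    old_path = os.path.join(downloads_folder, f"{period} {fund_name}.pdf")
    new_path = os.path.join(downloads_folder, f"{period} {best_options[0]}.pdf")
    if old_path == new_path:
        return new_path  # exact match, nothing to rename

    try:
        os.rename(old_path, new_path)
    except OSError as err:
        print(f"Could not rename {old_path}: {err}")
        return None
    return new_path
```

With the sample names from the question, a short name like `Gabagool LLC` scores about 0.41 against `Gabagool Alternative Investments`, so a call would need something like `cutoff=0.35` for that file to be renamed; at `cutoff=.5` it would be left as is.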