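I am scraping files from a website and want to rename them based on the names of directories that already exist on my computer (or, more simply, on a list containing those directory names). The goal is to keep a consistent naming convention.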
For example, I already have directories named:
Barone Capital Management, Gabagool Alternative Investments, Aprile Asset Management, Webistics Investments
The scraped data contains some exact matches, some "fuzzy" matches, and some new values:
Barone, Gabagool LLC, Aprile Asset Management, New Name, Webistics Investments
I want the scraped files to adopt the naming convention of the existing directories. For example, Barone would become Barone Capital Management, and Gabagool LLC would be renamed Gabagool Alternative Investments.
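What is the best way to achieve this? I have looked at Fuzzywuzzy and a few other libraries, but I am not sure which approach is right.
Here is my existing code, which simply names each file after its anchor text: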
    import urllib.request

    import requests
    from bs4 import BeautifulSoup

    url = 'https://old.reddit.com/r/test/comments/b71ug1/testpostr23432432/'
    headers = {'User-Agent': 'Mozilla/5.0'}
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.text, 'html.parser')
    table = soup.find_all('table')[0]

    # Download every linked letter, naming the file after the anchor text.
    for anchor in table.find_all('a'):
        try:
            if not anchor:
                continue
            fund_name = anchor.text
            letter_link = anchor['href']
            urllib.request.urlretrieve(letter_link, '2018 Q4 ' + fund_name + '.pdf')
        except Exception:
            # Skip anchors without an href and links that fail to download.
            pass
Note that the directories have already been created and look like this:
- /Users/user/Dropbox/Letters/Barone Capital Management
- /Users/user/Dropbox/Letters/Aprile Asset Management
- /Users/user/Dropbox/Letters/Webistics Investments
- /Users/user/Dropbox/Letters/Gabagool Alternative Investments
- /Users/user/Dropbox/Letters/Ro Capital
- /Users/user/Dropbox/Letters/Vitoon Capital
Answer 0 (score: 1)
Following "Python: find closest string (from a list) to another string", you can use difflib.get_close_matches (https://docs.python.org/3/library/difflib.html#difflib.get_close_matches) to find the most similar string in a list. Your list would be the folders whose absolute paths you already have:
    from difflib import get_close_matches

    # candidates: the list of existing folder names to match against
    best_options = get_close_matches(fund_name, candidates, n=1)
    if best_options:
        directory = best_options[0]
    else:
        directory = 'New Name'
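To make the snippet above concrete, here is a minimal sketch of how candidates could be built from the directory paths listed in the question. It assumes it is better to compare only the last path component (the folder name) than the full absolute path, since the shared /Users/user/Dropbox/Letters/ prefix would otherwise dilute the similarity score. The helper name closest_directory is hypothetical, not part of difflib.

```python
import difflib
import os


def closest_directory(fund_name, directories, cutoff=0.6):
    """Return the directory whose folder name best matches fund_name, or None.

    closest_directory is a hypothetical helper; 0.6 is difflib's default cutoff.
    """
    # Map each folder name (last path component) back to its full path.
    by_name = {os.path.basename(os.path.normpath(d)): d for d in directories}
    matches = difflib.get_close_matches(fund_name, list(by_name), n=1, cutoff=cutoff)
    return by_name[matches[0]] if matches else None


directories = [
    '/Users/user/Dropbox/Letters/Barone Capital Management',
    '/Users/user/Dropbox/Letters/Aprile Asset Management',
    '/Users/user/Dropbox/Letters/Webistics Investments',
    '/Users/user/Dropbox/Letters/Gabagool Alternative Investments',
    '/Users/user/Dropbox/Letters/Ro Capital',
    '/Users/user/Dropbox/Letters/Vitoon Capital',
]

# An exact name and a near miss both resolve to an existing folder,
# while an unrelated name yields None.
exact = closest_directory('Aprile Asset Management', directories)
near = closest_directory('Webistics Investment', directories)
unknown = closest_directory('New Name', directories)
```

Returning None for an unknown name leaves it to the caller to decide whether to create a new folder or keep the scraped name as is.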
Answer 1 (score: 0)
Got it working:
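    import os
    from difflib import get_close_matches

    # downloads_folder, period and candidates are defined elsewhere in the script
    best_options = get_close_matches(fund_name, candidates, n=1, cutoff=.5)
    try:
        if best_options:
            old_path = downloads_folder + period + " " + fund_name + ".pdf"
            os.rename(old_path, downloads_folder + period + " " + best_options[0] + ".pdf")
    except OSError:
        pass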
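One caveat with a cutoff of .5: difflib scores a pair as 2*M/T, where M is the number of matching characters and T is the combined length of both strings, so a short scraped name is penalized against a long folder name. 'Barone' against 'Barone Capital Management' scores 12/31, roughly 0.39, and so falls below the cutoff. The sketch below adds a fallback on the first word for such cases. The fallback and the helper name canonical_name are my own assumptions, not part of either answer, and the heuristic may not suit every set of names.

```python
import difflib


def canonical_name(fund_name, candidates, cutoff=0.5):
    """Return the existing name that fund_name most likely refers to.

    Hypothetical helper: falls back to fund_name itself when nothing matches.
    """
    close = difflib.get_close_matches(fund_name, candidates, n=1, cutoff=cutoff)
    if close:
        return close[0]
    # Fallback (an assumption, not from the answers): compare first words only.
    words = fund_name.split()
    if not words:
        return fund_name
    first_word = words[0].lower()
    same_first_word = [
        c for c in candidates
        if c.split() and c.split()[0].lower() == first_word
    ]
    # Accept the fallback only when exactly one candidate shares the first word.
    return same_first_word[0] if len(same_first_word) == 1 else fund_name


candidates = [
    'Barone Capital Management',
    'Gabagool Alternative Investments',
    'Aprile Asset Management',
    'Webistics Investments',
    'Ro Capital',
    'Vitoon Capital',
]

scraped = ['Barone', 'Gabagool LLC', 'Aprile Asset Management',
           'New Name', 'Webistics Investments']
renamed = [canonical_name(name, candidates) for name in scraped]
```

The resolved name can then be used directly when building the target path, which avoids downloading under one name and renaming afterwards.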