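I am trying to read an xlsx file, compare all the reference numbers in one column with the files in a folder and, where they correspond, rename the files to the email address associated with that reference number.
The Excel file has the following fields: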
Reference EmailAddress
1123 bob.smith@yahoo.com
1233 john.drako@gmail.com
1334 samuel.manuel@yahoo.com
... .....
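My folder applicants contains only doc files named after the Reference column.
How do I compare the contents of the applicantsCVs folder with the Reference field in the Excel file and, where they match, rename each file to the corresponding email address?
This is what I have tried so far: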
import os
import pandas as pd

dfOne = pd.read_excel('Book2.xlsx', na_values=['NA'], usecols = "A:D")
references = dfOne['Reference']
emailAddress = dfOne['EmailAddress']

cleanedEmailList = [x for x in emailAddress if str(x) != 'nan']
print(cleanedEmailList)
excelArray = []
filesArray = []

for root, dirs, files in os.walk("applicantCVs"):
    for filename in files:
        print(filename) #Original file name with type 1233.doc
        reworkedFile = os.path.splitext(filename)[0]
        filesArray.append(reworkedFile)

for entry in references:
    excelArray.append(str(entry))

for i in excelArray:
    if i in filesArray:
        print(i, "corresponds to the file names")
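I compare the reference names with the folder contents and print them when they are the same:
for i in excelArray:
    if i in filesArray:
        print(i, "corresponds to the file names")
I tried to rename the files with os.rename(filename, cleanedEmailList), but that does not work because cleanedEmailList is a list of emails.
How can I match and rename the files?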
Answer 0 (score: 2)
Based on this sample data:
Reference EmailAddress
1123 bob.smith@yahoo.com
1233 john.drako@gmail.com
nan jane.smith#example.com
1334 samuel.manuel@yahoo.com
First, build a dict with the references as keys and the new names as values:
references = dict(df.dropna(subset=["Reference","EmailAddress"]).set_index("Reference")["EmailAddress"])
{'1123': 'bob.smith@yahoo.com', '1233': 'john.drako@gmail.com', '1334': 'samuel.manuel@yahoo.com'}
Note that the references here are str. If they are not strings in your original data, you can convert them with astype(str).
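One caveat with the string conversion: when the Reference column contains an empty cell, pandas typically reads the whole column as floats, and converting 1123.0 to a string gives "1123.0", which no longer matches the file stem "1123". Here is a minimal sketch of a normalising step in plain Python. The helper name reference_key and the inline rows are illustrative assumptions, not part of the answer above.

```python
import math

def reference_key(value):
    """Return a reference as the string used in file names, or None if missing."""
    if value is None:
        return None
    if isinstance(value, float):
        if math.isnan(value):
            return None  # empty Excel cell
        if value.is_integer():
            return str(int(value))  # 1123.0 -> "1123"
    return str(value).strip()

# Hypothetical rows as they might come out of the spreadsheet
rows = [
    (1123.0, "bob.smith@yahoo.com"),
    (1233.0, "john.drako@gmail.com"),
    (float("nan"), "jane.smith#example.com"),
    (1334.0, "samuel.manuel@yahoo.com"),
]

# Keep only rows whose reference is present, keyed by the normalised string
references = {
    reference_key(ref): email
    for ref, email in rows
    if reference_key(ref) is not None
}
print(references)
```

The same idea can be applied to a DataFrame column before building the dict; the point is only that the keys must look exactly like the file stems.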
Then use pathlib.Path to find all the files in the data directory:
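from pathlib import Path
# list(...) takes a snapshot, so renaming does not interfere with the iteration
files = list(Path("../data/renames").glob("*"))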
[WindowsPath('../data/renames/1123.docx'), WindowsPath('../data/renames/1156.pptx'), WindowsPath('../data/renames/1233.txt')]
Renaming can then be very simple:
for file in files:
    new_name = references.get(file.stem, file.stem)
    file.rename(file.with_name(f"{new_name}{file.suffix}"))
references.get looks up the new file name and falls back to the original stem if the reference is not found.
[WindowsPath('../data/renames/1156.pptx'), WindowsPath('../data/renames/bob.smith@yahoo.com.docx'), WindowsPath('../data/renames/john.drako@gmail.com.txt')]
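The dict-plus-pathlib idea from this answer can be tried safely in a temporary directory. The sketch below is one possible arrangement, not the answer's exact code: the function name rename_by_reference is made up for illustration, and it adds a check that skips a rename when the target already exists (on POSIX systems Path.rename silently replaces an existing file, while on Windows it raises an error).

```python
import tempfile
from pathlib import Path

def rename_by_reference(folder, references):
    """Rename files whose stem is a key in `references`; return the new paths."""
    renamed = []
    # sorted(...) materialises the listing before any file is renamed
    for file in sorted(Path(folder).iterdir()):
        if not file.is_file():
            continue
        new_stem = references.get(file.stem)
        if new_stem is None:
            continue  # no matching reference: leave the file alone
        target = file.with_name(f"{new_stem}{file.suffix}")
        if target.exists():
            continue  # never overwrite an existing file
        file.rename(target)
        renamed.append(target)
    return renamed

references = {
    "1123": "bob.smith@yahoo.com",
    "1233": "john.drako@gmail.com",
    "1334": "samuel.manuel@yahoo.com",
}

with tempfile.TemporaryDirectory() as tmp:
    # Recreate the three sample files from the answer
    for name in ("1123.docx", "1156.pptx", "1233.txt"):
        (Path(tmp) / name).touch()
    rename_by_reference(tmp, references)
    result = sorted(p.name for p in Path(tmp).iterdir())

print(result)
```

Files without a matching reference (1156.pptx here) are left untouched, and each renamed file keeps its original extension.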
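Answer 1 (score: 0)
How about adding the email addresses (which I guess are your new names) to a dictionary whose keys are your reference numbers? It could look like this: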
cor_dict = {}
for i in excelArray:
    if i in filesArray:
        # excelArray holds strings, so compare the Reference column as strings too
        match = dfOne.loc[dfOne['Reference'].astype(str) == i, 'EmailAddress']
        cor_dict[i] = match.iloc[0]

for ref, email in cor_dict.items():
    path = 'path to file...'
    filename = str(ref) + '.doc'
    new_filename = str(email).replace('@', '_') + '_.doc'
    filepath = os.path.join(path, filename)
    new_filepath = os.path.join(path, new_filename)
    # rename using the full paths, not the bare file names
    os.rename(filepath, new_filepath)
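The replace('@', '_') step above can be generalised if you want file names restricted to a conservative character set. This is optional, since '@' is a legal file name character on the common file systems. A small sketch, where email_to_filename is a hypothetical helper and the allowed set (letters, digits, dot, underscore, hyphen) is an assumption you may want to adjust:

```python
import re

def email_to_filename(email, ext=".doc"):
    """Build a file name from an email address.

    Every character outside letters, digits, '.', '_' and '-' is
    replaced with an underscore; the extension is appended unchanged.
    """
    safe = re.sub(r"[^A-Za-z0-9._-]", "_", email.strip())
    return safe + ext

doc_name = email_to_filename("bob.smith@yahoo.com")
docx_name = email_to_filename("john.drako@gmail.com", ext=".docx")
print(doc_name, docx_name)
```

Note that two different addresses could map to the same sanitised name, so a check for an existing target before renaming is still worthwhile.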
Answer 2 (score: 0)
You can do this directly on the DataFrame with df.apply():
import glob
import os
#Filter out null addresses (copy so that adding a column does not touch df)
df2 = df.dropna(subset=['EmailAddress']).copy()
#Add a column listing the existing files for each reference
df2['Existing_file'] = df2.apply(lambda row: glob.glob("applicantsCVs/{}.*".format(row['Reference'])), axis=1)
#Rename the first match for each row, keeping its extension
df2.apply(lambda row: os.rename(row.Existing_file[0], 'applicantsCVs/{}.{}'.format(row.EmailAddress, row.Existing_file[0].split('.')[-1])) if len(row.Existing_file) else None, axis=1)
print((df2.Existing_file.map(len) > 0).sum(), "existing files renamed")
Edit:
This now works with any extension (.doc, .docx) by using the glob module.
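The glob lookup can also be written as a small standalone function, which makes the "any extension" behaviour easy to check. The following is a sketch under assumptions: the function names are invented for illustration, and a reference is renamed only when exactly one file matches, which is a stricter rule than the answer's "take the first match".

```python
import glob
import os
import tempfile

def find_by_reference(folder, reference):
    """Return all files in `folder` named '<reference>.<anything>'."""
    # glob.escape protects special characters such as '[' in the folder name
    pattern = os.path.join(glob.escape(folder), f"{glob.escape(str(reference))}.*")
    return sorted(glob.glob(pattern))

def rename_reference(folder, reference, email):
    """Rename the single file matching `reference`, keeping its extension."""
    matches = find_by_reference(folder, reference)
    if len(matches) != 1:
        return None  # nothing found, or ambiguous
    ext = os.path.splitext(matches[0])[1]
    target = os.path.join(folder, email + ext)
    os.rename(matches[0], target)
    return target

with tempfile.TemporaryDirectory() as tmp:
    for name in ("1123.doc", "11234.doc", "1233.docx"):
        open(os.path.join(tmp, name), "w").close()
    rename_reference(tmp, 1123, "bob.smith@yahoo.com")
    rename_reference(tmp, 1233, "john.drako@gmail.com")
    remaining = sorted(os.listdir(tmp))

print(remaining)
```

Because the pattern ends in ".*", reference 1123 does not pick up 11234.doc, although a name such as 1123.backup.doc would still match.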
Answer 3 (score: 0)
Here is one approach using simple iteration.
For example:
import os
import pandas as pd
#Sample Data (needs: import numpy as np)#
#dfOne = pd.DataFrame({'Reference': [1123, 1233, 1334, 4444, 5555],'EmailAddress': ["bob.smith@yahoo.com", "john.drako@gmail.com", "samuel.manuel@yahoo.com", np.nan, "samuel.manuel@yahoo.com"]})
dfOne = pd.read_excel('Book2.xlsx', na_values=['NA'], usecols = "A:D")
dfOne.dropna(inplace=True) #Drop rows with NaN
# References as strings; strip a trailing ".0" left over from float columns
refs = dfOne['Reference'].astype(str).str.replace(r"\.0$", "", regex=True)
for root, dirs, files in os.walk("applicantsCVs"):
    for file in files:
        file_name, ext = os.path.splitext(file)
        # Exact comparison: str.contains would also match substrings ("12" in "1123")
        email = dfOne[refs == file_name]["EmailAddress"]
        if not email.empty:
            os.rename(os.path.join(root, file), os.path.join(root, email.values[0]+ext))
Or, if you only have .docx files to rename:
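import os
import re
import pandas as pd
dfOne = pd.read_excel('Book2.xlsx', na_values=['NA'], usecols = "A:D")
dfOne.dropna(inplace=True) #Drop rows with NaN before casting, otherwise NaN becomes the string "nan"
dfOne["Reference"] = dfOne["Reference"].astype(str).str.replace(r"\.0$", "", regex=True)
ext = ".docx"
for root, dirs, files in os.walk("applicantsCVs"):
    if not files:
        continue
    # Group the alternatives so that \b applies to every stem, not just the first and last
    pattern = r"\b(?:" + "|".join(re.escape(os.path.splitext(i)[0]) for i in files) + r")\b"
    matches = dfOne[dfOne['Reference'].str.contains(pattern, regex=True)]
    # Select the two columns explicitly so the unpacking order is (reference, email)
    for ref, email in matches[["Reference", "EmailAddress"]].values:
        os.rename(os.path.join(root, ref+ext), os.path.join(root, email+ext))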
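A note on the word-boundary pattern used in this answer: in a regular expression, alternation binds more loosely than \b, so a pattern built as \b + "a|b" + \b only anchors the first and last alternatives on one side each. The short check below uses only the standard re module; the stems and the test string are made-up examples.

```python
import re

stems = ["1123", "1233"]

# Without a group this is parsed as (\b1123) | (1233\b)
ungrouped = r"\b" + "|".join(stems) + r"\b"
# With a non-capturing group, both boundaries apply to every stem
grouped = r"\b(?:" + "|".join(re.escape(s) for s in stems) + r")\b"

# "112399" merely starts with 1123; it is not one of the stems
loose_hit = re.search(ungrouped, "112399") is not None
strict_hit = re.search(grouped, "112399") is not None
exact_hit = re.search(grouped, "1123") is not None

print(loose_hit, strict_hit, exact_hit)
```

If the references must equal the file stems exactly, a plain set lookup such as `ref in set(stems)` avoids regular expressions altogether.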
Answer 4 (score: 0)
Let us consider the following sample data in the Excel sheet:
Reference EmailAddress
1123 bob.smith@yahoo.com
1233 john.drako@gmail.com
1334 samuel.manuel@yahoo.com
nan python@gmail.com
Solving the problem involves the following steps.
Import the data correctly from the Excel sheet "my.xlsx". I am using the sample data here.
import pandas as pd
import os
#import data from excel sheet and drop rows with nan
df = pd.read_excel('my.xlsx').dropna()
#check the head of data if the data is in desirable format
df.head()
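You will see here that the data type of the Reference column is float.
Change the data type of the Reference column to integer and then to string (astype returns a new object and has no inplace argument):
df['Reference'] = df.Reference.astype(int)
df = df.astype(str)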
df.head()
Now the data is in the desired format.
Rename the files in the target folder. Zip the lists of "Reference" and "EmailAddress" to use them in a for loop.
#absolute path to the folder. I assume you have the folder "applicant cv" in the home directory
path_to_files='/home/applicant cv/'
for ref,email in zip(list(df['Reference']),list(df['EmailAddress'])):
    try:
        os.rename(path_to_files+ref+'.doc',path_to_files+email+'.doc')
    except FileNotFoundError:
        # no file for this reference: skip it and carry on
        print ("File name doesn't exist in the list, I am leaving it as it is")
Answer 5 (score: 0)
Step 1: import the data from the Excel sheet "Book1.xlsx"
import pandas as pd
df = pd.read_excel (r'path of your file here\Book1.xlsx')
print (df)
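Step 2: choose the path where the ".docx" files are located and store their names.
Take only the relevant part of the file name for the comparison.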
mypath = r'path of docx files\doc files'
from os import listdir,rename
from os.path import isfile, join
onlyfiles = [f for f in listdir(mypath) if isfile(join(mypath, f))]
#print(onlyfiles)
currentfilename=onlyfiles[0].split(".")[0]
This is how I stored the files
Step 3: run a loop to check whether the name matches a reference. Simply use the rename(src, dest) function from os.
# check every file in the folder, not only the first one
for filename in onlyfiles:
    currentfilename = filename.split(".")[0]
    # loop over all rows instead of a hard-coded range(3)
    for i in range(len(df)):
        if str(currentfilename)==str(df['Reference'][i]):
            corresponding_email=df['EmailAddress'][i]
            rename(join(mypath, filename), join(mypath, corresponding_email+".docx"))
Check out the code with an example: https://github.com/Vineet-Dhaimodker