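I started learning Python this week, so I thought I would use it, instead of Excel, to parse some fields out of file paths.
I have about 3000 files that all follow a naming convention: /Household/LastName.FirstName.Account.Doctype.Date.extension
For example, one of these files might be named Cosby.Bill..Profile.2006.doc, and its full path is /Volumes/HD/Organized Files/Cosby, Bill/Cosby.Bill..Profile.2006.doc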
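In this case:
Cosby, Bill would be the Household
The Household (Cosby, Bill) is the enclosing folder for the actual file
Bill would be the first name
Cosby would be the last name
The Account field is omitted
Profile is the doctype
2006 is the date
doc is the extension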
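All of these files are located under the directory /Volumes/HD/Organized Files/. I used Terminal and ls to dump a list of all the files into a .txt file on my desktop, and I am trying to parse the information from the file paths into categories like the example above. Ideally I would like to output to a CSV with one column per category. Here is my ugly code: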
def main():
    file = open('~/Desktop/client_docs.csv', "rb")
    output = open('~/Desktop/client_docs_parsed.txt', "wb")
    for line in file:
        i = line.find(find_nth(line, '/', 2))
        beghouse = line[i + len(find_nth(line, '/', 2)):]
        endhouse = beghouse.find('/')
        household = beghouse[:endhouse]
        lastn = (line[line.find(household):])[(line[line.find(household):]).find('/') + 1:(line[line.find(household):]).find('.')]
        firstn = line[line.find('.') + 1: line.find('.', line.find('.') + 1)]
        acct = line[line.find('{}.{}.'.format(lastn,firstn)) + len('{}.{}.'.format(lastn,firstn)):line.find('.',line.find('{}.{}.'.format(lastn,firstn)) + len('{}.{}.'.format(lastn,firstn)))]
        doctype_beg = line[line.find('{}.{}.{}.'.format(lastn, firstn, acct)) + len('{}.{}.{}.'.format(lastn, firstn, acct)):]
        doctype = doctype_beg[:doctype_beg.find('.')]
        date_beg = line[line.find('{}/{}.{}.{}.{}.'.format(household,lastn,firstn,acct,doctype)) + len('{}/{}.{}.{}.{}.'.format(household,lastn,firstn,acct,doctype)):]
        date = date_beg[:date_beg.find('.')]
        print '"',household, '"','"',lastn, '"','"',firstn, '"','"',acct, '"','"',doctype, '"','"',date,'"'

def find_nth(body, s_term, n):
    start = body[::-1].find(s_term)
    while start >= 0 and n > 1:
        start = body[::-1].find(s_term, start+len(s_term))
        n -= 1
    return ((body[::-1])[start:])[::-1]

if __name__ == "__main__": main()
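It seems to work fine, but I run into problems when there is another enclosing folder, which then shifts all my fields. For example, instead of the file residing at
/Volumes/HD/Organized Files/Cosby, Bill/
it is at /Volumes/HD/Organized Files/Resigned/Cosby, Bill/
I know there has to be a less clunky way of going about this.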
Answer 0 (score: 1)
Here is a more practical tool than your function find_nth(): the string method rsplit()
def find_nth(body, s_term, n):
    start = body[::-1].find(s_term)
    print '------------------------------------------------'
    print 'body[::-1]\n',body[::-1]
    print '\nstart == %s' % start
    while start >= 0 and n > 1:
        start = body[::-1].find(s_term, start+len(s_term))
        print 'n == %s start == %s' % (n,start)
        n -= 1
    print '\n (body[::-1])[start:]\n',(body[::-1])[start:]
    print '\n((body[::-1])[start:])[::-1]\n',((body[::-1])[start:])[::-1]
    print '---------------\n'
    return ((body[::-1])[start:])[::-1]

def cool_find_nth(body, s_term, n):
    assert(len(s_term)==1)
    return body.rsplit(s_term,n)[0] + s_term

ss = 'One / Two / Three / Four / Five / Six / End'
print 'the string\n%s\n' % ss
print ('================================\n'
       "find_nth(ss, '/', 3)\n%s" % find_nth(ss, '/', 3) )
print '================================='
print "cool_find_nth(ss, '/', 3)\n%s" % cool_find_nth(ss, '/', 3)
Result
the string
One / Two / Three / Four / Five / Six / End

------------------------------------------------
body[::-1]
dnE / xiS / eviF / ruoF / eerhT / owT / enO

start == 4
n == 3 start == 10
n == 2 start == 17

 (body[::-1])[start:]
/ ruoF / eerhT / owT / enO

((body[::-1])[start:])[::-1]
One / Two / Three / Four /
---------------

================================
find_nth(ss, '/', 3)
One / Two / Three / Four /
=================================
cool_find_nth(ss, '/', 3)
One / Two / Three / Four /
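Applied to the paths in the question, the same rsplit() idea can peel the household folder and the file name off the right-hand end, so extra enclosing folders on the left no longer shift anything. The following is only a sketch in Python 3 syntax: the helper name split_from_right is made up for illustration, and it assumes every path contains at least two slashes.

```python
def split_from_right(path):
    """Return (household, filename fields) for one path from the listing.

    Hypothetical helper: assumes the household is the folder that directly
    encloses the file, as described in the question.
    """
    # maxsplit=2 separates only the last two path components from the rest.
    _, household, filename = path.strip().rsplit('/', 2)
    return household, filename.split('.')


for p in ('/Volumes/HD/Organized Files/Cosby, Bill/Cosby.Bill..Profile.2006.doc',
          '/Volumes/HD/Organized Files/Resigned/Cosby, Bill/Cosby.Bill..Profile.2006.doc'):
    print(split_from_right(p))
```

Both paths should yield the same household, because everything to the left of the last two slashes is ignored.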
Here is another very practical tool: regular expressions
import re

# Raw strings keep the backslashes intact for the regex engine.
reg = re.compile(r'/'
                 r'([^/.]*?)/'    # household (the enclosing folder)
                 r'([^/.]*?)\.'   # last name
                 r'([^/.]*?)\.'   # first name
                 r'([^/.]*?)\.'   # account (may be empty)
                 r'([^/.]*?)\.'   # doctype
                 r'([^/.]*?)\.'   # date
                 r'[^/.]+\Z')     # extension, anchored at the end of the string

def main():
    #file = open('~/Desktop/client_docs.csv', "rb")
    #output = open('~/Desktop/client_docs_parsed.txt', "wb")
    li = ['/Household/LastName.FirstName.Account.Doctype.Date.extension',
          '- /Volumes/HD/Organized Files/Cosby, Bill/Cosby.Bill..Profile.2006.doc']
    for line in li:
        print "line == %r" % line
        household,lastn,firstn,acct,doctype,date = reg.search(line).groups('')
        print ('household == %r\n'
               'lastn == %r\n'
               'firstn == %r\n'
               'acct == %r\n'
               'doctype == %r\n'
               'date == %r\n'
               % (household,lastn,firstn,acct,doctype,date))

if __name__ == "__main__": main()
Result
line == '/Household/LastName.FirstName.Account.Doctype.Date.extension'
household == 'Household'
lastn == 'LastName'
firstn == 'FirstName'
acct == 'Account'
doctype == 'Doctype'
date == 'Date'
line == '- /Volumes/HD/Organized Files/Cosby, Bill/Cosby.Bill..Profile.2006.doc'
household == 'Cosby, Bill'
lastn == 'Cosby'
firstn == 'Bill'
acct == ''
doctype == 'Profile'
date == '2006'
I wonder where my brain was when I posted the previous edit. The following works just as well:
rig = re.compile('[/.]')
rig.split(line)[-7:-1]
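Put together, the split-based approach could turn the whole listing into the CSV asked for in the question. What follows is a rough Python 3 sketch, not a tested drop-in: the function names, the header labels and the in-memory sample listing are assumptions made for illustration.

```python
import csv
import io
import re

SEPARATORS = re.compile(r'[/.]')
HEADER = ['Household', 'LastName', 'FirstName', 'Account', 'Doctype', 'Date']


def parse_line(line):
    """Return the six fields of one listed path, counted from the right."""
    # The last piece is the extension; the six pieces before it are the
    # household folder and the five dot-separated name fields.
    return SEPARATORS.split(line.strip())[-7:-1]


def write_csv(lines, out):
    """Write one CSV row per non-empty line to the open text file `out`."""
    writer = csv.writer(out)
    writer.writerow(HEADER)
    for line in lines:
        if line.strip():
            writer.writerow(parse_line(line))


# Sample listing standing in for the .txt file produced with ls.
listing = [
    '/Volumes/HD/Organized Files/Cosby, Bill/Cosby.Bill..Profile.2006.doc\n',
    '/Volumes/HD/Organized Files/Resigned/Cosby, Bill/Cosby.Bill..Profile.2006.doc\n',
]
buffer = io.StringIO()
write_csv(listing, buffer)
print(buffer.getvalue())
```

To write a real file, the StringIO buffer would be replaced by something like open(path, 'w', newline=''). Note that open() does not expand '~', so a path such as ~/Desktop/client_docs.csv would first need os.path.expanduser(). Like the two-line version, this misparses any household folder whose name contains a dot.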
Answer 1 (score: 1)
From what I can gather, I believe this would work as a solution that does not rely on a previously compiled list of files:
import csv
import os

# Replace this with the directory where the household directories are stored.
directory = "home"

# Python 2: the csv module expects the file to be opened in binary mode.
output = open("Output.csv", "wb")
csvf = csv.writer(output)
headerRow = ["Household", "Lastname", "Firstname", "Account", "Doctype",
             "Date", "Extension"]
csvf.writerow(headerRow)

for root, dirs, files in os.walk(directory):
    # The folder that directly encloses a file is its household, however
    # deeply that folder is nested below `directory`.
    household = os.path.basename(root)
    for filename in files:
        # This will create a record for each filename within the "household",
        # splitting the filename on "." to get the detail.
        csvf.writerow([household] + filename.split("."))

output.close()
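This uses the os library to "walk" the list of households. Then, for each "household", it gathers the list of files. It takes this list and generates records in the csv file, using the period as the delimiter to split up each file's name.
It makes use of the csv library to generate the output, which looks something like this:
Household,Lastname,Firstname,Account,Doctype,Date,Extension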
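If the extension is not needed, it can be omitted by changing the line:
csvf.writerow([household] + filename.split("."))
to
csvf.writerow([household] + filename.split(".")[:-1])
which tells it to use every part of the filename except the last one, and then removing the "Extension" string from headerRow.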
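For readers on Python 3, the walk-and-split idea, including the optional dropping of the extension, could be sketched as follows. This is an illustration under assumptions, not this answer's original code: the function name, the keep_extension flag and the choice to treat the folder directly enclosing each file as its household are all introduced here.

```python
import csv
import os

HEADER = ['Household', 'Lastname', 'Firstname', 'Account', 'Doctype', 'Date',
          'Extension']


def write_household_csv(directory, csv_path, keep_extension=True):
    """Walk `directory` and write one CSV row per file found below it."""
    # newline='' is what the csv module documentation asks for in Python 3.
    with open(csv_path, 'w', newline='') as output:
        writer = csv.writer(output)
        writer.writerow(HEADER if keep_extension else HEADER[:-1])
        for root, _dirs, files in os.walk(directory):
            if root == directory:
                continue  # skip files sitting outside any household folder
            household = os.path.basename(root)
            for filename in sorted(files):
                fields = filename.split('.')
                if not keep_extension:
                    fields = fields[:-1]  # everything except the last part
                writer.writerow([household] + fields)
```

A call such as write_household_csv('/Volumes/HD/Organized Files', 'Output.csv', keep_extension=False) would then cover nested folders like Resigned/Cosby, Bill as well, since only the innermost folder name is used.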
Hope this helps.
Answer 2 (score: 0)
It is a little unclear what the problem is, but in the meantime, here is something to get you started:
#!/usr/bin/env python
import os
import csv

with open("f1", "rb") as fin:
    reader = csv.reader(fin, delimiter='.')
    for row in reader:
        # split path
        row = list(os.path.split(row[0])) + row[1:]
        print ','.join(row)
Output:
/Household,LastName,FirstName,Account,Doctype,Date,extension
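Another interpretation is that you want to store each field in a variable, and an extra path component messes things up...
This is what row looks like inside the for loop:
['/Household/LastName', 'FirstName', 'Account', 'Doctype', 'Date', 'extension']
The solution then might be to work backwards: assign row[-1] to extension, row[-2] to date, and so on.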
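Working backwards could look like the sketch below (Python 3 syntax). The function name and the dictionary keys are illustrative assumptions; the point is only that negative indices count from the end, so extra folders at the front of the path leave every field in place.

```python
import os


def fields_from_the_end(path):
    """Assign fields by counting from the end of the path (hypothetical helper)."""
    folder, filename = os.path.split(path.strip())
    row = filename.split('.')
    return {
        'extension': row[-1],
        'date': row[-2],
        'doctype': row[-3],
        'account': row[-4],
        'firstname': row[-5],
        'lastname': row[-6],
        # The household is the last component of the enclosing folder path.
        'household': os.path.basename(folder),
    }


print(fields_from_the_end(
    '/Volumes/HD/Organized Files/Resigned/Cosby, Bill/Cosby.Bill..Profile.2006.doc'))
```

A file name with fewer than six dot-separated parts would raise an IndexError here, which at least makes entries that break the naming convention easy to spot.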