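I have a large number (thousands) of PDF files downloaded from IEEE Xplore.
Each file name consists only of the article number, for example
6215021.pdf
Now, if you visit
http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=6215021
you can find all the available information about that article.
If you look at the page source, you will find the following section: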
<meta name="citation_title" content="Decomposition-Based Distributed Control for Continuous-Time Multi-Agent Systems">
<meta name="citation_date" content="Jan. 2013">
<meta name="citation_volume" content="58">
<meta name="citation_issue" content="1">
<meta name="citation_firstpage" content="258">
<meta name="citation_lastpage" content="264">
<meta name="citation_doi" content="10.1109/TAC.2012.2204153">
<meta name="citation_abstract_html_url" content="http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6215021' escapeXml='false'/>">
<meta name="citation_pdf_url" content="http://ieeexplore.ieee.org/iel5/9/6384835/06215021.pdf?arnumber=6215021">
<meta name="citation_issn" content="0018-9286">
<meta name="citation_isbn" content="">
<meta name="citation_language" content="English">
<meta name="citation_keywords" content="
Distributed control;
Output feedback;
Satellites;
Stability criteria;
Standards;
State feedback;
Upper bound;
Distributed control;
linear matrix inequality (LMI);
multi-agent systems;
robust control;">
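I would like to rename my files to "firstpage - citation_title.pdf".
My programming skills are limited (only some C, and no clue about parsing), so I am counting on your help.
Thanks in advance, everyone!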
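Answer 0 (score: 1)
You can compile the following C# code against the iTextSharp library. It renames every PDF file in a directory based on the file's own metadata, using its title or, failing that, its subject.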
using System.IO;
using iTextSharp.text.pdf;

namespace BatchRename
{
    class Program
    {
        private static string getTitle(PdfReader reader)
        {
            string title;
            reader.Info.TryGetValue("Title", out title); // Reading PDF file's meta data
            return string.IsNullOrWhiteSpace(title) ? string.Empty : title.Trim();
        }

        private static string getSubject(PdfReader reader)
        {
            string subject;
            reader.Info.TryGetValue("Subject", out subject); // Reading PDF file's meta data
            return string.IsNullOrWhiteSpace(subject) ? string.Empty : subject.Trim();
        }

        static void Main(string[] args)
        {
            var dir = @"D:\Prog\1390\iTextSharpTests\BatchRename\bin\Release";
            foreach (var file in Directory.GetFiles(dir, "*.pdf"))
            {
                var reader = new PdfReader(file);
                var title = getTitle(reader);
                var subject = getSubject(reader);
                reader.Close();

                // Prefer the title; fall back to the subject
                var name = !string.IsNullOrWhiteSpace(title) ? title : subject;
                if (string.IsNullOrWhiteSpace(name))
                    continue; // no usable metadata, keep the original name

                // Replace characters that are not allowed in file names
                name = string.Join(" ", name.Split(Path.GetInvalidFileNameChars()));
                var newFile = Path.Combine(dir, name + ".pdf");

                // File.Move throws if the target already exists, so skip such files
                if (!File.Exists(newFile))
                    File.Move(file, newFile);
            }
        }
    }
}
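The title-then-subject fallback used in the C# code can also be sketched in Python 3, together with two details that matter when renaming thousands of files: characters that are not allowed in file names, and two PDFs that end up with the same title. This is only a sketch under assumptions: `info` is a hypothetical plain dict standing in for whatever metadata a PDF library returns, and `safe_name` and `unique_path` are illustrative helper names, not part of iTextSharp or any other library.

```python
import re
from pathlib import Path

# Characters that Windows rejects in file names, plus control characters.
ILLEGAL = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def safe_name(info, max_len=150):
    """Return the Title, else the Subject, cleaned up for use as a file name.

    `info` is a hypothetical metadata dict. Returns None when neither
    field contains usable text.
    """
    for key in ("Title", "Subject"):
        text = ILLEGAL.sub(" ", info.get(key) or "")
        # Collapse runs of whitespace, drop leading/trailing spaces and dots,
        # then cap the length and trim again in case the cut left a space.
        text = re.sub(r"\s+", " ", text).strip(" .")[:max_len].strip()
        if text:
            return text
    return None


def unique_path(directory, stem, suffix=".pdf"):
    """Return a path in `directory` that does not exist yet.

    Appends " (2)", " (3)", ... to the stem when the plain name is taken.
    """
    candidate = Path(directory) / (stem + suffix)
    counter = 2
    while candidate.exists():
        candidate = Path(directory) / f"{stem} ({counter}){suffix}"
        counter += 1
    return candidate
```

A caller would rename a file only when `safe_name` returns something, for example `Path(old).rename(unique_path(folder, name))`, and leave the file untouched otherwise.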
Answer 1 (score: 1)
If you are on a Mac, you can use PDF Paper renamer: https://itunes.apple.com/app/pdf-paper-renamer/id591593578?mt=12
Answer 2 (score: 0)
Here is my code in Python (written for Python 2).
#!/usr/bin/env python
'''
Created on Sep 28, 2013
@author: dataq <http://stackoverflow.com/users/2585246/dataq>
This is a simple script to rename papers based on their ORIGINAL FILENAME and the publisher's website.
You are free to use this code, but don't blame me for any errors.
I am not writing any documentation, so please read my mind in this code.
USE AT YOUR OWN RISK *evil smirk*
'''
import urllib2, re, time, random
from os import listdir, rename
from os.path import isfile, join

# every publisher needs a different way of scraping
IEEE = 1
SCIENCEDIRECT = 2

# yes, I know, this is very bad and stupid web scraping. But it works, at least.

# get the title of an IEEE paper
# an IEEE paper filename looks like this: '06089032.pdf'
def getIEEETitle(fname):
    # get url
    number = int(fname.split('.')[0])
    targeturl = 'http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber='+str(number)
    # open and read from that url
    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    ieeePage = opener.open(targeturl).read()
    # split every tag in the html. this is the stupid part :p
    ieeePageSplit = ieeePage.replace('<','>').split('>')
    title = None
    # find a tag that starts with 'meta name="citation_title" content="'
    for i in ieeePageSplit:
        if i.startswith('meta name="citation_title" content="'):
            # get the paper title
            title = i.split('"')[3]
            break
    # no title found: let the caller report the failure
    if title is None:
        return None
    # a file name cannot be longer than 255 characters (theoretically)
    # http://msdn.microsoft.com/en-us/library/aa365247.aspx
    return title.strip()[:150]

# get the title of a Science Direct paper
# a Science Direct paper filename looks like this: '1-s2.0-0031320375900217-main.pdf'
def getScienceDirectTittle(fname):
    # get url
    number = fname.split('-')[2]
    targeturl = 'http://www.sciencedirect.com/science/article/pii/'+number
    # open and read from that url
    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    sdPage = opener.open(targeturl).read()
    # split every tag in the html. this is the stupid part :p
    sdPageSplit = sdPage.replace('<','>').split('>')
    title = None
    # the text right after the opening 'title' tag is the paper title
    for i in range(len(sdPageSplit)):
        if sdPageSplit[i].startswith('title'):
            title = sdPageSplit[i+1]
            break
    # no title found: let the caller report the failure
    if title is None:
        return None
    # a file name cannot be longer than 255 characters (theoretically)
    # http://msdn.microsoft.com/en-us/library/aa365247.aspx
    return title.strip()[:150]

def batchRename(workingdir, site):
    # list all pdf files in the working directory
    files = [ fInput for fInput in listdir(workingdir)
              if isfile(join(workingdir,fInput)) and fInput.lower().endswith('.pdf') ]
    # compiled regular expression for illegal filename characters
    reIlegalChar = re.compile(r'([<>:"/\\|?*])')
    # rename all files
    for f in files:
        try:
            # find title
            if site == IEEE:
                title = getIEEETitle(f)
            elif site == SCIENCEDIRECT:
                title = getScienceDirectTittle(f)
            else:
                title = None
            if title:
                # remove illegal file name characters
                fnew = reIlegalChar.sub(r' ', title) + '.pdf'
                print '{} --> {}'.format(f, fnew)
                # rename file
                rename(join(workingdir, f), join(workingdir, fnew))
                print 'Success'
            else:
                print '{}\nFailed'.format(f)
        except Exception:
            print '{}\nERROR'.format(f)
        # give some random delay, so we will not be blocked (hopefully) :p
        time.sleep(random.randrange(10))

if __name__ == '__main__':
    print 'Please be patient, it takes time depending on your internet connection speed...'
    workingdir = 'C:\\Users\\dataq\\Downloads\\paper\\'
    batchRename(workingdir, IEEE)
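This code works for both IEEE and Science Direct articles. Put the articles in workingdir; you can of course change the value of workingdir to your own folder.
As written, the code renames the IEEE articles in the folder C:\Users\dataq\Downloads\paper\. To rename Science Direct articles, change batchRename(workingdir, IEEE) to batchRename(workingdir, SCIENCEDIRECT).
You must make sure the files still have their original names (an original IEEE file name looks like 06089032.pdf, and a Science Direct one looks like 1-s2.0-0031320375900217-main.pdf).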
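Because the two original file-name shapes are so different, the publisher could probably be detected from the name itself instead of being passed in by hand. The Python 3 sketch below illustrates that idea. The two patterns are assumptions generalized from the single example of each name given here, `lookup_url` is a hypothetical helper name, and the URL prefixes are the ones used in the script, which the sites may no longer serve.

```python
import re

# URL prefixes as they appear in the script above; the sites may have changed.
IEEE_URL = "http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber="
SCIENCEDIRECT_URL = "http://www.sciencedirect.com/science/article/pii/"

# Assumed name shapes: digits only for IEEE (leading zeros dropped),
# and "1-s2.0-<id>-main.pdf" for Science Direct.
IEEE_NAME = re.compile(r"^0*(\d+)\.pdf$", re.IGNORECASE)
SCIENCEDIRECT_NAME = re.compile(r"^1-s2\.0-([0-9A-Za-z]+)-main\.pdf$", re.IGNORECASE)


def lookup_url(filename):
    """Return the article-page URL for an original download name, or None."""
    match = IEEE_NAME.match(filename)
    if match:
        return IEEE_URL + match.group(1)
    match = SCIENCEDIRECT_NAME.match(filename)
    if match:
        return SCIENCEDIRECT_URL + match.group(1)
    # Already renamed, or from some other source: nothing to look up.
    return None
```

Files for which `lookup_url` returns None can simply be skipped, which also keeps already renamed files from being requested again on a second run.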
I make no guarantee that this works correctly, so use it at your own risk.
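The question asked for names of the form "firstpage - citation_title.pdf", while the script above uses the title alone. A minimal Python 3 sketch of reading both fields with the standard library's HTML parser, instead of splitting the page on angle brackets, might look like this. It assumes the page still carries the `citation_*` meta tags shown in the question; `CitationMeta` and `target_name` are hypothetical names, and fetching the page is left out.

```python
import re
from html.parser import HTMLParser


class CitationMeta(HTMLParser):
    """Collect <meta name="citation_*" content="..."> pairs into a dict."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name = attrs.get("name") or ""
        # Keep the first occurrence of each citation_* field.
        if name.startswith("citation_") and name not in self.fields:
            self.fields[name] = (attrs.get("content") or "").strip()


def target_name(page_html, max_len=150):
    """Build 'firstpage - title.pdf' from a page's HTML, or return None."""
    parser = CitationMeta()
    parser.feed(page_html)
    title = parser.fields.get("citation_title")
    if not title:
        return None
    first_page = parser.fields.get("citation_firstpage")
    stem = f"{first_page} - {title}" if first_page else title
    # Same illegal-character set as the regular expression in the script above.
    stem = re.sub(r'[<>:"/\\|?*]', " ", stem)
    return stem[:max_len].strip() + ".pdf"
```

Fed the meta tags quoted in the question, `target_name` yields "258 - Decomposition-Based Distributed Control for Continuous-Time Multi-Agent Systems.pdf". When the first page is missing it falls back to the title alone.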