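Searching a PDF for a specific string in JavaScript actions with iText

Time: 2016-12-11 19:22:28

Tags: pdf itext

My goal is to find JavaScript matching a given pattern in the annotations of a PDF. To that end, I came up with the following code: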

public static void main(String[] args) {

        try {

            // Reads and parses a PDF document
            PdfReader reader = new PdfReader("Test.pdf");

            // For each PDF page
            for (int i = 1; i <= reader.getNumberOfPages(); i++) {

                // Get the dictionary of page i
                PdfDictionary page = reader.getPageN(i);
                // Get all the annotations of page i
                PdfArray annotsArray = page.getAsArray(PdfName.ANNOTS);

                // If page does not have annotations
                if (page.getAsArray(PdfName.ANNOTS) == null) {
                    continue;
                }

                // For each annotation
                for (int j = 0; j < annotsArray.size(); ++j) {

                    // For current annotation
                    PdfDictionary curAnnot = annotsArray.getAsDict(j);

                    // check if has JS as described below
                 PdfDictionary AnnotationAction = AnnotationDictionary.GetAsDict(PdfName.A);
                 // test if it is a JavaScript action
                 if (AnnotationAction.Get(PdfName.S).Equals(PdfName.JavaScript)){
                 // what here?
                 }


                }
            }

        } catch (Exception e) {
            e.printStackTrace();
        }

    }

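As far as I know, string comparison is done by the StringCompare library. The problem is that it compares two whole strings, whereas I want to know whether the JavaScript action in an annotation starts with (or contains) this string: if (this.hostContainer) { try {

So, how can I check whether the JavaScript in an annotation contains the string above?

Edit: a sample page with JS can be found at: pdf with JS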

1 answer:

Answer 0 (score: 1)

JavaScript actions are defined in ISO 32000-1 as follows:

  

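12.6.4.16 JavaScript Actions

Upon invocation of a JavaScript action, a conforming processor shall execute a script that is written in the JavaScript programming language. Depending on the nature of the script, various interactive form fields in the document may update their values or change their visual appearances. Mozilla Development Center's Client-Side JavaScript Reference and the Adobe JavaScript for Acrobat API Reference (see the Bibliography) give details on the contents and effects of JavaScript scripts. Table 217 shows the action dictionary entries specific to this type of action.

Table 217 - Additional entries specific to a JavaScript action

Key    Type                          Value
S      name                          (Required) The type of action that this dictionary describes; shall be JavaScript for a JavaScript action.
JS     text string or text stream    (Required) A text string or text stream containing the JavaScript script to be executed. PDFDocEncoding or Unicode encoding (the latter identified by the Unicode prefix U+FEFF) shall be used to encode the contents of the string or stream.

To support the use of parameterized function calls in JavaScript scripts, the JavaScript entry in a PDF document's name dictionary (see 7.7.4, "Name Dictionary") may contain a name tree that maps name strings to document-level JavaScript actions. When the document is opened, all of the actions in this name tree shall be executed, defining JavaScript functions for use by other scripts in the document.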

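Thus, if you want to know whether the JavaScript action in an annotation starts with (or contains) the string if (this.hostContainer) { try { at this point in your code:

 if (AnnotationAction.Get(PdfName.S).Equals(PdfName.JavaScript)){
 // what here?
 }

you will probably want to first check whether AnnotationAction.Get(PdfName.JS) is a PdfString or a PdfStream, retrieve its content as a string in either case, and then check, using the usual string comparison methods, whether that script, or any function it calls (such functions may be defined in the JavaScript name tree), contains the string you are searching for.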
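Once the script is available as a plain string, the "starts with" and "contains" checks need nothing beyond the standard library. The following is only a minimal sketch of that step: the class name ScriptMatch and its method names are made up for illustration, and the whitespace-tolerant variant is an optional extra that the question did not ask for.

```java
import java.util.Arrays;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class ScriptMatch {
    // The literal snippet from the question.
    public static final String NEEDLE = "if (this.hostContainer) { try {";

    // Hypothetical extra: the same tokens, but with any amount of whitespace
    // (including none) allowed wherever the snippet has a space.
    private static final Pattern LOOSE = Pattern.compile(
            Arrays.stream(NEEDLE.split("\\s+"))
                  .map(Pattern::quote)
                  .collect(Collectors.joining("\\s*")));

    // True if the script begins with the snippet, ignoring leading whitespace.
    public static boolean startsWithNeedle(String script) {
        return script.trim().startsWith(NEEDLE);
    }

    // True if the snippet occurs anywhere in the script, character for character.
    public static boolean containsNeedle(String script) {
        return script.contains(NEEDLE);
    }

    // True if the snippet occurs anywhere, tolerating different spacing.
    public static boolean containsLoosely(String script) {
        return LOOSE.matcher(script).find();
    }

    public static void main(String[] args) {
        String script = "if (this.hostContainer) { try {\n"
                + "this.hostContainer.postMessage(['newPage', 'pp_vii', 0]);\n"
                + "} catch(e) { console.println(e); }};";
        String compact = "var x=1;if(this.hostContainer){try{x++;}catch(e){}}";

        System.out.println(startsWithNeedle(script));  // true
        System.out.println(containsNeedle(compact));   // false
        System.out.println(containsLoosely(compact));  // true
    }
}
```

Exact matching with contains is enough when the scripts are generated by one tool and always formatted identically; the regex variant only helps if spacing may differ between files.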

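Sample code

I took your code, cleaned it up a bit (in particular, it was a mix of C# and Java), and added code as described above to inspect the immediate JavaScript code in annotation action elements:

Java version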

System.out.println("file.pdf - Looking for special JavaScript actions.");
// Reads and parses a PDF document
PdfReader reader = new PdfReader(resource);

// For each PDF page
for (int i = 1; i <= reader.getNumberOfPages(); i++)
{
    System.out.printf("\nPage %d\n", i);
    // Get the dictionary of page i
    PdfDictionary page = reader.getPageN(i);
    // Get all the annotations of page i
    PdfArray annotsArray = page.getAsArray(PdfName.ANNOTS);

    // If page does not have annotations
    if (annotsArray == null)
    {
        System.out.println("No annotations.");
        continue;
    }

    // For each annotation
    for (int j = 0; j < annotsArray.size(); ++j)
    {
        System.out.printf("Annotation %d - ", j);

        // For current annotation
        PdfDictionary curAnnot = annotsArray.getAsDict(j);

        // check if has JS as described below
        PdfDictionary annotationAction = curAnnot.getAsDict(PdfName.A);
        if (annotationAction == null)
        {
            System.out.print("no action");
        }
        // test if it is a JavaScript action
        else if (PdfName.JAVASCRIPT.equals(annotationAction.get(PdfName.S)))
        {
            PdfObject scriptObject = annotationAction.getDirectObject(PdfName.JS);
            if (scriptObject == null)
            {
                System.out.println("missing JS entry");
                continue;
            }
            final String script;
            if (scriptObject.isString())
                script = ((PdfString)scriptObject).toUnicodeString();
            else if (scriptObject.isStream())
            {
                try (   ByteArrayOutputStream baos = new ByteArrayOutputStream()    )
                {
                    ((PdfStream)scriptObject).writeContent(baos);
                    script = baos.toString("ISO-8859-1");
                }
            }
            else
            {
                System.out.println("malformed JS entry");
                continue;
            }

            if (script.contains("if (this.hostContainer) { try {"))
                System.out.print("contains test string - ");

            System.out.printf("\n---\n%s\n---", script);
            // what here?
        }
        else
        {
            System.out.print("no JavaScript action");
        }
        System.out.println();
    }
}

(Test SearchActionJavaScript, method testSearchJsActionInFile)
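The stream branch above decodes the bytes as ISO-8859-1, while the specification quoted earlier allows either PDFDocEncoding or a Unicode encoding marked by the prefix U+FEFF. If Unicode-encoded scripts are expected, the raw bytes could be decoded along the following lines. This is only a sketch: JsTextDecoder is a hypothetical helper, obtaining the raw bytes from iText is left out, and ISO-8859-1 is used as an approximation of PDFDocEncoding (the two agree for printable ASCII but differ for a number of other code points).

```java
import java.nio.charset.StandardCharsets;

public class JsTextDecoder {
    // Decodes the raw bytes of a JS entry into a Java string.
    public static String decode(byte[] raw) {
        // A leading FE FF is the big-endian byte order mark, i.e. U+FEFF,
        // which marks the Unicode (UTF-16BE) form of a PDF text string.
        if (raw.length >= 2 && (raw[0] & 0xFF) == 0xFE && (raw[1] & 0xFF) == 0xFF) {
            return new String(raw, 2, raw.length - 2, StandardCharsets.UTF_16BE);
        }
        // Assumption: ISO-8859-1 stands in for PDFDocEncoding here, which is
        // only exact for the characters both encodings share.
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String text = "if (this.hostContainer) { try {";

        // Same text, once as single-byte data and once as UTF-16BE with a BOM.
        byte[] singleByte = text.getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf16Body = text.getBytes(StandardCharsets.UTF_16BE);
        byte[] withBom = new byte[utf16Body.length + 2];
        withBom[0] = (byte) 0xFE;
        withBom[1] = (byte) 0xFF;
        System.arraycopy(utf16Body, 0, withBom, 2, utf16Body.length);

        System.out.println(decode(singleByte).equals(text)); // true
        System.out.println(decode(withBom).equals(text));    // true
    }
}
```

Without such a check, a UTF-16 script decoded as ISO-8859-1 has a NUL character between every two ASCII characters, so a plain contains test on it would not find the search string.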

C# version

using (PdfReader reader = new PdfReader(sourcePath))
{
    Console.WriteLine("file.pdf - Looking for special JavaScript actions.");

    // For each PDF page
    for (int i = 1; i <= reader.NumberOfPages; i++)
    {
        Console.Write("\nPage {0}\n", i);
        // Get the dictionary of page i
        PdfDictionary page = reader.GetPageN(i);
        // Get all the annotations of page i
        PdfArray annotsArray = page.GetAsArray(PdfName.ANNOTS);

        // If page does not have annotations
        if (annotsArray == null)
        {
            Console.WriteLine("No annotations.");
            continue;
        }

        // For each annotation
        for (int j = 0; j < annotsArray.Size; ++j)
        {
            Console.Write("Annotation {0} - ", j);

            // For current annotation
            PdfDictionary curAnnot = annotsArray.GetAsDict(j);

            // check if has JS as described below
            PdfDictionary annotationAction = curAnnot.GetAsDict(PdfName.A);
            if (annotationAction == null)
            {
                Console.Write("no action");
            }
            // test if it is a JavaScript action
            else if (PdfName.JAVASCRIPT.Equals(annotationAction.Get(PdfName.S)))
            {
                PdfObject scriptObject = annotationAction.GetDirectObject(PdfName.JS);
                if (scriptObject == null)
                {
                    Console.WriteLine("missing JS entry");
                    continue;
                }
                String script;
                if (scriptObject.IsString())
                    script = ((PdfString)scriptObject).ToUnicodeString();
                else if (scriptObject.IsStream())
                {
                    using (MemoryStream stream = new MemoryStream())
                    {
                        ((PdfStream)scriptObject).WriteContent(stream);
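                        // MemoryStream.ToString() only returns the type name, so decode the bytes instead
                        script = System.Text.Encoding.GetEncoding("ISO-8859-1").GetString(stream.ToArray());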
                    }
                }
                else
                {
                    Console.WriteLine("malformed JS entry");
                    continue;
                }

                if (script.Contains("if (this.hostContainer) { try {"))
                    Console.Write("contains test string - ");

                Console.Write("\n---\n{0}\n---", script);
                // what here?
            }
            else
            {
                Console.Write("no JavaScript action");
            }
            Console.WriteLine();
        }
    }
}

Output

Running either version against your sample file produces:

file.pdf - Looking for special JavaScript actions.

Page 1
Annotation 0 - contains test string - 
---
if (this.hostContainer) { try {
this.hostContainer.postMessage(['newPage', 'pp_vii', 0]);
} catch(e) { console.println(e); }};
---
Annotation 1 - contains test string - 
---
if (this.hostContainer) { try {
this.hostContainer.postMessage(['newPage', 'pp_ix', 0]);
} catch(e) { console.println(e); }};
---
Annotation 2 - contains test string - 
---
if (this.hostContainer) { try {
this.hostContainer.postMessage(['newPage', 'pp_xi', 0]);
} catch(e) { console.println(e); }};
---
Annotation 3 - contains test string - 
---
if (this.hostContainer) { try {
this.hostContainer.postMessage(['newPage', 'pp_3', 0]);
} catch(e) { console.println(e); }};
---
Annotation 4 - contains test string - 
---
if (this.hostContainer) { try {
this.hostContainer.postMessage(['newPage', 'pp_15', 0]);
} catch(e) { console.println(e); }};
---
Annotation 5 - contains test string - 
---
if (this.hostContainer) { try {
this.hostContainer.postMessage(['newPage', 'pp_37', 0]);
} catch(e) { console.println(e); }};
---
Annotation 6 - contains test string - 
---
if (this.hostContainer) { try {
this.hostContainer.postMessage(['newPage', 'pp_57', 0]);
} catch(e) { console.println(e); }};
---
Annotation 7 - contains test string - 
---
if (this.hostContainer) { try {
this.hostContainer.postMessage(['newPage', 'pp_81', 0]);
} catch(e) { console.println(e); }};
---
Annotation 8 - contains test string - 
---
if (this.hostContainer) { try {
this.hostContainer.postMessage(['newPage', 'pp_111', 0]);
} catch(e) { console.println(e); }};
---
Annotation 9 - contains test string - 
---
if (this.hostContainer) { try {
this.hostContainer.postMessage(['newPage', 'pp_136', 0]);
} catch(e) { console.println(e); }};
---
Annotation 10 - contains test string - 
---
if (this.hostContainer) { try {
this.hostContainer.postMessage(['newPage', 'pp_160', 0]);
} catch(e) { console.println(e); }};
---
Annotation 11 - contains test string - 
---
if (this.hostContainer) { try {
this.hostContainer.postMessage(['newPage', 'pp_197', 0]);
} catch(e) { console.println(e); }};
---
Annotation 12 - contains test string - 
---
if (this.hostContainer) { try {
this.hostContainer.postMessage(['newPage', 'pp_179', 0]);
} catch(e) { console.println(e); }};
---
Annotation 13 - contains test string - 
---
if (this.hostContainer) { try {
this.hostContainer.postMessage(['newPage', 'pp_201', 0]);
} catch(e) { console.println(e); }};
---
Annotation 14 - contains test string - 
---
if (this.hostContainer) { try {
this.hostContainer.postMessage(['newPage', 'pp_223', 0]);
} catch(e) { console.println(e); }};
---

Page 2
No annotations.

Page 3
No annotations.