Tika, Maven, dependencies... Why does Tika use EmptyParser?

Asked: 2018-10-15 08:55:11

Tags: java maven apache-tika

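I want to use Tika as a dependency in a Maven project to extract metadata from files. It works when I run the class with mvn exec:java, but not when I run it with java -cp, so I suspect a dependency problem...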

I use the Maven Shade plugin to bundle all dependencies into the jar at build time.

pom.xml:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>org.company.myapp</groupId>
  <artifactId>metadata-extractor</artifactId>
  <packaging>jar</packaging>
  <version>1.0-SNAPSHOT</version>

  <name>Metadata Extractor</name>
  <url>http://maven.apache.org</url>

  <properties>
    <tika.version>1.19</tika.version>
  </properties>

  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>3.8.1</version>
      <scope>test</scope>
    </dependency>
    <!-- Tika -->
    <dependency>
      <groupId>org.apache.tika</groupId>
      <artifactId>tika-parsers</artifactId>
      <version>${tika.version}</version>
    </dependency>
  </dependencies>

  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.8.0</version>
        <configuration>
          <source>1.8</source>
          <target>1.8</target>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <version>3.2.0</version>
        <executions>
          <execution>
            <phase>package</phase>
            <goals>
              <goal>shade</goal>
            </goals>
            <configuration>
              <minimizeJar>true</minimizeJar>
              <filters>
                <filter>
                  <artifact>*:*</artifact>
                  <excludes>
                    <exclude>META-INF/*.SF</exclude>
                    <exclude>META-INF/*.DSA</exclude>
                    <exclude>META-INF/*.RSA</exclude>
                  </excludes>
                </filter>
              </filters>
            </configuration>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>

</project>

Main class:

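package org.company.myapp;

import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.tika.config.TikaConfig;
import org.apache.tika.exception.TikaException;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class App
{
    public static void main( String[] args )
    {
        // Get path
        Path path = Paths.get("/path/to/image.jpg");

        // Use Tika
        TikaConfig tikaConfig = TikaConfig.getDefaultConfig();
        Metadata metadata = new Metadata();
        AutoDetectParser parser = new AutoDetectParser(tikaConfig);
        // -1 disables the write limit on the extracted text
        ContentHandler handler = new BodyContentHandler(-1);

        // try-with-resources closes the stream even if parsing fails
        try (TikaInputStream stream = TikaInputStream.get(path, metadata)) {
            parser.parse(stream, handler, metadata, new ParseContext());
        } catch (IOException | SAXException | TikaException e) {
            System.out.println("error: " + e.toString());
            return;
        }

        // Prints the metadata and content...
        System.out.println("Parsed Metadata: ");
        System.out.println(metadata);
        System.out.println("Parsed Text: ");
        System.out.println(handler.toString());
    }
}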

Result with mvn exec:java (works as expected):

Parsed Metadata: 
... X-Parsed-By=org.apache.tika.parser.DefaultParser X-Parsed-By=org.apache.tika.parser.jpeg.JpegParser ... other metadatas ... 
Parsed Text: 

However, with:

mvn clean package
java -cp target/metadata-extractor-1.0-SNAPSHOT.jar org.company.myapp.App

I get:

Parsed Metadata: 
X-Parsed-By=org.apache.tika.parser.EmptyParser resourceName=image.jpg Content-Length=1557172 Content-Type=image/jpeg
Parsed Text:

What am I doing wrong? How do I have to build the project so that the parsers are auto-detected correctly?

Thanks.

1 Answer:

Answer 0 (score: 3)

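You have no parsers on your classpath, so EmptyParser is selected. I think the problem is in the shade plugin: minimizeJar strips classes that your own code does not reference directly, and Tika looks up its parsers at runtime through service-provider files rather than through direct references, so the parser classes likely look unused and get removed. Remove this line: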

<minimizeJar>true</minimizeJar>

and add these dependencies with the appropriate versions:

 <dependency>
     <groupId>org.apache.pdfbox</groupId>
     <artifactId>jbig2-imageio</artifactId>
 </dependency>
 <dependency>
     <groupId>com.github.jai-imageio</groupId>
     <artifactId>jai-imageio-core</artifactId>
 </dependency>
 <dependency>
     <groupId>com.github.jai-imageio</groupId>
     <artifactId>jai-imageio-jpeg2000</artifactId>
 </dependency>
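
To confirm whether the shaded jar really lost its parsers, the jar can be inspected directly. The following is a minimal sketch, not part of the original answer: the class name ShadedJarCheck and the method missingProviders are hypothetical, and it assumes Tika's parsers are registered in the standard ServiceLoader file META-INF/services/org.apache.tika.parser.Parser. It uses only the JDK, reads that file from the jar, and reports every listed class whose .class entry is absent.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

// Hypothetical helper for diagnosing a shaded jar; not part of Tika or Maven.
public class ShadedJarCheck {
    // Assumed location of Tika's parser registrations (standard ServiceLoader layout).
    static final String SERVICE_FILE = "META-INF/services/org.apache.tika.parser.Parser";

    /** Returns the class names listed in the service file that have no .class entry in the jar. */
    static List<String> missingProviders(String jarPath, String serviceFile) throws IOException {
        List<String> missing = new ArrayList<>();
        try (JarFile jar = new JarFile(jarPath)) {
            JarEntry services = jar.getJarEntry(serviceFile);
            if (services == null) {
                throw new IOException("no " + serviceFile + " in " + jarPath);
            }
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(jar.getInputStream(services), StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    // Service files allow '#' comments and blank lines.
                    int hash = line.indexOf('#');
                    if (hash >= 0) {
                        line = line.substring(0, hash);
                    }
                    String className = line.trim();
                    if (className.isEmpty()) {
                        continue;
                    }
                    // Map the fully qualified class name to its jar entry path.
                    String classEntry = className.replace('.', '/') + ".class";
                    if (jar.getJarEntry(classEntry) == null) {
                        missing.add(className);
                    }
                }
            }
        }
        return missing;
    }

    public static void main(String[] args) throws IOException {
        String jarPath = args.length > 0 ? args[0] : "target/metadata-extractor-1.0-SNAPSHOT.jar";
        List<String> missing = missingProviders(jarPath, SERVICE_FILE);
        System.out.println(missing.isEmpty()
                ? "All listed parser classes are present."
                : "Missing parser classes: " + missing);
    }
}
```

Run it against the jar built with and without minimizeJar. If the first run lists missing parser classes and the second does not, the minimization step is what removed them.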