Reading text from swf using StuartMacKay's transform-swf library

Asked: 2013-08-26 10:57:14

Tags: java text flash text-extraction

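I need to extract all the text from some swf files. I am using Java because I already have many modules developed in that language, so I searched the Web for free Java libraries that handle SWF files. In the end I found the library developed by StuartMacKay. The library, named transform-swf, is available on GitHub.

The question is: once I have extracted the GlyphIndexes from a TextSpan, how do I convert the glyphs into characters?

Please provide a complete, working, tested example. Theoretical answers will not be accepted, nor will answers such as "it can't be done" or "it's impossible".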

What I know and what I did

I know that GlyphIndexes are built by a TextTable from an integer representing the font size and a font description provided by a DefineFont2 object, but when I decode the DefineFont2 objects, every one of them has a zero-length advance list.

Here is what I did.

    // Create a Movie object from an swf file.
    Movie movie = new Movie();
    movie.decodeFromFile(new File(out));

    // Save all the decoded DefineFont2 objects.
    Map<Integer, DefineFont2> fonts = new HashMap<>();
    for (MovieTag object : list) {
        if (object instanceof DefineFont2) {
            DefineFont2 df2 = (DefineFont2) object;
            fonts.put(df2.getIdentifier(), df2);
        }
    }

    // Now I retrieve all the texts.
    for (MovieTag object : list) {
        if (object instanceof DefineText2) {
            DefineText2 dt2 = (DefineText2) object;
            for (TextSpan ts : dt2.getSpans()) {
                Integer fontIdentifier = ts.getIdentifier();
                if (fontIdentifier != null) {
                    int fontSize = ts.getHeight();
                    // Here I try to create an object that should
                    // reverse the process done by a TextTable
                    ReverseTextTable rtt = new ReverseTextTable(fonts.get(fontIdentifier), fontSize);
                    System.out.println(rtt.charactersForText(ts.getCharacters()));
                }
            }
        }
    }

The ReverseTextTable class is as follows:

    public final class ReverseTextTable {

        private final transient Map<Character, GlyphIndex> characters;
        private final transient Map<GlyphIndex, Character> glyphs;

        public ReverseTextTable(final DefineFont2 font, final int fontSize) {
            characters = new LinkedHashMap<>();
            glyphs = new LinkedHashMap<>();

            final List<Integer> codes = font.getCodes();
            final List<Integer> advances = font.getAdvances();
            final float scale = fontSize / EMSQUARE;
            final int count = codes.size();

            for (int i = 0; i < count; i++) {
                characters.put((char) codes.get(i).intValue(),
                        new GlyphIndex(i, (int) (advances.get(i) * scale)));
                glyphs.put(new GlyphIndex(i, (int) (advances.get(i) * scale)),
                        (char) codes.get(i).intValue());
            }
        }

        // This method should reverse from a list of GlyphIndexes to a String
        public String charactersForText(final List<GlyphIndex> list) {
            String text = "";
            for (GlyphIndex gi : list) {
                text += glyphs.get(gi);
            }
            return text;
        }
    }

Unfortunately, the advance list of DefineFont2 is empty, so the ReverseTextTable constructor throws an ArrayIndexOutOfBoundException.

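5 Answers:

Answer 0: (score: 1)

Honestly, I don't know how to do this in Java. I am not claiming it is impossible, and I believe there is a way to do it. However, you said there are many libraries that can do this, and you also suggested one, swftools. So I suggest going back to that library to extract the text from the Flash file. To do so, you can run it from the command line with Runtime.exec().

Personally, I prefer Apache Commons Exec over the standard library shipped with the JDK. Let me show you how to do it. The executable you need is "swfstrings.exe". Suppose it is placed in "C:\" and that the same folder contains a Flash file, e.g. page.swf. I then tried the following code (it works fine):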

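    import java.io.ByteArrayOutputStream;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import org.apache.commons.exec.CommandLine;
    import org.apache.commons.exec.DefaultExecutor;
    import org.apache.commons.exec.ExecuteException;
    import org.apache.commons.exec.PumpStreamHandler;

    // The statements below belong inside a method declared "throws IOException".
    Path pathToSwfFile = Paths.get("C:\\", "page.swf");
    CommandLine commandLine = CommandLine.parse("C:\\swfstrings.exe");
    commandLine.addArgument("\"" + pathToSwfFile.toString() + "\"");
    DefaultExecutor executor = new DefaultExecutor();
    executor.setExitValues(new int[]{0, 1}); //Notice that swfstrings.exe returns 1 for success,
                                            //0 for file not found, -1 for error

    ByteArrayOutputStream stdout = new ByteArrayOutputStream();
    PumpStreamHandler psh = new PumpStreamHandler(stdout);
    executor.setStreamHandler(psh);
    int exitValue;
    try {
        exitValue = executor.execute(commandLine);
    } catch (ExecuteException ex) {
        exitValue = ex.getExitValue(); // keep the exit code so the check below still works
        psh.stop();
    }
    if (!executor.isFailure(exitValue)) {
        String out = stdout.toString("UTF-8"); // here you have the extracted text
    }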

I know this is not the answer you asked for, but it works fine.
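For readers who would rather not add a dependency, the same idea can probably be expressed with the JDK's own ProcessBuilder. The following is only a minimal sketch, assuming Java 9 or later; the class name ProcessText and the method capture are hypothetical names chosen for illustration, and nothing here has been run against swfstrings itself.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

public final class ProcessText {

    private ProcessText() {
    }

    /**
     * Runs the given command and returns everything it printed, decoded as UTF-8.
     * Hypothetical helper: the command is whatever the caller supplies.
     */
    public static String capture(List<String> command)
            throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        // Merge stderr into stdout so a single reader drains all output.
        builder.redirectErrorStream(true);
        Process process = builder.start();
        byte[] output;
        try (InputStream in = process.getInputStream()) {
            output = in.readAllBytes(); // Java 9+
        }
        process.waitFor(); // let the process finish before returning
        return new String(output, StandardCharsets.UTF_8);
    }
}
```

With the paths from the answer above, a call would look like `ProcessText.capture(List.of("C:\\swfstrings.exe", "C:\\page.swf"))`. Passing the arguments as a list avoids manual quoting of paths that contain spaces.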

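Answer 1: (score: 0)

What you are trying to achieve seems very hard. You are trying to decompile the file, but I am sorry to say that it is not possible. What I suggest is converting it to a bitmap (if possible), or using any other method, and then trying OCR to read the characters.

There is some software that can do this, and you can also check some forums, because once an swf is compiled, reversing it is very difficult (as far as I know, it is impossible). If you like, you can look at a decompiler or try a project written in another language.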

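Answer 2: (score: 0)

I ran into a similar problem with long strings when using the transform-swf library.

Get the source code and debug it.
I believe there is a small bug in the class com.flagstone.transform.coder.SWFDecoder.

At line 540 (for version 3.0.2), change

    dest += length;

to

    dest += count;

That should do it for you (it concerns string extraction). I have also notified Stuart. The problem only shows up when the strings are very large.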
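To see why such a one-word change can matter only for large inputs, consider a generic read loop. The sketch below is not the SWFDecoder source; ChunkedRead and readFully are hypothetical names, and the loop only illustrates the general pattern in which a single read may return fewer bytes than requested, so the write offset has to advance by the number of bytes actually read.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

public final class ChunkedRead {

    private ChunkedRead() {
    }

    /** Reads exactly length bytes, tolerating reads that return less than requested. */
    public static byte[] readFully(InputStream in, int length) throws IOException {
        byte[] buffer = new byte[length];
        int dest = 0;
        while (dest < length) {
            int count = in.read(buffer, dest, length - dest);
            if (count < 0) {
                throw new EOFException("stream ended after " + dest + " of " + length + " bytes");
            }
            // Advance by the bytes actually read. Advancing by the requested
            // amount instead would skip over the part of the buffer that a
            // short read left unfilled.
            dest += count;
        }
        return buffer;
    }
}
```

Short reads are rare for small requests, which would be consistent with the problem appearing only for very large strings.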

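Answer 3: (score: 0)

I am currently trying to decompile SWF files in Java, and I came across this question while working out how to reverse engineer the original text.

After looking at the source code, I realized it is very simple. Each font has an assigned sequence of characters, which can be retrieved by calling DefineFont2.getCodes(), and the glyphIndex is the index of the matching character in DefineFont2.getCodes().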
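The lookup described above can be sketched without the library at all. In this minimal example the list codes stands in for the result of DefineFont2.getCodes(), and each integer in glyphIndices stands in for one GlyphIndex.getGlyphIndex() value; the class GlyphDecoder and the replacement-character fallback are assumptions made for illustration, not part of transform-swf.

```java
import java.util.List;

public final class GlyphDecoder {

    private GlyphDecoder() {
    }

    /**
     * Maps glyph indices back to text by using each index as a position
     * in the font's list of character codes.
     */
    public static String decode(List<Integer> codes, List<Integer> glyphIndices) {
        StringBuilder text = new StringBuilder();
        for (int glyphIndex : glyphIndices) {
            if (glyphIndex < 0 || glyphIndex >= codes.size()) {
                // The index is not covered by this font, which suggests the
                // text was paired with the wrong font; mark the position.
                text.append('\uFFFD');
            } else {
                text.appendCodePoint(codes.get(glyphIndex));
            }
        }
        return text.toString();
    }
}
```

For example, with codes for 'H', 'e', 'l', 'o' (72, 101, 108, 111) and glyph indices 0, 1, 2, 2, 3, the method returns "Hello".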

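However, if more than one font is used in a single SWF file, it is hard to match each DefineText with its corresponding DefineFont2, because a DefineText has no attribute that identifies its DefineFont2.

To solve this, I came up with a self-learning algorithm that tries to guess the correct DefineFont2 for each DefineText, so that the original text is derived correctly.

To reverse engineer the original text, I created a class named FontLearner: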

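import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;

import com.flagstone.transform.font.DefineFont2;
import com.flagstone.transform.text.DefineText;
import com.flagstone.transform.text.GlyphIndex;
import com.flagstone.transform.text.TextSpan;

public class FontLearner {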

    private final ArrayList<DefineFont2> fonts = new ArrayList<DefineFont2>();
    private final HashMap<Integer, HashMap<Character, Integer>> advancesMap = new HashMap<Integer, HashMap<Character, Integer>>();

    /**
     * The same characters from the same font will have similar advance values.
     * This constant defines the allowed difference between two advance values
     * before they are treated as the same character
     */
    private static final int ADVANCE_THRESHOLD = 10;

    /**
     * Some characters have outlier advance values despite being compared
     * to the same character
     * This constant defines the minimum accuracy level for each String
     * before it is associated with the given font
     */
    private static final double ACCURACY_THRESHOLD = 0.9;

    /**
     * This method adds a DefineFont2 to the learner, and a DefineText
     * associated with the font to teach the learner about the given font.
     * 
     * @param font The font to add to the learner
     * @param text The text associated with the font
     */
    public void addFont(DefineFont2 font, DefineText text) {
        fonts.add(font);
        HashMap<Character, Integer> advances = new HashMap<Character, Integer>();
        advancesMap.put(font.getIdentifier(), advances);

        List<Integer> codes = font.getCodes();

        List<TextSpan> spans = text.getSpans();
        for (TextSpan span : spans) {
            List<GlyphIndex> characters = span.getCharacters();
            for (GlyphIndex character : characters) {
                int glyphIndex = character.getGlyphIndex();
                char c = (char) (int) codes.get(glyphIndex);

                int advance = character.getAdvance();
                advances.put(c, advance);
            }
        }
    }

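    /**
     * Finds the learned font whose recorded advances best match the given
     * DefineText and decodes the text with that font's character codes.
     *
     * @param text The DefineText to retrieve the original String from
     * @return The String retrieved from the given DefineText, or an empty
     *         String if no learned font matches
     */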
    public String getString(DefineText text) {
        StringBuilder sb = new StringBuilder();

        List<TextSpan> spans = text.getSpans();

        DefineFont2 font = null;
        for (DefineFont2 getFont : fonts) {
            List<Integer> codes = getFont.getCodes();
            HashMap<Character, Integer> advances = advancesMap.get(getFont.getIdentifier());
            if (advances == null) {
                advances = new HashMap<Character, Integer>();
                advancesMap.put(getFont.getIdentifier(), advances);
            }

            boolean notFound = true;
            int totalMisses = 0;
            int totalCount = 0;

            for (TextSpan span : spans) {
                List<GlyphIndex> characters = span.getCharacters();
                totalCount += characters.size();

                int misses = 0;
                for (GlyphIndex character : characters) {
                    int glyphIndex = character.getGlyphIndex();
                    if (codes.size() > glyphIndex) {
                        char c = (char) (int) codes.get(glyphIndex);

                        Integer getAdvance = advances.get(c);
                        if (getAdvance != null) {
                            notFound = false;

                            if (Math.abs(character.getAdvance() - getAdvance) > ADVANCE_THRESHOLD) {
                                misses += 1;
                            }
                        }
                    } else {
                        notFound = false;
                        misses = characters.size();

                        break;
                    }
                }

                totalMisses += misses;
            }

            double accuracy = (totalCount - totalMisses) * 1.0 / totalCount;

            if (accuracy > ACCURACY_THRESHOLD && !notFound) {
                font = getFont;

                // teach this DefineText to the FontLearner if there are
                // any new characters
                for (TextSpan span : spans) {
                    List<GlyphIndex> characters = span.getCharacters();
                    for (GlyphIndex character : characters) {
                        int glyphIndex = character.getGlyphIndex();
                        char c = (char) (int) codes.get(glyphIndex);

                        int advance = character.getAdvance();
                        if (advances.get(c) == null) {
                            advances.put(c, advance);
                        }
                    }
                }
                break;
            }
        }

        if (font != null) {
            List<Integer> codes = font.getCodes();

            for (TextSpan span : spans) {
                List<GlyphIndex> characters = span.getCharacters();
                for (GlyphIndex character : characters) {
                    int glyphIndex = character.getGlyphIndex();
                    char c = (char) (int) codes.get(glyphIndex);
                    sb.append(c);
                }
                sb = new StringBuilder(sb.toString().trim());
                sb.append(" ");
            }
        }

        return sb.toString().trim();
    }
}

Usage:

Movie movie = new Movie();
movie.decodeFromStream(response.getEntity().getContent());

FontLearner learner = new FontLearner();
DefineFont2 font = null;

List<MovieTag> objects = movie.getObjects();
for (MovieTag object : objects) {
    if (object instanceof DefineFont2) {
        font = (DefineFont2) object;
    } else if (object instanceof DefineText) {
        DefineText text = (DefineText) object;
        if (font != null) {
            // the first DefineText after a DefineFont2 teaches the learner that font
            learner.addFont(font, text);
            font = null;
        }
        String line = learner.getString(text); // reverse engineers the line
    }
}

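I am happy to say that this approach reverse engineers the original strings with 100% accuracy using StuartMacKay's transform-swf library.

Answer 4: (score: 0)

I know this is not what you asked for, but I recently needed to extract text from SWF files in Java and found the ffdec library to be better than transform-swf.

Leave a comment if anyone needs sample code.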