How to split a large JSON file into chunks and sort it using GSON

Time: 2018-03-14 09:14:53

Tags: java json sorting parsing gson

I have a huge JSON file named something.json. The file is 20 MB. I am reading it with GSON on a stock Android Nexus 5X.

Example of the JSON:

[
    {"country":"UA","name":"Hurzuf","_id":707860,"coord":{"lon":34.283333,"lat":44.549999}},
    {"country":"UA","name":"Il’ichëvka","_id":707716,"coord":{"lon":34.383331,"lat":44.666668}},
    {"country":"BG","name":"Rastnik","_id":727762,"coord":{"lon":25.283331,"lat":41.400002}}
...
]

Code:

@Override
protected ArrayList<City> doInBackground(File... files) {
    ArrayList<City> cities = new ArrayList<>();
    try {
        InputStream is = new FileInputStream(files[0]);
        JsonReader reader = new JsonReader(new InputStreamReader(is, "UTF-8"));
        reader.beginArray();
        while (reader.hasNext()) {
            City city = new Gson().fromJson(reader, City.class);
            cities.add(city);
        }
        reader.endArray();
        reader.close();
    } catch (Exception e) {
        mResult.onFinish(cities, e.getMessage());
    }

    Collections.sort(cities, (o1, o2) -> o1.getName().compareTo(o2.getName()));
    mResult.onFinish(cities, CityService.SUCCESS);
    return cities;
}

Library used:

com.google.code.gson:gson:2.8.0

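It needs to work on Android API 16 up to the latest version.

I need to read this content into mCities and sort it alphabetically by city name. Right now this takes 3 minutes, and it has to be done in about 10 seconds. My approach is to cut the JSON file into 10 smaller chunks, read them, concatenate them, and sort them.

So my question is: how do I split the file into smaller chunks, and is this the right approach to the problem?

Link to the file: http://www.jimclermonts.nl/docs/cities.json

1 answer:

Answer 0 (score: 1)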

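I never do Android coding per se, but I have some notes, and perhaps some ideas for you, since this is pure Java. Your reader does a lot of excessive work while reading each element. First of all, you don't need to create a Gson instance every time you need one:

  • It is immutable and thread-safe.
  • It is relatively expensive to create.
  • Instantiating Gson instances also hits the heap, which costs more time to execute and then to garbage-collect.

Next, Gson makes a difference between full deserialization and JSON stream reading: the former may use a heavy composition of type adapters under the hood, whereas the latter merely parses the JSON document token by token. That said, you can gain better performance by reading the JSON stream directly: your JSON file is known to have a very strict structure, so a high-level parser for it can be implemented much more simply.

Suppose a simple test suite with different implementations for your problem:

Data objects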

City.java

final class City {

    @SerializedName("_id")
    final int id;

    @SerializedName("country")
    final String country;

    @SerializedName("name")
    final String name;

    @SerializedName("coord")
    final Coordinates coordinates;

    private City(final int id, final String country, final String name, final Coordinates coordinates) {
        this.id = id;
        this.country = country;
        this.name = name;
        this.coordinates = coordinates;
    }

    static City of(final int id, final String country, final String name, final Coordinates coordinates) {
        return new City(id, country, name, coordinates);
    }

    @Override
    public boolean equals(final Object o) {
        if ( this == o ) {
            return true;
        }
        if ( o == null || getClass() != o.getClass() ) {
            return false;
        }
        final City that = (City) o;
        return id == that.id;
    }

    @Override
    public int hashCode() {
        return id;
    }

    @SuppressWarnings("ConstantConditions")
    public static int compareByName(final City city1, final City city2) {
        return city1.name.compareTo(city2.name);
    }

}

Coordinates.java

final class Coordinates {

    @SerializedName("lat")
    final double latitude;

    @SerializedName("lon")
    final double longitude;

    private Coordinates(final double latitude, final double longitude) {
        this.latitude = latitude;
        this.longitude = longitude;
    }

    static Coordinates of(final double latitude, final double longitude) {
        return new Coordinates(latitude, longitude);
    }

    @Override
    public boolean equals(final Object o) {
        if ( this == o ) {
            return true;
        }
        if ( o == null || getClass() != o.getClass() ) {
            return false;
        }
        final Coordinates that = (Coordinates) o;
        return Double.compare(that.latitude, latitude) == 0
                && Double.compare(that.longitude, longitude) == 0;
    }

    @Override
    public int hashCode() {
        final long latitudeBits = Double.doubleToLongBits(latitude);
        final long longitudeBits = Double.doubleToLongBits(longitude);
        final int latitudeHash = (int) (latitudeBits ^ latitudeBits >>> 32);
        final int longitudeHash = (int) (longitudeBits ^ longitudeBits >>> 32);
        return 31 * latitudeHash + longitudeHash;
    }

}

Test infrastructure

ITest.java

interface ITest {

    @Nonnull
    default String getName() {
        return getClass().getSimpleName();
    }

    @Nonnull
    Collection<City> test(@Nonnull JsonReader jsonReader)
            throws IOException;

}

    public static void main(final String... args)
            throws IOException {
        final Iterable<ITest> tests = ImmutableList.of(
                FirstTest.get(),
                ReadAsWholeListTest.get(),
                ReadAsWholeTreeSetTest.get(),
                ReadJsonStreamIntoListTest.get(),
                ReadJsonStreamIntoTreeSetTest.get(),
                ReadJsonStreamIntoListChunksTest.get()
        );
        for ( int i = 0; i < 3; i++ ) {
            for ( final ITest test : tests ) {
                try ( final ZipInputStream zipInputStream = new ZipInputStream(Resources.getPackageResourceInputStream(Q49273660.class, "cities.json.zip")) ) {
                    for ( ZipEntry zipEntry = zipInputStream.getNextEntry(); zipEntry != null; zipEntry = zipInputStream.getNextEntry() ) {
                        if ( zipEntry.getName().equals("cities.json") ) {
                            final JsonReader jsonReader = new JsonReader(new InputStreamReader(zipInputStream)); // do not close
                            System.out.printf("%1$35s : ", test.getName());
                            final Stopwatch stopwatch = Stopwatch.createStarted();
                            final Collection<City> cities = test.test(jsonReader);
                            System.out.printf("in %d ms with %d elements\n", stopwatch.elapsed(TimeUnit.MILLISECONDS), cities.size());
                            assertSorted(cities, City::compareByName);
                        }
                    }
                }
            }
            System.out.println("--------------------");
        }
    }

    private static <E> void assertSorted(final Iterable<? extends E> iterable, final Comparator<? super E> comparator) {
        final Iterator<? extends E> iterator = iterable.iterator();
        if ( !iterator.hasNext() ) {
            return;
        }
        E a = iterator.next();
        if ( !iterator.hasNext() ) {
            return;
        }
        do {
            final E b = iterator.next();
            if ( comparator.compare(a, b) > 0 ) {
                throw new AssertionError(a + " " + b);
            }
            a = b;
        } while ( iterator.hasNext() );
    }

Tests

FirstTest.java

This is the slowest one, and it is just an adaptation of the code from your question to the tests.

final class FirstTest
        implements ITest {

    private static final ITest instance = new FirstTest();

    private FirstTest() {
    }

    static ITest get() {
        return instance;
    }

    @Nonnull
    @Override
    public List<City> test(@Nonnull final JsonReader jsonReader)
            throws IOException {
        jsonReader.beginArray();
        final List<City> cities = new ArrayList<>();
        while ( jsonReader.hasNext() ) {
            final City city = new Gson().fromJson(jsonReader, City.class);
            cities.add(city);
        }
        jsonReader.endArray();
        cities.sort(City::compareByName);
        return cities;
    }

}

ReadAsWholeListTest.java

This is most likely how you would implement it. It is not the winner, but it is the simplest one, and it uses the default sorting.

final class ReadAsWholeListTest
        implements ITest {

    private static final ITest instance = new ReadAsWholeListTest();

    private ReadAsWholeListTest() {
    }

    static ITest get() {
        return instance;
    }

    private static final Gson gson = new Gson();

    private static final Type citiesListType = new TypeToken<List<City>>() {
    }.getType();

    @Nonnull
    @Override
    public List<City> test(@Nonnull final JsonReader jsonReader) {
        final List<City> cities = gson.fromJson(jsonReader, citiesListType);
        cities.sort(City::compareByName);
        return cities;
    }

}

ReadAsWholeTreeSetTest.java

Another idea, if you are not bound to lists, is to use an already-sorting collection such as TreeSet. Since I don't know of a way to specify a TreeSet comparator in Gson, it has to use a custom type adapter factory (this is not necessary if City is already comparable by name, but that is not flexible).

final class ReadAsWholeTreeSetTest
        implements ITest {

    private static final ITest instance = new ReadAsWholeTreeSetTest();

    private ReadAsWholeTreeSetTest() {
    }

    static ITest get() {
        return instance;
    }

    @SuppressWarnings({ "rawtypes", "unchecked" })
    private static final TypeToken<TreeSet<?>> rawTreeSetType = (TypeToken) TypeToken.get(TreeSet.class);

    private static final Map<Type, Comparator<?>> comparatorsRegistry = ImmutableMap.of(
            City.class, (Comparator<City>) City::compareByName
    );

    private static final Gson gson = new GsonBuilder()
            .registerTypeAdapterFactory(new TypeAdapterFactory() {
                @Override
                public <T> TypeAdapter<T> create(final Gson gson, final TypeToken<T> typeToken) {
                    if ( !TreeSet.class.isAssignableFrom(typeToken.getRawType()) ) {
                        return null;
                    }
                    final Type elementType = ((ParameterizedType) typeToken.getType()).getActualTypeArguments()[0];
                    @SuppressWarnings({ "rawtypes", "unchecked" })
                    final Comparator<Object> comparator = (Comparator) comparatorsRegistry.get(elementType);
                    if ( comparator == null ) {
                        return null;
                    }
                    final TypeAdapter<TreeSet<?>> originalTreeSetTypeAdapter = gson.getDelegateAdapter(this, rawTreeSetType);
                    final TypeAdapter<?> originalElementTypeAdapter = gson.getDelegateAdapter(this, TypeToken.get(elementType));
                    final TypeAdapter<TreeSet<Object>> treeSetTypeAdapter = new TypeAdapter<TreeSet<Object>>() {
                        @Override
                        public void write(final JsonWriter jsonWriter, final TreeSet<Object> treeSet)
                                throws IOException {
                            originalTreeSetTypeAdapter.write(jsonWriter, treeSet);
                        }

                        @Override
                        public TreeSet<Object> read(final JsonReader jsonReader)
                                throws IOException {
                            jsonReader.beginArray();
                            final TreeSet<Object> elements = new TreeSet<>(comparator);
                            while ( jsonReader.hasNext() ) {
                                final Object element = originalElementTypeAdapter.read(jsonReader);
                                elements.add(element);
                            }
                            // consume the closing bracket so the reader is left in a consistent state
                            jsonReader.endArray();
                            return elements;
                        }
                    }.nullSafe();
                    @SuppressWarnings({ "rawtypes", "unchecked" })
                    final TypeAdapter<T> castTreeSetTypeAdapter = (TypeAdapter<T>) treeSetTypeAdapter;
                    return castTreeSetTypeAdapter;
                }
            })
            .create();

    private static final Type citiesSetType = new TypeToken<TreeSet<City>>() {
    }.getType();

    @Nonnull
    @Override
    public Set<City> test(@Nonnull final JsonReader jsonReader) {
        return gson.fromJson(jsonReader, citiesSetType);
    }

}

JSON stream reader tests

The following class is a special kind of reader test that uses a simplified strategy for reading your cities JSON.

AbstractJsonStreamTest.java

It is probably as simple as it can be (in terms of JSON structure analysis), and it requires the JSON document to be very strict.

abstract class AbstractJsonStreamTest
        implements ITest {

    protected static void read(final JsonReader jsonReader, final Consumer<? super City> cityConsumer)
            throws IOException {
        jsonReader.beginArray();
        while ( jsonReader.hasNext() ) {
            jsonReader.beginObject();
            require(jsonReader, "country");
            final String country = jsonReader.nextString();
            require(jsonReader, "name");
            final String name = jsonReader.nextString();
            require(jsonReader, "_id");
            final int id = jsonReader.nextInt();
            require(jsonReader, "coord");
            jsonReader.beginObject();
            require(jsonReader, "lon");
            final double longitude = jsonReader.nextDouble();
            require(jsonReader, "lat");
            final double latitude = jsonReader.nextDouble();
            jsonReader.endObject();
            jsonReader.endObject();
            final City city = City.of(id, country, name, Coordinates.of(latitude, longitude));
            cityConsumer.accept(city);
        }
        jsonReader.endArray();
    }

    private static void require(final JsonReader jsonReader, final String expectedName)
            throws IOException {
        final String actualName = jsonReader.nextName();
        if ( !actualName.equals(expectedName) ) {
            throw new JsonParseException("Expected " + expectedName + " but was " + actualName);
        }
    }

}

ReadJsonStreamIntoListTest.java

This one is much like ReadAsWholeListTest, but it uses the simplified deserialization mechanism.

final class ReadJsonStreamIntoListTest
        extends AbstractJsonStreamTest {

    private static final ITest instance = new ReadJsonStreamIntoListTest();

    private ReadJsonStreamIntoListTest() {
    }

    static ITest get() {
        return instance;
    }

    @Nonnull
    @Override
    public Collection<City> test(@Nonnull final JsonReader jsonReader)
            throws IOException {
        final List<City> cities = new ArrayList<>();
        read(jsonReader, cities::add);
        cities.sort(City::compareByName);
        return cities;
    }

}

ReadJsonStreamIntoTreeSetTest.java

This one, like the previous one, is another implementation of a more expensive test (ReadAsWholeTreeSetTest), but it does not require a custom type adapter.

final class ReadJsonStreamIntoTreeSetTest
        extends AbstractJsonStreamTest {

    private static final ITest instance = new ReadJsonStreamIntoTreeSetTest();

    private ReadJsonStreamIntoTreeSetTest() {
    }

    static ITest get() {
        return instance;
    }

    @Nonnull
    @Override
    public Collection<City> test(@Nonnull final JsonReader jsonReader)
            throws IOException {
        final Collection<City> cities = new TreeSet<>(City::compareByName);
        read(jsonReader, cities::add);
        return cities;
    }

}

ReadJsonStreamIntoListChunksTest.java

The following test is based on your initial idea, but it does not sort the chunks in parallel (I'm not sure it would help, but you can give it a try). I still think the previous two are simpler, probably easier to maintain, and give more performance gain.

final class ReadJsonStreamIntoListChunksTest
        extends AbstractJsonStreamTest {

    private static final ITest instance = new ReadJsonStreamIntoListChunksTest();

    private ReadJsonStreamIntoListChunksTest() {
    }

    static ITest get() {
        return instance;
    }

    @Nonnull
    @Override
    public List<City> test(@Nonnull final JsonReader jsonReader)
            throws IOException {
        final Collection<List<City>> cityChunks = new ArrayList<>();
        final AtomicReference<List<City>> cityChunkRef = new AtomicReference<>(new ArrayList<>());
        read(jsonReader, city -> {
            final List<City> cityChunk = cityChunkRef.get();
            cityChunk.add(city);
            if ( cityChunk.size() >= 10000 ) {
                cityChunks.add(cityChunk);
                cityChunkRef.set(new ArrayList<>());
            }
        });
        if ( !cityChunkRef.get().isEmpty() ) {
            cityChunks.add(cityChunkRef.get());
        }
        for ( final List<City> cities : cityChunks ) {
            Collections.sort(cities, City::compareByName);
        }
        return merge(cityChunks, City::compareByName);
    }

    /**
     * <p>Adapted from:</p>
     * <ul>
     * <li>Original question: https://stackoverflow.com/questions/1774256/java-code-review-merge-sorted-lists-into-a-single-sorted-list</li>
     * <li>Accepted answer: https://stackoverflow.com/questions/1774256/java-code-review-merge-sorted-lists-into-a-single-sorted-list/1775748#1775748</li>
     * </ul>
     */
    @SuppressWarnings("MethodCallInLoopCondition")
    private static <E> List<E> merge(final Iterable<? extends List<E>> lists, final Comparator<? super E> comparator) {
        int totalSize = 0;
        for ( final List<E> l : lists ) {
            totalSize += l.size();
        }
        final List<E> result = new ArrayList<>(totalSize);
        while ( result.size() < totalSize ) { // while we still have something to add
            List<E> lowest = null;
            for ( final List<E> l : lists ) {
                if ( !l.isEmpty() ) {
                    if ( lowest == null || comparator.compare(l.get(0), lowest.get(0)) <= 0 ) {
                        lowest = l;
                    }
                }
            }
            assert lowest != null;
            result.add(lowest.get(0));
            lowest.remove(0);
        }
        return result;
    }

}

Test results

On my desktop JRE I obtained the following test results:

                          FirstTest : in 5797 ms with 209557 elements
                ReadAsWholeListTest : in 796 ms with 209557 elements
             ReadAsWholeTreeSetTest : in 733 ms with 162006 elements
         ReadJsonStreamIntoListTest : in 461 ms with 209557 elements
      ReadJsonStreamIntoTreeSetTest : in 452 ms with 162006 elements
   ReadJsonStreamIntoListChunksTest : in 607 ms with 209557 elements
--------------------
                          FirstTest : in 3396 ms with 209557 elements
                ReadAsWholeListTest : in 493 ms with 209557 elements
             ReadAsWholeTreeSetTest : in 520 ms with 162006 elements
         ReadJsonStreamIntoListTest : in 385 ms with 209557 elements
      ReadJsonStreamIntoTreeSetTest : in 377 ms with 162006 elements
   ReadJsonStreamIntoListChunksTest : in 540 ms with 209557 elements
--------------------
                          FirstTest : in 3448 ms with 209557 elements
                ReadAsWholeListTest : in 429 ms with 209557 elements
             ReadAsWholeTreeSetTest : in 421 ms with 162006 elements
         ReadJsonStreamIntoListTest : in 400 ms with 209557 elements
      ReadJsonStreamIntoTreeSetTest : in 385 ms with 162006 elements
   ReadJsonStreamIntoListChunksTest : in 480 ms with 209557 elements
--------------------

As you can see, creating excessive Gson instances is definitely a bad idea, and the more optimized tests perform better. However, splitting a big list into sorted chunks to be merged later (without parallelism) does not give much of a performance gain in my environment.
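The chunk test sorts its chunks one after another and merges them by scanning every chunk for each output element. As a rough illustration of the "in parallel" variant mentioned earlier, here is a minimal sketch that sorts the chunks on a thread pool and merges them with a heap. It uses only the JDK; the names ChunkSorting and Cursor and the threads parameter are hypothetical and not part of the test suite, and I have not measured whether this helps on a real device.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper, not part of the original test suite
final class ChunkSorting {

    private ChunkSorting() {
    }

    /** Sorts every chunk in place on a fixed thread pool, then merges the sorted chunks. */
    static <E> List<E> sortChunksInParallelAndMerge(final List<List<E>> chunks,
            final Comparator<? super E> comparator, final int threads) {
        final ExecutorService executor = Executors.newFixedThreadPool(threads);
        try {
            final List<Future<?>> futures = new ArrayList<>();
            for ( final List<E> chunk : chunks ) {
                futures.add(executor.submit(() -> Collections.sort(chunk, comparator)));
            }
            for ( final Future<?> future : futures ) {
                future.get(); // wait for the chunk and surface any sorting failure
            }
        } catch ( final InterruptedException ex ) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(ex);
        } catch ( final ExecutionException ex ) {
            throw new IllegalStateException(ex.getCause());
        } finally {
            executor.shutdown();
        }
        return merge(chunks, comparator);
    }

    /** k-way merge: the heap holds one cursor per non-empty chunk, ordered by the cursor's current element. */
    static <E> List<E> merge(final List<List<E>> sortedChunks, final Comparator<? super E> comparator) {
        final Comparator<Cursor<E>> byCurrent = (c1, c2) -> comparator.compare(c1.current(), c2.current());
        final PriorityQueue<Cursor<E>> heap = new PriorityQueue<>(Math.max(1, sortedChunks.size()), byCurrent);
        int totalSize = 0;
        for ( final List<E> chunk : sortedChunks ) {
            totalSize += chunk.size();
            if ( !chunk.isEmpty() ) {
                heap.add(new Cursor<>(chunk));
            }
        }
        final List<E> result = new ArrayList<>(totalSize);
        while ( !heap.isEmpty() ) {
            final Cursor<E> cursor = heap.poll();
            result.add(cursor.current());
            if ( cursor.advance() ) {
                heap.add(cursor); // re-insert with its next element as the key
            }
        }
        return result;
    }

    /** Read-only position inside one sorted chunk; the chunk itself is never modified. */
    private static final class Cursor<E> {

        private final List<E> chunk;
        private int index;

        Cursor(final List<E> chunk) {
            this.chunk = chunk;
        }

        E current() {
            return chunk.get(index);
        }

        boolean advance() {
            return ++index < chunk.size();
        }

    }

}
```

With the classes above, a call could look like ChunkSorting.sortChunksInParallelAndMerge(cityChunks, City::compareByName, 4), provided cityChunks is a List of lists. Note that the heap does not guarantee any particular order between equal elements coming from different chunks.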

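For simplicity, and probably as the best choice, I would go with ReadJsonStreamInto_Collection_Test, depending on the required collection. I'm not really sure how well it would work in a real Android environment, but you can simply do some of the JSON deserialization somewhat better than Gson does using its internals.

By the way:

  • I'm not sure, but did you notice that there are 162006 unique cities? Your JSON file probably has some duplicates (at least if its _id is the identity).
  • What if you simply generated a sorted version of the JSON file on your workstation before using it on your Android device? Also, you may want to filter out duplicates if my assumption above is right.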
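One thing to keep in mind when reading the 162006 figure: a TreeSet treats two elements as the same whenever its comparator returns 0, so a set ordered by name alone also collapses different cities that happen to share a name. The sketch below illustrates the difference and shows one possible way to deduplicate by id before sorting, for example in a workstation-side preprocessing step. NamedCity and CityDeduplication are hypothetical stand-ins reduced to the two relevant fields, and the sample data is made up.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical helper, not part of the original test suite
final class CityDeduplication {

    private CityDeduplication() {
    }

    /** Hypothetical stand-in for City, reduced to the two fields that matter here. */
    static final class NamedCity {

        final int id;
        final String name;

        NamedCity(final int id, final String name) {
            this.id = id;
            this.name = name;
        }

    }

    // Cities with the same name compare as equal, whatever their ids are
    static final Comparator<NamedCity> BY_NAME = (c1, c2) -> c1.name.compareTo(c2.name);

    // Ties on name are broken by id, so only same-name, same-id cities compare as equal
    static final Comparator<NamedCity> BY_NAME_THEN_ID = (c1, c2) -> {
        final int byName = c1.name.compareTo(c2.name);
        return byName != 0 ? byName : Integer.compare(c1.id, c2.id);
    };

    /** Counts how many elements survive in a TreeSet ordered by the given comparator. */
    static int countDistinct(final List<NamedCity> cities, final Comparator<NamedCity> comparator) {
        final TreeSet<NamedCity> set = new TreeSet<>(comparator);
        set.addAll(cities);
        return set.size();
    }

    /** Keeps the first city seen for every id, then sorts the survivors by name. */
    static List<NamedCity> distinctByIdSortedByName(final List<NamedCity> cities) {
        final Map<Integer, NamedCity> byId = new LinkedHashMap<>();
        for ( final NamedCity city : cities ) {
            if ( !byId.containsKey(city.id) ) {
                byId.put(city.id, city);
            }
        }
        final List<NamedCity> result = new ArrayList<>(byId.values());
        Collections.sort(result, BY_NAME);
        return result;
    }

}
```

If the real file contains cities with equal names but different ids, a set ordered by BY_NAME would hold fewer elements than one ordered by BY_NAME_THEN_ID, so comparing the two counts is a quick way to check whether the missing elements are true duplicates or just namesakes.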