Question

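I'm new to GeoTools and am facing this problem: I'm loading about 2 MB of shapefile data (around 5,800 entries) into PostGIS, and surprisingly it takes about 6 minutes to complete! This is quite annoying, because my "real" data set may consist of up to 100 shapefile groups (shp, dbf, ...) of up to 25 MB each.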

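I was told that it might be an index issue, because PostgreSQL updates the table's indexes on every INSERT. Is there a way to "disable" these indexes during my bulk INSERTs and tell the database to build all of them at the end? Or is there a better way to do this?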

Here is my code snippet:

Map<String, Object> shpparams = new HashMap<String, Object>();
shpparams.put("url", "file://" + path);
FileDataStore shpStore = (FileDataStore) shpFactory.createDataStore(shpparams);
SimpleFeatureCollection features = shpStore.getFeatureSource().getFeatures();
if (schema == null) {
    // Copy schema and change name in order to refer to the same
    // global schema for all files
    SimpleFeatureType originalSchema = shpStore.getSchema();
    Name originalName = originalSchema.getName();
    NameImpl theName = new NameImpl(originalName.getNamespaceURI(), originalName.getSeparator(), POSTGIS_TABLENAME);
    schema = factory.createSimpleFeatureType(theName, originalSchema.getAttributeDescriptors(), originalSchema.getGeometryDescriptor(),
            originalSchema.isAbstract(), originalSchema.getRestrictions(), originalSchema.getSuper(), originalSchema.getDescription());
    pgStore.createSchema(schema);
}
// String typeName = shpStore.getTypeNames()[0];
SimpleFeatureStore featureStore = (SimpleFeatureStore) pgStore.getFeatureSource(POSTGIS_TABLENAME);

// Add the shapefile features to the PostGIS table
DefaultTransaction transaction = new DefaultTransaction("create");
featureStore.setTransaction(transaction);
try {
    featureStore.addFeatures(features);
    transaction.commit();
} catch (Exception problem) {
    LOGGER.error(problem.getMessage(), problem);
    transaction.rollback();
} finally {
    transaction.close();
}
shpStore.dispose();

Thanks for your help!

So I tested your solutions, but nothing helped... the completion time is still the same. Here is my table definition:

fid serial 10
the_geom geometry 2147483647
xxx varchar 10
xxx int4 10
xxx varchar 3
xxx varchar 2
xxx float8 17
xxx float8 17
xxx float8 17

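So I don't think the problem is directly related to my code or the database; it may be due to system limits (RAM, buffers, ...). I'll look into this over the next few days.

Do you have any more ideas?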

Answer 1

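I'm back with the solution to this problem. After a lot of investigation, I found that the physical network was the issue: with a local database (local to the GeoTools application) there was no problem. The network added 200 or 300 milliseconds to each INSERT statement request. With the large amount of data being loaded into the database, the total response time became very long!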
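To make the effect of per-statement latency concrete, here is a small back-of-the-envelope sketch. The class and method names are hypothetical, and the batch size of 500 is an assumption chosen only for illustration. The figures are approximate (the question reports about 6 minutes in total), so treat the output as an order-of-magnitude illustration rather than a measurement.

```java
class RoundTripEstimate {

    /** Total time in milliseconds when every round trip costs the same latency. */
    static long totalMillis(long roundTrips, long latencyMillis) {
        return roundTrips * latencyMillis;
    }

    /** Number of round trips needed when rows are sent in batches (ceiling division). */
    static long roundTrips(long rows, long batchSize) {
        return (rows + batchSize - 1) / batchSize;
    }

    public static void main(String[] args) {
        long rows = 5800;          // entry count from the question
        long latencyMillis = 200;  // lower bound of the latency reported above

        // One INSERT per row: 5800 round trips of 200 ms each.
        System.out.println("row by row: " + totalMillis(rows, latencyMillis) / 1000 + " s");

        // Assumed batch size of 500 rows per round trip.
        long trips = roundTrips(rows, 500);
        System.out.println("batched:    " + totalMillis(trips, latencyMillis) / 1000.0 + " s");
    }
}
```

The point of the comparison is that the cost scales with the number of round trips, not with the amount of data, which is why the same import is fast against a local database.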

So the original PostGIS configuration and my code snippet were fine...

Thanks to all of you for your participation.

Answer 2

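You can check whether indexes or PK/FK constraints in the database are really the bottleneck with the following steps:

1) Make sure the data is inserted in a single transaction (disable autocommit).

2) Drop all indexes and re-create them after the data import (you cannot disable an index):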

DROP INDEX my_index;
CREATE INDEX my_index ON my_table (my_column);

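3) Drop or disable PK/FK constraints, and re-create or re-enable them after the data import. You can skip the PK/FK constraint checks during the data import without dropping the constraints by using: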

ALTER TABLE my_table DISABLE trigger ALL;
-- data import
ALTER TABLE my_table ENABLE trigger ALL;

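The downside of this approach is that the constraints are not checked for data inserted or updated while the checks were disabled, and re-enabling the triggers does not re-check those rows. In PostgreSQL, DISABLE TRIGGER ALL affects the internal triggers that implement foreign key checks and requires superuser rights; primary key uniqueness is enforced by the underlying unique index, so it is still checked. If you instead drop the PK/FK constraints and re-create them after the data import, they are of course enforced for the existing data as well.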
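Steps 1) and 2) can be combined from plain JDBC roughly as follows. This is a minimal sketch, not the GeoTools API: the class `BulkLoadSketch`, the `ImportStep` callback and the method names are hypothetical, and `my_index`, `my_table` and `my_column` are the placeholder names used above. It relies on PostgreSQL running DDL such as DROP INDEX inside a transaction, so a failed import also rolls the drop back.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

class BulkLoadSketch {

    /** Hypothetical callback that performs the actual inserts on the given connection. */
    interface ImportStep {
        void run(Connection conn) throws SQLException;
    }

    // Identifiers are concatenated as-is, so they must come from trusted code, not user input.
    static String dropIndexSql(String index) {
        return "DROP INDEX IF EXISTS " + index;
    }

    static String createIndexSql(String index, String table, String column) {
        return "CREATE INDEX " + index + " ON " + table + " (" + column + ")";
    }

    static void load(Connection conn, String index, String table, String column,
                     ImportStep step) throws SQLException {
        boolean previousAutoCommit = conn.getAutoCommit();
        conn.setAutoCommit(false);                            // step 1: a single transaction
        try (Statement st = conn.createStatement()) {
            st.execute(dropIndexSql(index));                  // step 2: no index maintenance per row
            step.run(conn);                                   // the data import itself
            st.execute(createIndexSql(index, table, column)); // build the index once at the end
            conn.commit();
        } catch (SQLException | RuntimeException e) {
            conn.rollback();                                  // also undoes the DROP INDEX
            throw e;
        } finally {
            conn.setAutoCommit(previousAutoCommit);
        }
    }
}
```

A call would look like `BulkLoadSketch.load(conn, "my_index", "my_table", "my_column", c -> { /* inserts */ });`. For a PostGIS geometry column the index is normally a GiST index, so the CREATE INDEX statement would need a `USING GIST` clause.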

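You can also defer the checking of PK/FK constraints until the end of the transaction. This is possible if and only if the PK/FK constraint is defined as deferrable (which is not the default):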

ALTER TABLE my_table ADD PRIMARY KEY (id) DEFERRABLE INITIALLY DEFERRED;

START TRANSACTION;
-- data import
COMMIT; -- constraints are checked here

or

ALTER TABLE my_table ADD PRIMARY KEY (id) DEFERRABLE INITIALLY IMMEDIATE;

START TRANSACTION;
SET CONSTRAINTS ALL DEFERRED;
-- data import
COMMIT; -- constraints are checked here

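Edit

To narrow down the cause of the problem, you can import the data with your application, make a database dump (with INSERT statements), and import that dump again. This gives you an idea of how long a plain import takes and how much overhead the application adds.

Create a data-only dump of the database with INSERT statements (COPY statements would be faster, but your application also uses inserts, so this makes for a better comparison):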

pg_dump <database> --data-only --column-inserts -f data.sql

Re-create the empty database schema and import the data (the baseline time):

date; psql <database> --single-transaction -f data.sql > /dev/null; date

Maybe this gives you more insight into the problem.
