I'm busy implementing Solr for a website about holiday homes. The site uses Postgres as its primary database. For the search results we want to use Solr as the backend to retrieve the available holiday homes.
Parts of the database are imported using the DataImportHandler with a JdbcDataSource.
A shortened version of the DataImportHandler config:
<?xml version="1.0" encoding="UTF-8"?>
<dataConfig>
<dataSource type="JdbcDataSource" driver="org.postgresql.Driver" name="Solr"
url="jdbc:postgresql://host:port/database" user="*" password="*" readOnly="true"/>
<document>
<entity name="availabilities" transformer="RegexTransformer" pk="id"
query="
SELECT concat('A',pa.availability_id,'-',pad.start_date,'-',pad.period_type_id) as unique_availability_id,
pa.property_id,
NULLIF(CONCAT(ST_X(pl.position),',',ST_Y(pl.position)),',') as locationhash,
pl.position_accurate,
true as is_availability,
region.child_id as city_id,
region.ancestor_id as province_id,
(
SELECT array_to_string(array(SELECT binnen.ancestor_id
FROM fewo_Location_Ancestry binnen
WHERE binnen.child_id = region.child_id
AND binnen.ancestor_type_id = 12), ',')
) AS region_id,
pl.country_id,
pl.min_persons,
pl.max_persons,
fap.bedrooms,
pl.specifications,
pl.property_state_id,
pa.availability_id,
pad.period_type_id,
pad.start_date,
pad.end_date,
(
SELECT COUNT(*) > 0 FROM fewo_last_minute_details flmd
WHERE flmd.property_id = pa.property_id
AND flmd.details_id = pad.details_id
LIMIT 1
) AS last_minute,
CASE (
SELECT COUNT(*) > 0 FROM fewo_last_minute_details flmd
WHERE flmd.property_id = pa.property_id
AND flmd.details_id = pad.details_id
LIMIT 1
) WHEN true THEN pad.discount_price
ELSE pad.price
END as price,
pl.positioning_fee,
pl.sort_order
FROM fewo_property_availability_details pad
INNER JOIN fewo_property_availability pa USING (availability_id)
INNER JOIN fewo_Property_Location pl ON pa.property_id=pl.property_id
INNER JOIN fewo_all_properties fap ON pl.property_id=fap.property_id
INNER JOIN fewo_Location_Ancestry region ON (region.child_id =pl.location_id AND region.ancestor_type_id = 7)
WHERE pad.start_date > current_date
">
<field name="id" column="unique_availability_id"/>
<field name="property_id" column="property_id"/>
<field name="parent_id" column="property_id"/>
<field name="is_availability" column="is_availability"/>
<field name="positionCoord" column="locationhash"/>
<field name="position_accurate" column="position_accurate"/>
<field name="city_id" column="city_id"/>
<field name="province_id" column="province_id"/>
<field name="region_id" column="region_id" splitBy="," sourceColName="region_id"/>
<field name="country_id" column="country_id"/>
<field name="min_persons" column="min_persons"/>
<field name="max_persons" column="max_persons"/>
<field name="bedrooms" column="bedrooms"/>
<entity name="fewo_all_property_specifications" transformer="foo.SpecTransformer" pk="property_id"
cacheKey="property_id"
cacheLookup="availabilities.property_id"
query="SELECT property_id, specification_id, COALESCE(value,'true') as val FROM fewo_all_property_specifications"
processor="CachedSqlEntityProcessor">
</entity>
<field name="property_state_id" column="property_state_id"/>
<field name="availability_id" column="availability_id"/>
<field name="period_type_id" column="period_type_id"/>
<field name="start_date" column="start_date"/>
<field name="end_date" column="end_date"/>
<field name="last_minute" column="last_minute" />
<field name="price" column="price"/>
<field name="positioning_fee" column="positioning_fee"/>
<field name="sort_order" column="sort_order"/>
</entity>
</document>
</dataConfig>
The import takes about an hour and loads roughly 13 million records into Solr. The problem is that during the import the table fewo_property_availability_details cannot be updated, because an AccessShareLock is held on it. This blocks updates/inserts into the table, and those queries queue up. After a while too many of them have piled up and the database falls over.
My question is: is there a good way to import the data without getting in the way of regular queries too much? For example, starting a new transaction after every x imported records, so other queries get time to run?
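As an aside, the DataImportHandler itself does not expose per-x-records transaction chunking, but its JdbcDataSource does have a batchSize attribute, which is passed to the JDBC driver's setFetchSize(). With the Postgres driver this makes the result set stream in chunks rather than being materialized at once; a sketch (the value 10000 is illustrative, not from the question):

```xml
<!-- Sketch only: batchSize sets the JDBC fetch size, so rows are
     streamed in chunks instead of loaded into memory at once. It does
     NOT split the import into separate transactions. -->
<dataSource type="JdbcDataSource" driver="org.postgresql.Driver" name="Solr"
            url="jdbc:postgresql://host:port/database" user="*" password="*"
            readOnly="true" batchSize="10000"/>
```

Whether streaming kicks in depends on the driver's autocommit settings, so treat this as something to test rather than a guaranteed fix.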
I'm using Solr 4.0 and Postgres 9.1, running on Ubuntu 12.04.
Thanks
Answer 0 (score: 2)
An AccessShareLock conflicts only with an AccessExclusiveLock.
ACCESS EXCLUSIVE is acquired only by ALTER TABLE, DROP TABLE, TRUNCATE, REINDEX, CLUSTER, VACUUM FULL, and unqualified LOCK TABLE statements.
See the documentation on explicit locking to learn which commands take which locks. Check pg_locks to see whether you can get more information about the relations being locked; you can also find some useful locking queries on the PostgreSQL wiki.
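In that spirit, a minimal query against pg_locks and pg_stat_activity to see who is waiting on locks might look like the following (a sketch modeled on the wiki's lock-monitoring queries, not taken from the answer):

```sql
-- Show sessions that are waiting on a lock and what they are running.
-- Column names match PostgreSQL 9.1: procpid and current_query were
-- renamed to pid and query in 9.2.
SELECT l.locktype,
       l.relation::regclass AS relation,
       l.mode,
       a.procpid,
       a.current_query,
       now() - a.query_start AS waited
FROM pg_locks l
JOIN pg_stat_activity a ON l.pid = a.procpid
WHERE NOT l.granted
ORDER BY a.query_start;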
It's entirely possible that your database simply can't cope with the heavy concurrent read/write load, rather than a lock being the cause. This is especially likely if you have a small cache, you're running on plain (non-SSD) disks without a BBU RAID controller, and/or you haven't tuned the PostgreSQL configuration for your environment.
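For illustration, the kind of postgresql.conf knobs that tuning usually touches on a 9.1-era server; the values below are placeholders only and depend entirely on available RAM, disks, and workload:

```ini
# Illustrative starting points, not recommendations from the answer.
shared_buffers = 2GB            # default is far too small for heavy loads
effective_cache_size = 6GB      # planner hint: shared_buffers + OS cache
checkpoint_segments = 32        # 9.1-era setting; spaces out checkpoints
work_mem = 32MB                 # per-sort/per-hash memory
```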
Answer 1 (score: 0)
Alternatively, you could create a materialized view based on the select your DIH uses (possible in Oracle and MySQL), with the refresh option set to FAST (meaning the view always contains fresh data). What you would gain:
- faster imports
- no locks on the table
After that, you can run delta imports (instead of full imports) to fetch only new or changed data. See link.
Hope this helps.
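Note that PostgreSQL only gained native CREATE MATERIALIZED VIEW in 9.3, after the 9.1 used in the question. On 9.1 a snapshot table refreshed in a transaction is a rough substitute; a sketch, with an illustrative table name (availability_snapshot) and a trimmed-down version of the question's query:

```sql
-- Hypothetical snapshot table standing in for a materialized view.
CREATE TABLE availability_snapshot AS
SELECT pad.availability_id, pad.start_date, pad.end_date, pad.price
FROM fewo_property_availability_details pad
WHERE pad.start_date > current_date;

-- Refresh inside one transaction: concurrent readers see either the
-- old or the new snapshot. DELETE is used instead of TRUNCATE because
-- TRUNCATE would take the very ACCESS EXCLUSIVE lock we want to avoid.
BEGIN;
DELETE FROM availability_snapshot;
INSERT INTO availability_snapshot
SELECT pad.availability_id, pad.start_date, pad.end_date, pad.price
FROM fewo_property_availability_details pad
WHERE pad.start_date > current_date;
COMMIT;
```

Pointing the DIH query at the snapshot table would keep the hour-long import off the live table entirely.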