I have a table with 46,230 rows (1 document with 4,623 sentences and 10 chunks per sentence), which I query as follows:
SELECT
    a.sentenceid,
    b.sentenceid,
    a.chunkid,
    Length(Replace(Cast(a.chunk & b.chunk AS TEXT), '0', ''))::float / Length(a.chunk)::float
FROM chunks2 a
INNER JOIN chunks2 b
    ON a.sentenceid < b.sentenceid AND a.chunkid = b.chunkid;
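With this query I want to compare each sentence chunk with the chunks of other sentences that have the same chunk id. Its EXPLAIN ANALYZE output: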
Hash Join  (cost=1335.17..4549476.28 rows=71249559 width=26) (actual time=140.532..1160629.611 rows=106837530 loops=1)
  Hash Cond: (a.chunkid = b.chunkid)
  Join Filter: (a.sentenceid < b.sentenceid)
  Rows Removed by Join Filter: 106883760
  ->  Seq Scan on chunks2 a  (cost=0.00..757.30 rows=46230 width=15) (actual time=0.043..76.936 rows=46230 loops=1)
  ->  Hash  (cost=757.30..757.30 rows=46230 width=15) (actual time=140.056..140.056 rows=46230 loops=1)
        Buckets: 65536  Batches: 1  Memory Usage: 2680kB
        ->  Seq Scan on chunks2 b  (cost=0.00..757.30 rows=46230 width=15) (actual time=0.032..65.781 rows=46230 loops=1)
Planning time: 0.518 ms
Execution time: 1217920.271 ms
I ran EXPLAIN ANALYZE on the table without any index, with a composite index, and with both columns indexed individually:
No index:
Hash Join  (cost=1335.17..4549476.28 rows=71249559 width=26) (actual time=143.719..1155138.691 rows=106837530 loops=1)
  Hash Cond: (a.chunkid = b.chunkid)
  Join Filter: (a.sentenceid < b.sentenceid)
  Rows Removed by Join Filter: 106883760
  ->  Seq Scan on chunks2 a  (cost=0.00..757.30 rows=46230 width=15) (actual time=0.038..74.031 rows=46230 loops=1)
  ->  Hash  (cost=757.30..757.30 rows=46230 width=15) (actual time=142.160..142.160 rows=46230 loops=1)
        Buckets: 65536  Batches: 1  Memory Usage: 2680kB
        ->  Seq Scan on chunks2 b  (cost=0.00..757.30 rows=46230 width=15) (actual time=0.031..63.628 rows=46230 loops=1)
Planning time: 1.664 ms
Execution time: 1213844.696 ms
Index (sentenceid) & index (chunkid):
Hash Join  (cost=1335.17..4549476.28 rows=71249559 width=26) (actual time=144.376..1156178.110 rows=106837530 loops=1)
  Hash Cond: (a.chunkid = b.chunkid)
  Join Filter: (a.sentenceid < b.sentenceid)
  Rows Removed by Join Filter: 106883760
  ->  Seq Scan on chunks2 a  (cost=0.00..757.30 rows=46230 width=15) (actual time=0.039..77.275 rows=46230 loops=1)
  ->  Hash  (cost=757.30..757.30 rows=46230 width=15) (actual time=142.954..142.954 rows=46230 loops=1)
        Buckets: 65536  Batches: 1  Memory Usage: 2680kB
        ->  Seq Scan on chunks2 b  (cost=0.00..757.30 rows=46230 width=15) (actual time=0.031..64.340 rows=46230 loops=1)
Planning time: 1.209 ms
Execution time: 1212779.012 ms
Index (sentenceid, chunkid):
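I can see that they all perform the same operations and none of them uses an index. Where is my mistake, and how can I speed up the query with an index? Or how can I use indexes effectively in my case?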
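Answer 0 (score: 0)
You didn't include the verbatim CREATE INDEX statements; they would have been very helpful.
First: the general rule is to index for equality first, then for ranges.
So if you match chunkid by equality and sentenceid by a less-than comparison, you should create this index: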
create index chunks2_chunkid_idx on chunks2 (chunkid, sentenceid);
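As a rough illustration of that column order, the sketch below runs a lookup that pins chunkid with an equality and bounds sentenceid with a range. The constants 3 and 100 are made-up values, and whether the planner actually picks the index depends on the table statistics.

```sql
-- Hypothetical lookup: equality on the leading index column (chunkid),
-- range on the second column (sentenceid).
-- The values 3 and 100 are arbitrary examples, not taken from the question.
EXPLAIN
SELECT sentenceid, chunkid, chunk
FROM chunks2
WHERE chunkid = 3
  AND sentenceid < 100;
```

With (chunkid, sentenceid) the matching entries sit next to each other in the B-tree. With (sentenceid, chunkid) the scan would have to walk every entry with sentenceid < 100 and filter on chunkid afterwards.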
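Second: you are joining the whole table with itself. That will never be a cheap operation; Postgres works this out and skips the indexes entirely. Your plans show it: the two sequential scans finish in well under a second, while the join returns 106,837,530 rows and discards another 106,883,760, which is where nearly all of the roughly 20 minutes goes. An index only helps when the query touches a small part of the table.
I guess you are trying to find similar sentences, but I don't think this is the best way to approach it.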
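The size of the join result can be checked with a little arithmetic. This sketch assumes each of the 4,623 sentences has exactly one row for each of the 10 chunk ids, which is consistent with the 46,230 rows in the question.

```sql
-- Assumption: 4623 sentences, each with exactly one row per chunkid (10 chunkids).
-- For one chunkid there are 4623 * 4623 candidate pairs after the hash match.
SELECT
    10 * (4623 * 4622 / 2)               AS pairs_kept,      -- a.sentenceid <  b.sentenceid
    10 * (4623 * 4623 - 4623 * 4622 / 2) AS pairs_rejected;  -- a.sentenceid >= b.sentenceid
-- pairs_kept = 106837530, pairs_rejected = 106883760
```

Both numbers match the "rows=106837530" and "Rows Removed by Join Filter: 106883760" figures in the plans, so the query has to produce over a hundred million rows no matter which index exists.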