Question

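I have a solution in Java to a simply defined problem, but I would like to reduce the time it takes to process the data. The problem is to take a series of words held in a column of a relational database, split the words into pairs, and insert those pairs into a dictionary of pairs. The pairs themselves are tied to a product identified by partid.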

So the Part table has

PartID (int), PartDesc (nvarchar)

and the dictionary has

DictID (int), WordPair (nvarchar).

So the logic is:

insert into DictPair (wordpair, partid)
select wordpairs, partid from Part

A wordpair is defined as two adjacent words, so each interior word appears in two pairs. For example,

red car with 4 wheel drive

would be paired as

{red, car},{car, with}, {with,4}, {4, wheel}, {wheel, drive}

So the final dictionary for partid 45 would contain (partid, dictionarypair):

45, red car
45, car with
45, with 4
45, 4 wheel
45, wheel drive
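
For reference, the pairing rule above can be sketched in Java, the language of the existing solution. This is only a minimal illustration, not the asker's actual code: the class name WordPairs and the method pairs are hypothetical, and it assumes words are separated by whitespace.

```java
import java.util.ArrayList;
import java.util.List;

public class WordPairs {
    // Hypothetical helper: returns each adjacent word pair, in order,
    // joined by a single space. A description with fewer than two
    // words yields an empty list.
    public static List<String> pairs(String desc) {
        // Assumption: words are delimited by runs of whitespace.
        String[] words = desc.trim().split("\\s+");
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < words.length; i++) {
            out.add(words[i] + " " + words[i + 1]);
        }
        return out;
    }
}
```

Calling WordPairs.pairs("red car with 4 wheel drive") gives the five pairs listed above; each one would then be stored together with the partid.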

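This is used for product classification, so the order of the words within a pair matters (but the order of the pairs themselves does not).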

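Does anyone have ideas on how to solve this? I was thinking of a stored procedure that does some kind of parsing. For efficiency reasons, I would like the entire solution to be implemented in SQL.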

Answer 1

Basically, find a split() function on the web that returns each word together with its position in the string.

Then do:

select s.word, lead(s.word) over (partition by p.partId order by s.pos) as nextword
from parts p outer apply
     dbo.split(p.partDesc, ' ') as s(word, pos);

This produces NULL as the next word for the last word of each part, which doesn't seem to be what you want. So:

insert into DictPair (wordpair, partid)
    select word + ' ' + nextword, partid
    from (select p.*, s.word, lead(s.word) over (partition by p.partId order by s.pos) as nextword
          from parts p outer apply
               dbo.split(p.partDesc, ' ') as s(word, pos)
         ) t  -- SQL Server requires an alias on a derived table
    where nextword is not null;  -- skip the last word, which has no successor

Here are some split functions, found by Googling "SQL Server split". And another, from StackOverflow. And there are more.

Splitting pairs from column data in SQL Server
