For speed reasons, I had the idea of a bad-word filter written entirely in MySQL, but in my search I only found the MySQL REPLACE function.
REPLACE(string_column, 'search', 'replace')
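But with this function I can only replace one word at a time. Is there a string function in MySQL that can check the whole string and search for and replace multiple values taken from a table? (With PHP I know perfectly well how to do this simple task.)
Would a MySQL loop be a reasonable solution?
I'm glad for every hint.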
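Answer 0 (score: 3)
I'm posting this as a new answer because I'm using a different technique here. I thought we could use a MySQL function and a BEFORE INSERT trigger. The function that splits the string is from this other answer.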
CREATE FUNCTION strSplit(x VARCHAR(1000), delim VARCHAR(12), pos INTEGER)
RETURNS VARCHAR(1000)
BEGIN
  DECLARE output VARCHAR(1000);
  SET output = REPLACE(SUBSTRING(SUBSTRING_INDEX(x, delim, pos)
                     , CHAR_LENGTH(SUBSTRING_INDEX(x, delim, pos - 1)) + 1)
                     , delim
                     , '');
  IF output = '' THEN SET output = null; END IF;
  RETURN output;
END
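To make it easier to see what strSplit returns for each pos, here is a small Python model of the same expression. It is only an illustration, not part of the MySQL solution: the names substring_index and str_split are hypothetical stand-ins for SUBSTRING_INDEX and the function above, and the model assumes a non-empty delimiter and non-negative counts.

```python
def substring_index(text, delim, count):
    """Model of SUBSTRING_INDEX for count >= 0:
    everything to the left of the count-th occurrence of delim."""
    if count <= 0:
        return ""
    return delim.join(text.split(delim)[:count])


def str_split(text, delim, pos):
    """Model of strSplit: the pos-th piece of text, or None (NULL) if empty."""
    head = substring_index(text, delim, pos)
    # Characters already covered by the first pos - 1 pieces.
    skip = len(substring_index(text, delim, pos - 1))
    piece = head[skip:].replace(delim, "")
    return piece or None


print(str_split("you can kiss", " ", 2))  # can
print(str_split("you can kiss", " ", 4))  # None
print(str_split("a  b", " ", 2))          # None
```

In this model a doubled space also produces None at the position of the empty piece, so any loop that stops at the first NULL would stop there instead of at the end of the sentence.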
And the INSERT trigger would look like this:
CREATE TRIGGER change_words
BEFORE INSERT ON sentences
FOR EACH ROW
BEGIN
  DECLARE i INT;
  DECLARE s VARCHAR(1000);
  DECLARE r VARCHAR(1000);  -- result; starts as NULL, which CONCAT_WS skips
  SET i = 1;
  SET s = '';
  REPEAT
    -- take the i-th word and swap it for its replacement if it is listed in words
    SET s = (
      SELECT
        REPLACE(split, COALESCE(bad, ''), good)
      FROM
        (SELECT strSplit(new.sentence, ' ', i) AS split) s
        LEFT JOIN words w ON s.split = w.bad
      LIMIT 1
    );
    SET r = CONCAT_WS(' ', r, s);
    SET i = i + 1;
  UNTIL s IS NULL  -- strSplit returns NULL once there are no more words
  END REPEAT;
  SET new.sentence = r;
END
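This will be faster, because the sentence is converted only once, when you insert it into the database. Some improvements are still needed, the same ones as before: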
LEFT JOIN words w ON s.split = w.bad
will not match words that have a separator such as , . ! ? attached to them, and the REPLACE call
REPLACE(split, COALESCE(bad, ''), good)
is case sensitive. That is easy to fix if you want to. Please see the fiddle here.
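Answer 1 (score: 1)
I think it is better to use PHP here; unfortunately, MySQL (before version 8.0, which added REGEXP_REPLACE()) does not support replacing with regular expressions.
I'm answering because in your comment you said you want to learn something from MySQL, but I don't recommend using this solution unless you have no other choice :)
First, we can split our sentence into rows. We need a numbers table that contains the sequence 1, 2, 3, ... and so on.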
CREATE TABLE numbers (n INT PRIMARY KEY);
INSERT INTO numbers VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10);
SELECT
  s.id,
  n.n,
  SUBSTRING_INDEX(SUBSTRING_INDEX(s.sentence, ' ', n.n), ' ', -1) word
FROM
  numbers n INNER JOIN sentences s
  ON CHAR_LENGTH(s.sentence)
     -CHAR_LENGTH(REPLACE(s.sentence, ' ', ''))>=n.n-1
ORDER BY
  s.id, n.n
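The join condition counts the spaces in the sentence (its length minus its length with the spaces removed) and keeps only the numbers n up to the number of words; the nested SUBSTRING_INDEX then picks the last word of the first n words. A rough Python model of that logic, for illustration only (split_into_rows is a hypothetical name, and the default range mirrors the ten-row numbers table above):

```python
def split_into_rows(sentence, numbers=range(1, 11)):
    """Model of the numbers-table join: one (n, word) row per word."""
    # Same idea as CHAR_LENGTH(s) - CHAR_LENGTH(REPLACE(s, ' ', '')).
    spaces = len(sentence) - len(sentence.replace(" ", ""))
    rows = []
    for n in numbers:
        if spaces >= n - 1:  # the ON condition of the join
            first_n = " ".join(sentence.split(" ")[:n])  # inner SUBSTRING_INDEX
            word = first_n.split(" ")[-1]                # outer one, count -1
            rows.append((n, word))
    return rows


print(split_into_rows("Stan you can"))
# [(1, 'Stan'), (2, 'you'), (3, 'can')]
```

One consequence the model makes visible: the numbers table needs at least as many rows as the longest sentence has words. With only the values 1 to 10, a twelve-word sentence yields just ten rows.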
Then we can join this query to the table that contains the bad words to be replaced:
SELECT
  id,
  n,
  REPLACE(word, COALESCE(bad, ''), good) AS new_word
FROM (
  SELECT
    s.id,
    n.n,
    SUBSTRING_INDEX(SUBSTRING_INDEX(s.sentence, ' ', n.n), ' ', -1) word
  FROM
    numbers n INNER JOIN sentences s
    ON CHAR_LENGTH(s.sentence)
       -CHAR_LENGTH(REPLACE(s.sentence, ' ', ''))>=n.n-1
  ORDER BY
    s.id, n.n
  ) w LEFT JOIN words
  ON w.word = bad
Note the LEFT JOIN and the COALESCE(..., ''). Finally, using GROUP BY and GROUP_CONCAT, you can get the string back:
SELECT
  id,
  GROUP_CONCAT(new_word ORDER BY n SEPARATOR ' ') AS new_sentence
FROM (
  SELECT
    id,
    n,
    REPLACE(word, COALESCE(bad, ''), good) AS new_word
  FROM (
    SELECT
      s.id,
      n.n,
      SUBSTRING_INDEX(SUBSTRING_INDEX(s.sentence, ' ', n.n), ' ', -1) word
    FROM
      numbers n INNER JOIN sentences s
      ON CHAR_LENGTH(s.sentence)
         -CHAR_LENGTH(REPLACE(s.sentence, ' ', ''))>=n.n-1
    ORDER BY
      s.id, n.n
    ) w LEFT JOIN words
    ON w.word = bad
  ) s
GROUP BY
  id  -- one rebuilt sentence per original row
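Please see it working here. I don't recommend using this solution: it won't be very performant, and it is more of a "hack" than a real solution. It is better to use PHP here, but I hope you can learn something new from this answer :)
Some simple improvements can be made: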
ON w.word = bad
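This will match only words that are exactly the same (possibly case-insensitively, depending on how the table is defined), and it does not handle separators like , . ! ? etc.
And: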
REPLACE(word, COALESCE(bad, ''), good) AS new_word
is case sensitive. It can be improved, but I suggest you make these improvements in PHP :)
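For comparison, here is a minimal sketch of what the application-side version of those improvements could look like. It is written in Python rather than PHP purely for illustration; clean_sentence and the sample word pairs are hypothetical, and the bad words are assumed to consist of letters and digits only, so that the \b word boundary behaves as expected.

```python
import re


def clean_sentence(sentence, replacements):
    """Replace whole words listed in replacements (bad -> good),
    ignoring case and leaving surrounding punctuation in place."""
    if not replacements:
        return sentence
    lookup = {bad.lower(): good for bad, good in replacements.items()}
    # Longest alternatives first, so a longer entry wins over a shorter prefix.
    alternatives = sorted(map(re.escape, lookup), key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(alternatives) + r")\b",
                         re.IGNORECASE)
    return pattern.sub(lambda m: lookup[m.group(0).lower()], sentence)


words = {"god": "gosh", "rumpus": "rear section"}  # sample data, not from a real table
print(clean_sentence("Oh God, kiss my Rumpus!", words))
# Oh gosh, kiss my rear section!
print(clean_sentence("a godly man", words))
# a godly man
```

The word-boundary match is what handles the , . ! ? separators, and the lowercased lookup is what makes the replacement case-insensitive; the pairs themselves would typically be loaded from the words table once and reused.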
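Answer 2 (score: 1)
Since this is about learning, and about doing it in MySQL only (as opposed to doing it the better way in PHP, and boy, is there a better way to do it), I came up with something that shows off a few MySQL techniques.
About the only thing that is really interesting to me is the set @sql1 line. One could probably write a few long paragraphs about it. For now, I just present it.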
-- drop table badGood;
create table badGood
( -- maps bad words to good
id int auto_increment primary key,
bad varchar(100) not null,
good varchar(100) not null,
dtAdded datetime not null
);
-- truncate table badGood;
insert badGood(bad,good,dtAdded) values ('god','gosh',now()),('rumpus','rear section',now());
-- drop table posts;
create table posts
( postId int auto_increment primary key,
orig varchar(1000) not null,
cleanified varchar(1000) not null,
dtAdded datetime not null, -- when it was inserted into system, ready for cleaning
dtCleaned datetime null, -- when it was cleaned
isViewable int not null -- or bool, whatever. 0=No (not ready yet), 1=Yes (clean)
);
-- truncate table posts;
-- drop table xxx;
create table xxx
( -- this table will contain one row for every word passed to stored proc,
-- ordered by word sequence left to right in sentence
-- order by meaning column "id" (auto_inc). Note, there is no guarantee, in fact expect it not to happen,
-- that for any given postId, that the id's will be consecutive, but they will be in order
--
-- Reason being, multiple concurrent access of posts coming in
--
-- Decided against making this a temp table inside stored proc, but it was considered
id int auto_increment primary key,
postId int not null, -- avoid FK for now due to speed
word varchar(50) not null, -- word as presented by poster guy
word2 varchar(50) null, -- a more rated-G version of the word that is substituted
isDirty int not null, -- or bool, whatever. 0=clean, 1=naughty
key(postId)
);
-- truncate table xxx;
DROP PROCEDURE IF EXISTS cleanAndInsert;
delimiter $$
CREATE PROCEDURE cleanAndInsert
( suspectTxt varchar(255) # this text is suspect. Might contain non G-rated words
# other parameters too probably
)
BEGIN
declare insertedId int; -- this will house the PK value of the postId
insert posts(orig,cleanified,dtAdded,dtCleaned,isViewable) values (suspectTxt,'',now(),null,0); -- insert the passed string
set @insertedId:=LAST_INSERT_ID(); # now we have the PK id just inserted
-- the concat routine below is VERY FRAGILE to write, so as the sql string is slowly tweaked into perfection, with one working at that moment
-- I rem it out and create a new version under it, so the slightest error does not set me back 10 minutes (at least)
-- SET @sql1 = CONCAT("INSERT INTO xxx (word) VALUES ('",REPLACE((SELECT GROUP_CONCAT(orig) AS colx FROM posts where id=1), " ", "',null,0),('"),"');");
-- SET @sql1 = CONCAT("INSERT INTO xxx (postId,word) VALUES (",@insertedId,",'",REPLACE((SELECT GROUP_CONCAT(orig) AS colx posts where postId=@insertedId), " ", "',null,0),('"),"',null,0);");
SET @sql1 = CONCAT("INSERT INTO xxx (postId,word,word2,isDirty) VALUES (",@insertedId,",'",REPLACE((SELECT GROUP_CONCAT(orig) as colx FROM posts where postId=@insertedId), " ", "',null,0),(¿^?fish╔&®,'"),"',null,0);");
-- select @sql1; -- debugging purposes, rem'd out
-- Ideally @insertedId is inserted in the SET @sql1 line a few above, and NOT with the fish hard-coded bizareness, but it was too fragile
-- and time consuming. So this is an ugly hack and nothing to be proud of. So fixing it is a "TO DO"
set @sql2=replace(@sql1,'¿^?fish╔&®',@insertedId); -- This is the insert statement to run to blast out the words
-- select @sql2; -- debugging purposes, rem'd out.
PREPARE stmt FROM @sql2; -- you now have a prepared stmt string to execute (which inserts words into table xxx)
EXECUTE stmt;
DEALLOCATE PREPARE stmt; -- release the prepared statement once it has run
-- now the clean word section
update xxx x
join badGood bg
on bg.bad=x.word
set x.isDirty=1,x.word2=bg.good
where postId=@insertedId;
-- I know, this is lame, but it allows us to use word2 simply as the final word and simplify our UPDATE posts after this block
update xxx
set word2=word
where postId=@insertedId and isDirty=0;
-- now the update section, to save the much cleaner string out to the posts table
update posts
set cleanified=
( select group_concat(word2 ORDER BY id SEPARATOR ' ') as xyz
from xxx where postId=@insertedId
), isViewable=1, dtCleaned=now()
where postId=@insertedId;
-- one could do a "delete from xxx where postId=@insertedId" if they wanted to. I kept it for debugging. Others delete the rows
select @insertedId as id; -- useful for calling routine, telling it the PK value
END
$$
delimiter ;
In PHP, you would just call it with a normal query, with the $sql string beginning with "call ..."
call cleanAndInsert('I type acceptable sentences'); -- returns 1 row, id is 1
call cleanAndInsert('Stan you can kiss my rumpus'); -- returns 1 row, id is 2
-- note this is very easy to trick, such as a naughty word not surrounded by whitespace, or broken out with spaces like "r u m p u s"
select * from posts order by postId desc;
+--------+-----------------------------+-----------------------------------+---------------------+---------------------+------------+
| postId | orig                        | cleanified                        | dtAdded             | dtCleaned           | isViewable |
+--------+-----------------------------+-----------------------------------+---------------------+---------------------+------------+
|      2 | Stan you can kiss my rumpus | Stan you can kiss my rear section | 2015-09-22 11:08:29 | 2015-09-22 11:08:29 |          1 |
|      1 | I type acceptable sentences | I type acceptable sentences       | 2015-09-22 11:08:23 | 2015-09-22 11:08:23 |          1 |
+--------+-----------------------------+-----------------------------------+---------------------+---------------------+------------+
This is for learning. Take it for what it is.