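Writing a native bad-word filter in MySQL without PHP

Date: 2015-09-21 22:41:33

Tags: mysql

For speed reasons, I would like to write a bad-word filter entirely in MySQL, but in my searching I only found the MySQL REPLACE function.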

REPLACE(string_column, 'search', 'replace')

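But with this function I can only replace one word at a time. Is there a string function in MySQL that can check the whole string and search for and replace multiple values taken from a table? (With PHP I know exactly how to do this simple task.)

Would a MySQL loop be a reasonable solution?

I'm grateful for any hint.

3 Answers:

Answer 0: (score: 3)

I'm posting this as a new answer because I use a different technique here. I thought we could use a MySQL function and a BEFORE INSERT trigger. The function that splits the string is from this other answer: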

CREATE FUNCTION strSplit(x VARCHAR(1000), delim VARCHAR(12), pos INTEGER) 
RETURNS VARCHAR(1000)
BEGIN
  DECLARE output VARCHAR(1000);
  SET output = REPLACE(SUBSTRING(SUBSTRING_INDEX(x, delim, pos)
                 , CHAR_LENGTH(SUBSTRING_INDEX(x, delim, pos - 1)) + 1)
                 , delim
                 , '');
  IF output = '' THEN SET output = null; END IF;
  RETURN output;
END
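
A quick way to see what strSplit returns is to call it directly. This sketch assumes the function above has already been created (in the mysql command-line client the CREATE FUNCTION statement needs a DELIMITER change, because the body contains semicolons); the sample sentence is made up for illustration.

```sql
-- pos is 1-based; asking for a position past the last word yields NULL,
-- which is what lets a caller detect the end of the sentence.
SELECT strSplit('kiss my rumpus', ' ', 1) AS w1,  -- 'kiss'
       strSplit('kiss my rumpus', ' ', 3) AS w3,  -- 'rumpus'
       strSplit('kiss my rumpus', ' ', 4) AS w4;  -- NULL
```

Note that an empty token (for example between two consecutive spaces) also comes back as NULL, so a loop that stops at the first NULL will stop there too.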

And the INSERT trigger would look like this:

CREATE TRIGGER change_words
BEFORE INSERT ON sentences
FOR EACH ROW
BEGIN
  DECLARE i INT;
  DECLARE s VARCHAR(1000);
  DECLARE r VARCHAR(1000);
  SET i = 1;
  SET s = '';
  REPEAT
    SET s = (
      SELECT
        REPLACE(split, COALESCE(bad, ''), good)
      FROM
        (SELECT strSplit(new.sentence, ' ', i) AS split) s
        LEFT JOIN words w ON s.split = w.bad
      LIMIT 1
      );
    SET r = CONCAT_WS(' ', r, s);
    SET i = i + 1;
    UNTIL s IS NULL
  END REPEAT;
  SET new.sentence = r;
END
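
To try the trigger, something like the following could be used. The table definitions are an assumption: the answer only shows the column names it relies on (sentences.sentence, words.bad, words.good), so the types, keys and sample rows here are hypothetical. The tables have to exist before the trigger is created.

```sql
-- Assumed schema; only the column names come from the answer.
CREATE TABLE sentences (
  id       INT AUTO_INCREMENT PRIMARY KEY,
  sentence VARCHAR(1000)
);

CREATE TABLE words (
  bad  VARCHAR(100) PRIMARY KEY,
  good VARCHAR(100) NOT NULL
);

INSERT INTO words (bad, good) VALUES ('rumpus', 'rear');

-- Create strSplit and the change_words trigger at this point.

-- The trigger rewrites NEW.sentence before the row is stored,
-- so the SELECT shows the filtered text, not the submitted one.
INSERT INTO sentences (sentence) VALUES ('Stan you can kiss my rumpus');
SELECT id, sentence FROM sentences;
```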

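This will be faster, because the sentence is converted only once, when you insert it into the database. It still needs some improvements, the same ones as before: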

LEFT JOIN words w ON s.split = w.bad

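It won't match words that have a separator attached, such as , . ! ? and the REPLACE call

REPLACE(split, COALESCE(bad, ''), good)

will be case sensitive. This is easy to fix if you want to. Please see the fiddle here.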
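
One possible way to address both points, as a rough sketch rather than the answer's own fix: compare lower-cased values, and peel off a single trailing separator before matching. The names token, core and tail are hypothetical, and the word list is inlined as a derived table so the query runs on its own. It also uses a CASE expression instead of REPLACE with COALESCE, so it does not depend on how REPLACE behaves when its third argument is NULL.

```sql
SELECT
  p.token,
  CASE
    WHEN w.bad IS NULL THEN p.token          -- no match: keep the token as typed
    ELSE CONCAT(w.good, p.tail)              -- match: substitute, keep the separator
  END AS new_word
FROM (
  SELECT
    token,
    -- core: the token without one trailing separator, if it has one
    CASE WHEN RIGHT(token, 1) IN (',', '.', '!', '?')
         THEN LEFT(token, CHAR_LENGTH(token) - 1)
         ELSE token END AS core,
    -- tail: the separator that was peeled off, or an empty string
    CASE WHEN RIGHT(token, 1) IN (',', '.', '!', '?')
         THEN RIGHT(token, 1)
         ELSE '' END AS tail
  FROM (
    SELECT 'Rumpus!' AS token
    UNION ALL SELECT 'hello,'
    UNION ALL SELECT 'world'
  ) raw
) p
LEFT JOIN (SELECT 'rumpus' AS bad, 'rear' AS good) w
  ON LOWER(p.core) = LOWER(w.bad);
-- 'Rumpus!' -> 'rear!', 'hello,' -> 'hello,', 'world' -> 'world'
```

This only handles one trailing separator; leading punctuation or runs such as "?!" would need more work.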

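Answer 1: (score: 1)

I think it's better to use PHP here; unfortunately MySQL, as of this writing, doesn't support replacing with regular expressions.

I'm answering because in your comment you said you wanted to learn something about MySQL, but I don't recommend using this solution unless you have no other choice :)

First we can split our sentence into rows. We need a numbers table that contains the sequence 1, 2, 3, ... and so on.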

CREATE TABLE numbers (n INT PRIMARY KEY);
INSERT INTO numbers VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10);

SELECT
  s.id,
  n.n,
  SUBSTRING_INDEX(SUBSTRING_INDEX(s.sentence, ' ', n.n), ' ', -1) word
FROM
  numbers n INNER JOIN sentences s
  ON CHAR_LENGTH(s.sentence)
     -CHAR_LENGTH(REPLACE(s.sentence, ' ', ''))>=n.n-1
ORDER BY
  s.id, n.n
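
To see how the nested SUBSTRING_INDEX calls and the join condition work together, here is the same expression on a single hard-coded sentence. The inline derived tables stand in for numbers and sentences and are only there to make the sketch self-contained.

```sql
SELECT
  n.n,
  -- inner call keeps the first n words, outer call keeps the last of those
  SUBSTRING_INDEX(SUBSTRING_INDEX(t.sentence, ' ', n.n), ' ', -1) AS word
FROM (SELECT 1 AS n UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4) n
JOIN (SELECT 'kiss my rumpus' AS sentence) t
  -- (number of spaces) >= n - 1, i.e. n does not exceed the word count
  ON CHAR_LENGTH(t.sentence) - CHAR_LENGTH(REPLACE(t.sentence, ' ', '')) >= n.n - 1
ORDER BY n.n;
-- n=1 -> 'kiss', n=2 -> 'my', n=3 -> 'rumpus'; n=4 is excluded by the ON condition
```

The same condition also means that the numbers table caps the sentence length: with only 1 to 10 in it, words after the tenth are silently dropped, so the table should hold at least as many rows as the longest sentence has words.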

Then we can join this query to the table containing the bad words that have to be replaced:

SELECT
  id,
  n,
  REPLACE(word, COALESCE(bad, ''), good) AS new_word
FROM (
  SELECT
    s.id,
    n.n,
    SUBSTRING_INDEX(SUBSTRING_INDEX(s.sentence, ' ', n.n), ' ', -1) word
  FROM
    numbers n INNER JOIN sentences s
    ON CHAR_LENGTH(s.sentence)
       -CHAR_LENGTH(REPLACE(s.sentence, ' ', ''))>=n.n-1
  ORDER BY
    s.id, n.n
  ) w LEFT JOIN words
  ON w.word = bad

Note the LEFT JOIN and the COALESCE(..., ''). Finally, using GROUP BY and GROUP_CONCAT, you can get the string back:

SELECT
  id,
  GROUP_CONCAT(new_word ORDER BY n SEPARATOR ' ') AS new_sentence
FROM (
  SELECT
    id,
    n,
    REPLACE(word, COALESCE(bad, ''), good) AS new_word
  FROM (
    SELECT
      s.id,
      n.n,
      SUBSTRING_INDEX(SUBSTRING_INDEX(s.sentence, ' ', n.n), ' ', -1) word
    FROM
      numbers n INNER JOIN sentences s
      ON CHAR_LENGTH(s.sentence)
         -CHAR_LENGTH(REPLACE(s.sentence, ' ', ''))>=n.n-1
    ORDER BY
      s.id, n.n
    ) w LEFT JOIN words
    ON w.word = bad
  ) s
GROUP BY
  id
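
One thing to keep in mind with GROUP_CONCAT: its result is cut off at group_concat_max_len, which defaults to 1024 in MySQL, so long sentences can come back truncated. A minimal sketch, assuming a session-level change is acceptable (the value 10000 is arbitrary):

```sql
-- Check the current limit, then raise it for this session only.
SELECT @@SESSION.group_concat_max_len;
SET SESSION group_concat_max_len = 10000;
```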

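Please see it working here. I wouldn't recommend this solution: it won't be very performant, and it is more of a "hack" than a real solution, so it is better to use PHP here. But I hope you can learn something new from this answer :)

Some easy improvements can be made: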

ON w.word = bad

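This will only match exactly the same word (it may be case insensitive, depending on how the table is defined), and it doesn't support separators like , . ! ? etc.

And:

REPLACE(word, COALESCE(bad, ''), good) AS new_word

will be case sensitive. It can be improved, but I suggest you make those improvements in PHP :)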

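Answer 2: (score: 1)

Since this is about learning, and about doing it in MySQL only (as opposed to doing it the better way in PHP, and boy, are there better ways to do it), I put something together to learn a few MySQL techniques.

About the only part that was really interesting to me is the set @sql1 line. One could probably write a few big paragraphs about it. For now, I'm just presenting it.

Schema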

-- drop table badGood;
create table badGood
(   -- maps bad words to good
    id int auto_increment primary key,
    bad varchar(100) not null,
    good varchar(100) not null,
    dtAdded datetime not null
);
-- truncate table badGood;
insert badGood(bad,good,dtAdded) values ('god','gosh',now()),('rumpus','rear section',now());

-- drop table posts;
create table posts
(   postId int auto_increment primary key,
    orig varchar(1000) not null,
    cleanified varchar(1000) not null,
    dtAdded datetime not null, -- when it was inserted into system, ready for cleaning
    dtCleaned datetime null,    -- when it was cleaned
    isViewable int not null -- or bool, whatever. 0=No (not ready yet), 1=Yes (clean)
);
-- truncate table posts;

-- drop table xxx;
create table xxx
(   -- this table will contain one row for every word passed to stored proc,
    -- ordered by word sequence left to right in sentence
    -- order by meaning column "id" (auto_inc). Note, there is no guarantee, in fact expect it not to happen,
    -- that for any given postId, that the id's will be consecutive, but they will be in order
    --
    -- Reason being, multiple concurrent access of posts coming in
    --
    -- Decided against making this a temp table inside stored proc, but it was considered
    id int auto_increment primary key,
    postId int not null,    -- avoid FK for now due to speed
    word varchar(50) not null,  -- word as presented by poster guy
    word2 varchar(50) null, -- a more rated-G version of the word that is substituted
    isDirty int not null,   -- or bool, whatever. 0=clean, 1=naughty
    key(postId)
);
-- truncate table xxx;

Stored procedure

DROP PROCEDURE IF EXISTS cleanAndInsert;
delimiter $$
CREATE PROCEDURE cleanAndInsert
(   suspectTxt varchar(1000) # this text is suspect. Might contain non G-rated words. Sized to match posts.orig
    # other parameters too probably
)
BEGIN
    declare insertedId int; -- this will house the PK value of the postId

    insert posts(orig,cleanified,dtAdded,dtCleaned,isViewable) values (suspectTxt,'',now(),null,0); -- insert the passed string
    set @insertedId:=LAST_INSERT_ID();  # now we have the PK id just inserted
    -- the concat routine below is VERY FRAGILE to write, so as the sql string is slowly tweaked into perfection, with one working at that moment
    -- I rem it out and create a new version under it, so the slightest error does not set me back 10 minutes (at least)
    -- SET @sql1 = CONCAT("INSERT INTO xxx (word) VALUES ('",REPLACE((SELECT GROUP_CONCAT(orig) AS colx FROM posts where id=1), " ", "',null,0),('"),"');");
    -- SET @sql1 = CONCAT("INSERT INTO xxx (postId,word) VALUES (",@insertedId,",'",REPLACE((SELECT GROUP_CONCAT(orig) AS colx posts where postId=@insertedId), " ", "',null,0),('"),"',null,0);");
    SET @sql1 = CONCAT("INSERT INTO xxx (postId,word,word2,isDirty) VALUES (",@insertedId,",'",REPLACE((SELECT GROUP_CONCAT(orig) as colx FROM posts where postId=@insertedId), " ", "',null,0),(¿^?fish╔&®,'"),"',null,0);");
    -- select @sql1;    -- debugging purposes, rem'd out

    -- Ideally @insertedId is inserted in the SET @sql1 line a few above, and NOT with the fish hard-coded bizarreness, but it was too fragile
    -- and time consuming. So this is an ugly hack and nothing to be proud of. So fixing it is a "TO DO"
    set @sql2=replace(@sql1,'¿^?fish╔&®',@insertedId); -- This is the insert statement to run to blast out the words
    -- select @sql2; -- debugging purposes, rem'd out.

    PREPARE stmt FROM @sql2;    -- you now have a prepared stmt string to execute (which inserts words into table xxx)
    EXECUTE stmt;
    DEALLOCATE PREPARE stmt;    -- release the prepared statement once it has run

    -- now the clean word section
    update xxx x
    join badGood bg
    on bg.bad=x.word
    set x.isDirty=1,x.word2=bg.good
    where postId=@insertedId;

    -- I know, this is lame, but it allows us to use word2 simply as the final word and simplify our UPDATE posts after this block
    update xxx
    set word2=word
    where postId=@insertedId and isDirty=0;

    -- now the update section, to save the much cleaner string out to the posts table
    update posts
    set cleanified=
    (  select group_concat(word2 ORDER BY id SEPARATOR ' ') as xyz
       from xxx where postId=@insertedId
    ), isViewable=1, dtCleaned=now()
    where postId=@insertedId;

    -- one could do a "delete from xxx where postId=@insertedId" if they wanted to. I kept it for debugging. Others delete the rows

    select @insertedId as id;   -- useful for calling routine, telling it the PK value
END
$$
delimiter ;
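
The string-building trick in the set @sql1 line can be looked at in isolation. The sketch below is one possible variant, not the procedure's actual code: it nests a CONCAT inside the REPLACE so the post id is spliced in directly instead of through a placeholder, and it doubles apostrophes first so a word like "don't" does not break the generated statement. The variable names and sample values are hypothetical, and like the original it assumes the ANSI_QUOTES SQL mode is off so that double quotes delimit strings.

```sql
SET @id  = 2;                               -- hypothetical post id
SET @txt = 'kiss my rumpus';                -- hypothetical input text

-- Double any apostrophes so they survive inside the quoted literals.
SET @escaped = REPLACE(@txt, "'", "''");

-- Every space becomes "close this row, open the next one".
SET @sql = CONCAT(
  "INSERT INTO xxx (postId,word,word2,isDirty) VALUES (", @id, ",'",
  REPLACE(@escaped, " ", CONCAT("',null,0),(", @id, ",'")),
  "',null,0);"
);

SELECT @sql;
-- INSERT INTO xxx (postId,word,word2,isDirty) VALUES (2,'kiss',null,0),(2,'my',null,0),(2,'rumpus',null,0);
```

Backslashes in the input are not handled here, and consecutive spaces would produce empty words, so this is still only suitable as a learning exercise.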

Testing (calling the stored procedure)

In PHP, you would just run it as a normal query, with the $sql string beginning with "call ..."

call cleanAndInsert('I type acceptable sentences'); -- returns 1 row, id is 1
call cleanAndInsert('Stan you can kiss my rumpus'); -- returns 1 row, id is 2
-- note this is very easy to trick, such as a naughty word not surrounded by whitespace, or broken out with spaces like "r u m p u s"

Results

select * from posts order by postId desc;
+--------+-----------------------------+-----------------------------------+---------------------+---------------------+------------+
| postId | orig                        | cleanified                        | dtAdded             | dtCleaned           | isViewable |
+--------+-----------------------------+-----------------------------------+---------------------+---------------------+------------+
|      2 | Stan you can kiss my rumpus | Stan you can kiss my rear section | 2015-09-22 11:08:29 | 2015-09-22 11:08:29 |          1 |
|      1 | I type acceptable sentences | I type acceptable sentences       | 2015-09-22 11:08:23 | 2015-09-22 11:08:23 |          1 |
+--------+-----------------------------+-----------------------------------+---------------------+---------------------+------------+

Closing

This was for learning. Take it as such.