I have a set of stop words set up as an array:
$stopwords = array(
"a ",
"about ",
"above ",
"above ",
"across ",
"after ",
"afterwards ",
"again ",
"against ",
"all ",
"almost ",
"alone ",
"along ",
"already ",
"also ",
"although ",
"always ",
"am ",
"among ",
"amongst ",
"amoungst ",
"amount ",
"an ",
"and ",
"another ",
"any ",
"anyhow ",
"anyone ",
"anything ",
"anyway ",
"anywhere ",
"are ",
"around ",
"as ",
"at ",
"back ",
"be ",
"became ",
"because ",
"become ",
"becomes ",
"becoming ",
"been ",
"before ",
"beforehand ",
"behind ",
"being ",
"below ",
"beside ",
"besides ",
"between ",
"beyond ",
"bill ",
"both ",
"bottom ",
"but ",
"by ",
"can ",
"cannot ",
"cant ",
"co ",
"con ",
"could ",
"couldnt ",
"cry ",
"considered ",
"describe ",
"detail ",
"do ",
"did ",
"done ",
"down ",
"due ",
"during ",
"each ",
"eg ",
"eight ",
"either ",
"eleven ",
"else ",
"elsewhere ",
"empty ",
"enough ",
"etc ",
"even ",
"ever ",
"every ",
"everyone ",
"everything ",
"everywhere ",
"except ",
"few ",
"fifteen ",
"fify ",
"fill ",
"find ",
"fire ",
"five ",
"for ",
"former ",
"formerly ",
"forty ",
"found ",
"four ",
"from ",
"front ",
"full ",
"further ",
"get ",
"give ",
"go ",
"had ",
// "has ",
"hasnt ",
"have ",
"he ",
"hence ",
"her ",
"here ",
"hereafter ",
"hereby ",
"herein ",
"hereupon ",
"hers ",
"herself ",
"him ",
"himself ",
"his ",
"how ",
"however ",
"hundred ",
"ie ",
"if ",
"In",
"inc ",
"indeed ",
"interest ",
"into ",
"is ",
"it ",
"its ",
"itself ",
"keep ",
"known ",
// "last ",
"latter ",
"latterly ",
"least ",
"legend ",
"less ",
"ltd ",
// "made ",
"many ",
"may ",
"me ",
"meanwhile ",
"might ",
"mill ",
"mine ",
"more ",
"moreover ",
// "most ",
"mostly ",
"move ",
"much ",
"must ",
"my ",
"myself ",
"name ",
"namely ",
"neither ",
"never ",
"nevertheless ",
"next ",
"nine ",
"no ",
"nobody ",
"none ",
"noone ",
"nor ",
"nothing ",
"now ",
"nowhere ",
"of ",
"off ",
"often ",
"on ",
"once ",
"one ",
"only ",
"onto ",
"or ",
"other ",
"others ",
"otherwise ",
"our ",
"ours ",
"ourselves ",
"out ",
// "over ",
"own ",
"part ",
"per ",
"perhaps ",
"please ",
"popular ",
"put ",
"rather ",
"re ",
"same ",
"see ",
"seem ",
"seemed ",
"seeming ",
"seems ",
"serious ",
"several ",
"she ",
"should ",
"show ",
"since ",
"sincere ",
"six ",
"sixty ",
"so ",
"some ",
"somehow ",
"someone ",
"something ",
"sometime ",
"sometimes ",
"somewhere ",
"still ",
"such ",
"take ",
"technique ",
"ten ",
"than ",
"that ",
"the ",
"their ",
"them ",
"themselves ",
"then ",
"thence ",
"there ",
"thereafter ",
"thereby ",
"therefore ",
"therein ",
"thereupon ",
"these ",
"they ",
"thickv ",
"term ",
"thin ",
"third ",
"this ",
"those ",
"though ",
"three ",
"through ",
"throughout ",
"thru ",
"thus ",
"to ",
"together ",
"too ",
"top ",
"toward ",
"towards ",
"twelve ",
"twenty ",
"two ",
"un ",
"under ",
"until ",
"up ",
"upon ",
"us ",
"very ",
"via ",
"was ",
"we ",
"well ",
"were ",
"what ",
"whatever ",
"when ",
"whence ",
"whenever ",
"where ",
"whereafter ",
"whereas ",
"whereby ",
"wherein ",
"whereupon ",
"wherever ",
"whether ",
"which ",
"while ",
"whither ",
"who ",
"whoever ",
"whole ",
"whom ",
"whose ",
"why ",
"will ",
"with ",
"within ",
"without ",
"would ",
"yet ",
"you ",
"your ",
"yours ",
"yourself ",
"yourselves ",
"the ",
"likely ",
"names "
);
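You may have noticed the trailing spaces. I added them to avoid cutting strings apart, hoping to replace only whole matches from my stop word list (with an empty value).
Realizing that str_replace is probably the weaker choice in both features and merit, I turned to building a preg_replace array, trying to match whole words with a regex that uses word boundaries.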
$pregreplacestopwords = array(
"/\ba\b/",
"/\babout\b/",
"/\babove\b/",
"/\babove\b/",
"/\bacross\b/",
"/\bafter\b/",
"/\bafterwards\b/",
"/\bagain\b/",
"/\bagainst\b/",
"/\ball\b/",
"/\balmost\b/",
"/\balone\b/",
"/\balong\b/",
"/\balready\b/",
"/\balso\b/",
"/\balthough\b/",
"/\balways\b/",
"/\bam\b/",
"/\bamong\b/",
"/\bamongst\b/",
"/\bamoungst\b/",
"/\bamount\b/",
"/\ban\b/",
"/\band\b/",
"/\banother\b/",
"/\bany\b/",
"/\banyhow\b/",
"/\banyone\b/",
"/\banything\b/",
"/\banyway\b/",
"/\banywhere\b/",
"/\bare\b/",
"/\baround\b/",
"/\bas\b/",
"/\bat\b/",
"/\bback\b/",
"/\bbe\b/",
"/\bbecame\b/",
"/\bbecause\b/",
"/\bbecome\b/",
"/\bbecomes\b/",
"/\bbecoming\b/",
"/\bbeen\b/",
"/\bbefore\b/",
"/\bbeforehand\b/",
"/\bbehind\b/",
"/\bbeing\b/",
"/\bbelow\b/",
"/\bbeside\b/",
"/\bbesides\b/",
"/\bbetween\b/",
"/\bbeyond\b/",
"/\bbill\b/",
"/\bboth\b/",
"/\bbottom\b/",
"/\bbut\b/",
"/\bby\b/",
"/\bcan\b/",
"/\bcannot\b/",
"/\bcant\b/",
"/\bco\b/",
"/\bcon\b/",
"/\bcould\b/",
"/\bcouldnt\b/",
"/\bcry\b/",
"/\bconsidered\b/",
"/\bdescribe\b/",
"/\bdetail\b/",
"/\bdo\b/",
"/\bdid\b/",
"/\bdone\b/",
"/\bdown\b/",
"/\bdue\b/",
"/\bduring\b/",
"/\beach\b/",
"/\beg\b/",
"/\beight\b/",
"/\beither\b/",
"/\beleven\b/",
"/\belse\b/",
"/\belsewhere\b/",
"/\bempty\b/",
"/\benough\b/",
"/\betc\b/",
"/\beven\b/",
"/\bever\b/",
"/\bevery\b/",
"/\beveryone\b/",
"/\beverything\b/",
"/\beverywhere\b/",
"/\bexcept\b/",
"/\bfew\b/",
"/\bfifteen\b/",
"/\bfify\b/",
"/\bfill\b/",
"/\bfind\b/",
"/\bfire\b/",
"/\bfive\b/",
"/\bfor\b/",
"/\bformer\b/",
"/\bformerly\b/",
"/\bforty\b/",
"/\bfound\b/",
"/\bfour\b/",
"/\bfrom\b/",
"/\bfront\b/",
"/\bfull\b/",
"/\bfurther\b/",
"/\bget\b/",
"/\bgive\b/",
"/\bgo\b/",
"/\bhad\b/",
"/\b//has\b/",
"/\bhasnt\b/",
"/\bhave\b/",
"/\bhe\b/",
"/\bhence\b/",
"/\bher\b/",
"/\bhere\b/",
"/\bhereafter\b/",
"/\bhereby\b/",
"/\bherein\b/",
"/\bhereupon\b/",
"/\bhers\b/",
"/\bherself\b/",
"/\bhim\b/",
"/\bhimself\b/",
"/\bhis\b/",
"/\bhow\b/",
"/\bhowever\b/",
"/\bhundred\b/",
"/\bie\b/",
"/\bif\b/",
"/\bIn\b/",
"/\binc\b/",
"/\bindeed\b/",
"/\binterest\b/",
"/\binto\b/",
"/\bis\b/",
"/\bit\b/",
"/\bits\b/",
"/\bitself\b/",
"/\bkeep\b/",
"/\bknown\b/",
"/\b//last\b/",
"/\blatter\b/",
"/\blatterly\b/",
"/\bleast\b/",
"/\blegend\b/",
"/\bless\b/",
"/\bltd\b/",
"/\b//made\b/",
"/\bmany\b/",
"/\bmay\b/",
"/\bme\b/",
"/\bmeanwhile\b/",
"/\bmight\b/",
"/\bmill\b/",
"/\bmine\b/",
"/\bmore\b/",
"/\bmoreover\b/",
"/\bmost\b/",
"/\bmostly\b/",
"/\bmove\b/",
"/\bmuch\b/",
"/\bmust\b/",
"/\bmy\b/",
"/\bmyself\b/",
"/\bname\b/",
"/\bnamely\b/",
"/\bneither\b/",
"/\bnever\b/",
"/\bnevertheless\b/",
"/\bnext\b/",
"/\bnine\b/",
"/\bno\b/",
"/\bnobody\b/",
"/\bnone\b/",
"/\bnoone\b/",
"/\bnor\b/",
"/\bnothing\b/",
"/\bnow\b/",
"/\bnowhere\b/",
"/\bof\b/",
"/\boff\b/",
"/\boften\b/",
"/\bon\b/",
"/\bonce\b/",
"/\bone\b/",
"/\bonly\b/",
"/\bonto\b/",
"/\bor\b/",
"/\bother\b/",
"/\bothers\b/",
"/\botherwise\b/",
"/\bour\b/",
"/\bours\b/",
"/\bourselves\b/",
"/\bout\b/",
"/\b//over\b/",
"/\bown\b/",
"/\bpart\b/",
"/\bper\b/",
"/\bperhaps\b/",
"/\bplease\b/",
"/\bpopular\b/",
"/\bput\b/",
"/\brather\b/",
"/\bre\b/",
"/\bsame\b/",
"/\bsee\b/",
"/\bseem\b/",
"/\bseemed\b/",
"/\bseeming\b/",
"/\bseems\b/",
"/\bserious\b/",
"/\bseveral\b/",
"/\bshe\b/",
"/\bshould\b/",
"/\bshow\b/",
"/\bsince\b/",
"/\bsincere\b/",
"/\bsix\b/",
"/\bsixty\b/",
"/\bso\b/",
"/\bsome\b/",
"/\bsomehow\b/",
"/\bsomeone\b/",
"/\bsomething\b/",
"/\bsometime\b/",
"/\bsometimes\b/",
"/\bsomewhere\b/",
"/\bstill\b/",
"/\bsuch\b/",
"/\btake\b/",
"/\btechnique\b/",
"/\bten\b/",
"/\bthan\b/",
"/\bthat\b/",
"/\bthe\b/",
"/\btheir\b/",
"/\bthem\b/",
"/\bthemselves\b/",
"/\bthen\b/",
"/\bthence\b/",
"/\bthere\b/",
"/\bthereafter\b/",
"/\bthereby\b/",
"/\btherefore\b/",
"/\btherein\b/",
"/\bthereupon\b/",
"/\bthese\b/",
"/\bthey\b/",
"/\bthickv\b/",
"/\bterm\b/",
"/\bthin\b/",
"/\bthird\b/",
"/\bthis\b/",
"/\bthose\b/",
"/\bthough\b/",
"/\bthree\b/",
"/\bthrough\b/",
"/\bthroughout\b/",
"/\bthru\b/",
"/\bthus\b/",
"/\bto\b/",
"/\btogether\b/",
"/\btoo\b/",
"/\btop\b/",
"/\btoward\b/",
"/\btowards\b/",
"/\btwelve\b/",
"/\btwenty\b/",
"/\btwo\b/",
"/\bun\b/",
"/\bunder\b/",
"/\buntil\b/",
"/\bup\b/",
"/\bupon\b/",
"/\bus\b/",
"/\bvery\b/",
"/\bvia\b/",
"/\bwas\b/",
"/\bwe\b/",
"/\bwell\b/",
"/\bwere\b/",
"/\bwhat\b/",
"/\bwhatever\b/",
"/\bwhen\b/",
"/\bwhence\b/",
"/\bwhenever\b/",
"/\bwhere\b/",
"/\bwhereafter\b/",
"/\bwhereas\b/",
"/\bwhereby\b/",
"/\bwherein\b/",
"/\bwhereupon\b/",
"/\bwherever\b/",
"/\bwhether\b/",
"/\bwhich\b/",
"/\bwhile\b/",
"/\bwhither\b/",
"/\bwho\b/",
"/\bwhoever\b/",
"/\bwhole\b/",
"/\bwhom\b/",
"/\bwhose\b/",
"/\bwhy\b/",
"/\bwill\b/",
"/\bwith\b/",
"/\bwithin\b/",
"/\bwithout\b/",
"/\bwould\b/",
"/\byet\b/",
"/\byou\b/",
"/\byour\b/",
"/\byours\b/",
"/\byourself\b/",
"/\byourselves\b/",
"/\bthe\b/",
"/\blikely\b/",
"/\bnames\b/"
);
Creating a blank array for it:
$pgreplace = array();
Let's take the word "B.A." as an example and put it into a string variable, making a fun sentence.
$string = 'I got my “B.A.” from...';
Some of the ways I've tried just butcher the stop words. Trying things like
preg_replace($pregreplacestopwords, $pregreplacestopwords, $string);
is just riddled with errors:
Warning: preg_replace(): Compilation failed: missing terminating ] for character class at offset 1951 in C:\wamp64\www\pg\test.php on line 664
Warning: preg_replace(): Empty regular expression in C:\wamp64\www\pg\test.php on line 666
NULL
Warning: preg_replace(): Unknown modifier '/' in C:\wamp64\www\pg\test.php on line 670
NULL
Via $implodestopwords = implode("|", array_map("trim",array_filter($stopwords))); I get
a|about|above|above|across|after|afterwards|again|against|all|almost|alone|along|already|also
and so on.
Trying to put this into action:
$pattern = '/\b(' . $implodestopwords . ')\b/i';
$string = preg_replace($pattern, "", $string);
var_dump($string);
Output:
I got  “B..” ...
How can I modify preg_replace so that it matches only exact words and removes them, given a large list of words in an array?
The full script is here: https://pastebin.com/vwbNjhs9
Answer 0 (score: 0)
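Perhaps instead of using preg_replace(), you could just convert the string into an array and then loop over it, checking whether each word is in your stop words array.
Try this and see if it works: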
$string = 'I got my "B.A." from...';
$string = preg_replace('/\s{1,}/', ' ', $string); //<--Ensure only one space between words.
$array = explode(' ', $string);
for($i = 0; $i < count($array); $i++){
    if(in_array($array[$i] . ' ', $stopwords)){ //<--Only concatenated a space because of the
                                                //trailing spaces in your stopwords array.
        $array[$i] = ''; //<--Removed the word.
    }
}
$newString = implode(' ', $array); //<--Turn the array back into a string.
echo $newString; //<---Outputs 'I got  "B.A." from...' (a double space is left where "my" was).
This approach gives you a lot of control over what you decide to do with each word that is found.
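If you would rather stay with preg_replace(), here is a minimal sketch of one way it might be done. It is an alternative to the loop, not a drop-in fix for the original script. The "A" in "B.A." disappears in the question because \b also matches between "." and "A". Swapping \b for the lookarounds (?<!\S) and (?!\S) only accepts a word that is delimited by whitespace or the ends of the string. The warnings about an unknown modifier most likely come from entries such as "/\b//has\b/", where the second "/" ends the pattern early. Running each word through preg_quote() avoids that kind of breakage. The function name removeStopwords and the shortened $sample list are made up for illustration, and the arrow function assumes PHP 7.4 or later.

```php
<?php
// Hypothetical helper (not from the original post): builds a single pattern
// from a stop word list and removes only whitespace-delimited matches.
function removeStopwords(string $text, array $stopwords): string
{
    // Trim the trailing spaces, then drop empty entries and duplicates.
    $words = array_unique(array_filter(array_map('trim', $stopwords), 'strlen'));
    if ($words === []) {
        return $text; // Nothing to remove.
    }

    // Escape each word so characters such as "/" cannot break the pattern.
    $alternation = implode('|', array_map(
        fn ($w) => preg_quote($w, '/'),
        $words
    ));

    // (?<!\S) and (?!\S) require whitespace or the string start/end on both
    // sides, so the "A" inside "B.A." is left alone (unlike with \b).
    $pattern = '/(?<!\S)(?:' . $alternation . ')(?!\S)/i';
    $result = preg_replace($pattern, '', $text) ?? $text;

    // Collapse the gaps left behind by the removed words.
    return trim(preg_replace('/\s+/', ' ', $result) ?? $result);
}

// Shortened list for illustration; the full $stopwords array is passed the same way.
$sample = array("a ", "about ", "my ", "from ", "the ", "the ");
echo removeStopwords('I got my "B.A." from...', $sample); // prints: I got "B.A." from...
```

As with the explode() loop, "from..." survives because the punctuation is attached to the word. If that should count as a stop word too, the trailing lookahead would need to be loosened, at the cost of matching inside abbreviations again.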