Extracting text from IRC logs

Time: 2012-11-06 13:10:31

Tags: php regex irc lda

I want to extract the text from IRC logs. I have regular IRC logs from irssi, like this:

00:12 -!- Barbora [post@gw1-nat-041.roburnet.sk] has joined #post.sk
00:12 -!- mirinda [~post@195.91.55.136] has quit [Broken pipe]
00:12 -!- rogue1 [post@86-41-114-24-dynamic.b-ras2.lmk.limerick.eircom.net] has joined #post.sk
00:12 -!- Komunista is now known as Anonym9901
00:13 -!- ajka [~post@78.141.102.209] has quit [Client exited]
00:16 < blackmamba> no fuj
00:16 < blackmamba> Komunista: lol
00:16 < blackmamba> "este trochu"
00:16 < blackmamba> "je taky velky"
00:17 -!- majopo [post@adsl-d192.84-47-63.t-com.sk] has quit [Client exited]
00:19 -!- Anonym9901 is now known as Komunista
00:19 -!- dido84 [post@BSN-143-83-49.dial-up.dsl.siol.net] has quit [Client exited]
00:19 < Komunista> no?
00:20 < Komunista> ja by som*nadavka*l
00:20 < Komunista> ako pes
00:20 -!- Komunista is now known as Anonym53560 

What I need is output like this:

no fuj lol este trochu je taky velky no ja by som*nadavka*l ako pes

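So, just words separated by spaces and nothing else: no nicks, no quotes, no question marks, etc. I need it as input for LDA.

I'll remove the nicks in a post-processing step; I think that will be easier, or not?

I'd prefer PHP and regular expressions, which I'm not good at, and that's why I'm asking for help.

Thanks for your time!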

EDIT:

I'm now using this code (thanks to m.buettner):

$input = ... ;
$smiles = [">:]", ":-)", ":)", ":o)", ":]", ":3", ":c)", ":>", "=]", "8)", "=)", ":}", ":^)", ">:D", ":-D", ":D", "8-D", "x-D", "X-D", "=-D", "=D", "=-3", "8-)", ">:[", ":-(", ":(", ":-c", ":c", ":-<", ":-[", ":[", ":{", ">.>", "<.<", ">.<", ">;]", ";-)", ";)", "*-)", "*)", ";-]", ";]", ";D", ";^)", ">:P", ":-P", ":P", "X-P", "x-p", ":-p", ":p", "=p", ":-Þ", ":Þ", ":-b", ":b", "=p", "=P", ">:o", ">:O", ":-O", ":O", "°o°", "°O°", ":O", "o_O", "o.O", "8-0", ">:\\", ">:/", ":-/", ":-.", ":\\", "=/", "=\\", ":S", ":'(", ";'("];

$input = str_replace($smiles, '', $input);
$resultStr = '';
preg_match_all('/^\d\d:\d\d\s+<[%|\s|@|+][_a-zA-Z0-9]*>\s([^\r\n]*)/m', $input, $matches);
$resultStr = implode(' ', $matches[1]);
$resultStr = preg_replace('/[^\w\s*]+/', '', $resultStr);

preg_match_all('/<[%|\s|@|+][_a-zA-Z0-9]*>/m', $input, $nicks);
$nicks[0] = str_replace(['<', '>', ' ', '%', '+', '$', '@'], '', $nicks[0]);
$resultStr = str_replace($nicks[0], '', $resultStr);

Any suggestions for improving it would be appreciated ;)
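One possible way to do the nick clean-up as a separate pass is to collect the nicks from the `<nick>` prefixes and then drop them only where they appear as whole tokens, so that a short nick is not cut out of the middle of a longer word. This is only a rough sketch: the function names `collectNicks` and `removeNicks` are made up for illustration, the prefix characters (`%`, `@`, `+` or a space) are taken from the regex in the code above, and it assumes the nicks survive punctuation stripping unchanged and that matching is case-sensitive.

```php
<?php
// Hypothetical helpers for illustration only; not from the original post.

// Collect the distinct nicks that appear in "hh:mm <nick>" message prefixes.
function collectNicks(string $input): array
{
    // An optional status character (%, @, + or a space) may precede the nick.
    preg_match_all('/^\d\d:\d\d\s+<[%@+ ]?([^>\s]+)>/m', $input, $m);
    return array_values(array_unique($m[1]));
}

// Remove nicks from already-cleaned text, comparing whole tokens only.
function removeNicks(string $text, array $nicks): string
{
    $tokens = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    $kept = array_filter($tokens, function ($token) use ($nicks) {
        return !in_array($token, $nicks, true);
    });
    return implode(' ', $kept);
}
```

Nicks that only ever show up in "is now known as" lines are not collected by this sketch and would need their own pattern.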

1 Answer:

Answer 0: (score: 1)

Something like this?

preg_match_all('/^\d\d:\d\d\s+<[^>]*>([^\r\n]*)/m', $input, $matches);

$resultStr = implode(' ', $matches[1]);
$resultStr = preg_replace('/[^\w\s*]+/', '', $resultStr);

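First, we match everything after hh:mm < name> up to the end of the line. Then we join those results with spaces, and finally we remove all characters that are not word characters, whitespace, or asterisks. Add any other characters you want to keep to the character class in the preg_replace call.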
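The three steps can be wrapped into one function, sketched below. A few things here are additions rather than part of the answer: the function name `extractMessageText` is hypothetical, the character class uses `\p{L}\p{N}` with the `u` modifier so that accented letters are kept (this assumes the log is valid UTF-8), the whitespace after `>` is skipped outside the capture group, and runs of whitespace are collapsed at the end. Nicks mentioned inside messages are still left in the text.

```php
<?php
// Hypothetical helper for illustration; assumes $input is valid UTF-8.
function extractMessageText(string $input): string
{
    // Capture everything after "hh:mm <nick>" up to the end of each line.
    // [ \t]* (not \s*) keeps an empty message from swallowing the next line.
    preg_match_all('/^\d\d:\d\d\s+<[^>]*>[ \t]*([^\r\n]*)/m', $input, $matches);

    // Join the captured messages with single spaces.
    $text = implode(' ', $matches[1]);

    // Remove everything except letters, digits, underscore, whitespace and '*'.
    $text = preg_replace('/[^\p{L}\p{N}_\s*]+/u', '', $text);

    // Collapse whitespace runs left behind by the removed characters.
    return trim(preg_replace('/\s+/', ' ', $text));
}

$sample = "00:16 < blackmamba> no fuj\n"
        . "00:16 < blackmamba> Komunista: lol\n"
        . "00:17 -!- majopo [post@adsl-d192.84-47-63.t-com.sk] has quit [Client exited]\n"
        . "00:19 < Komunista> no?\n";

echo extractMessageText($sample), "\n"; // no fuj Komunista lol no
```

Join, quit and nick-change lines are skipped because they have `-!-` rather than `<` after the timestamp.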