如何使用PHP从bengali文本中提取关键字

时间:2016-04-24 09:41:46

标签: php keyword extraction

我想使用php从孟加拉语文本文件中自动提取关键字。我有这段代码用于阅读孟加拉语文本文件。

<?php
$target_path =  $_FILES['uploadedfile']['name']; 
header('Content-Type: text/plain;charset=utf-8');
$fp = fopen($target_path, 'r') or die("Can't open CEDICT.");
$i = 0;
while ($line = fgets($fp, 1024)) 
    {
        print $line;
        $i++;
    }
fclose($fp) or die("Can't close file.");

我发现以下代码可以提取最常见的10个关键字,但它不适用于孟加拉语文本。我应该做些什么改变?

    function extractCommonWords($string){
      $stopWords = array('i','a','about','an','and','are','as','at','be','by','com','de','en','for','from','how','in','is','it','la','of','on','or','that','the','this','to','was','what','when','where','who','will','with','und','the','www');

      $string = preg_replace('/\s\s+/i', '', $string); // replace whitespace
      $string = trim($string); // trim the string
      $string = preg_replace('/[^a-zA-Z0-9 -]/', '', $string); // only take alphanumerical characters, but keep the spaces and dashes too…
      $string = strtolower($string); // make it lowercase

      preg_match_all('/\b.*?\b/i', $string, $matchWords);
      $matchWords = $matchWords[0];

      foreach ( $matchWords as $key=>$item ) {
          if ( $item == '' || in_array(strtolower($item), $stopWords) || strlen($item) <= 3 ) {
              unset($matchWords[$key]);
          }
      }   
      $wordCountArr = array();
      if ( is_array($matchWords) ) {
          foreach ( $matchWords as $key => $val ) {
              $val = strtolower($val);
              if ( isset($wordCountArr[$val]) ) {
                  $wordCountArr[$val]++;
              } else {
                  $wordCountArr[$val] = 1;
              }
          }
      }
      arsort($wordCountArr);
      $wordCountArr = array_slice($wordCountArr, 0, 10);
      return $wordCountArr;
}

请帮助:(

1 个答案:

答案 0 :(得分:0)

您应该进行简单的更改:

  • 用适当的孟加拉语停用词代替$stopWords数组中的停用词
  • 删除此字符串$string = preg_replace('/[^a-zA-Z0-9 -]/', '', $string);,因为孟加拉语sybmols与此模式不匹配

完整代码如下:

<?php

function extractCommonWords($string){
    // replace array below with proper Bengali stopwords
    $stopWords = array('i','a','about','an','and','are','as','at','be','by','com','de','en','for','from','how','in','is','it','la','of','on','or','that','the','this','to','was','what','when','where','who','will','with','und','the','www');

    $string = preg_replace('/\s\s+/i', '', $string); // replace whitespace
    $string = trim($string); // trim the string
    // remove this preg_replace because Bengali sybmols doesn't match this pattern
    // $string = preg_replace('/[^a-zA-Z0-9 -]/', '', $string); // only take alphanumerical characters, but keep the spaces and dashes too…
    $string = strtolower($string); // make it lowercase

    preg_match_all('/\s.*?\s/i', $string, $matchWords);
    $matchWords = $matchWords[0];

    foreach ( $matchWords as $key=>$item ) {
        if ( $item == '' || in_array(strtolower(trim($item)), $stopWords) || strlen($item) <= 3 ) {
            unset($matchWords[$key]);
        }
    }
    $wordCountArr = array();
    if ( is_array($matchWords) ) {
        foreach ( $matchWords as $key => $val ) {
            $val = trim(strtolower($val));
            if ( isset($wordCountArr[$val]) ) {
                $wordCountArr[$val]++;
            } else {
                $wordCountArr[$val] = 1;
            }
        }
    }
    arsort($wordCountArr);
    $wordCountArr = array_slice($wordCountArr, 0, 10);
    return $wordCountArr;
}

$string = <<<EOF
টিপ বোঝে না, টোপ বোঝে না টিপ বোঝে না, কেমন বাপু লোক
EOF;
var_dump(extractCommonWords($string), $string);

输出将是:

array(4) {
  ["বোঝে"]=>
  int(2)
  ["টোপ"]=>
  int(1)
  ["না"]=>
  int(1)
  ["কেমন"]=>
  int(1)
}
string(127) "টিপ বোঝে না, টোপ বোঝে না টিপ বোঝে না, কেমন বাপু লোক"