Programmatically determining the difference between a comma-separated list and a paragraph

Time: 2011-06-16 20:31:58

Tags: string csv query-string user-input

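I'm doing a data migration. On the old system, users were allowed to enter their interests in a large text field with no formatting instructions to follow at all. As a result, some people wrote in a bio (free-form prose) format, and others wrote a comma-separated list. There are a few other formats too, but these are the main ones.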

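Now, I know how to identify a comma-separated list (CSL); that part is easy. But how do I determine whether a string is a CSL (possibly a short one with only two terms or phrases) or just a paragraph someone wrote that happens to contain commas?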

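One idea I had was to automatically ignore strings that contain sentence punctuation such as periods, as well as strings that contain no commas. However, I'm concerned that this won't be enough, or that it leaves a lot to be desired. So I'd like to ask the community what you think. In the meantime, I'll try out my idea.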
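The pre-filter idea above can be sketched roughly as follows. This is a minimal illustration rather than the final algorithm: the function name `is_list_candidate` is hypothetical, and it assumes that "punctuation" means sentence-ending marks (`.`, `!`, `?`).

```php
<?php
// Hypothetical helper: a coarse pre-filter, not the asker's final algorithm.
function is_list_candidate(string $text): bool
{
    // Assumption: sentence-ending punctuation (. ! ?) signals prose.
    if (preg_match('/[.!?]/', $text) === 1) {
        return false;
    }

    // Without a comma there is nothing to split on.
    return strpos($text, ',') !== false;
}
```

A filter like this is deliberately strict: it will reject lists that contain abbreviations such as "Jr." and accept short prose fragments that happen to lack a period, which is why it is probably not sufficient on its own.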

Update: OK, I have my algorithm. It's below...

My code:


//Process our interests text field and get the list of interests
function process_interests($interests)
{
  $interest_list = array();

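  //NOTE: the original lines between the period check and the ratio calculation were lost.
  //The delimiter detection and counting below are an assumed reconstruction, not the author's exact code.
  if (preg_match('/(\.)/', $interests) < 1) //Assumed: text containing a period is treated as prose
  {
    $delimiter = (strpos($interests, ',') !== false) ? ',' : ''; //Assumed: comma is the only delimiter
    $delimiter_cnt = empty($delimiter) ? 0 : substr_count($interests, $delimiter);
    $word_cnt = str_word_count($interests);
    $ratio = 0; //Default so the check below never reads an undefined variable

    if ($delimiter_cnt > 0 && $word_cnt > 0)
      $ratio = $delimiter_cnt / $word_cnt;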

    //If a delimiter is found with the right ratio then we can go forward with this.
    //There should be no more than 5 words per delimiter (ratio = delimiters / words, so it must be at least 0.2)
    if (!empty($delimiter) && $ratio >= 0.2)
    {
      //Check for label with colon after it
      $interests = remove_colon($interests);

      //Now we make our array
      $interests = explode($delimiter, $interests);

      foreach ($interests AS $val)
      {
        $val = humanize($val);

        if (!empty($val))
          $interest_list[] = $val;
      }
    }
  }

  return $interest_list;
}

//Cleans up strings a bit
function humanize($str)
{
  if (empty($str))
    return ''; //Lets not waste processing power on empty strings

  $str = remove_colon($str); //We do this one more time for inline labels too.
  $str = trim($str); //Remove unused bits
  $str = ltrim($str, ' -'); //Remove leading dashes
  $str = preg_replace('/ {2,}/', ' ', $str); //Collapse any run of spaces into a single space
  $str = str_replace(array(".", "(", ")", "\t"), '', $str); //Replace some unwanted junk

  if (preg_match('/^and\b/i', $str))
    $str = substr($str, 3); //Remove leading "and" from term (whole word only, so "Android" is kept)

  $str = ucwords(preg_replace('/[_]+/', ' ', strtolower(trim($str))));

  return $str;
}

//Check for label with colon after it and remove the label
function remove_colon($str)
{
  //Check for label with colon after it
  if (strstr($str, ':'))
  {
    $str = explode(':', $str); //If we find it we must remove it
    unset($str[0]); //To remove it we just explode it and take everything to the right of it.
    $str = trim(implode(':', $str)); //Sometimes colons are still used elsewhere, I am going to allow this
  }

  return $str;
}

Thank you for all your help and suggestions!

1 Answer:

Answer 0: (score: 1)

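In addition to the filtering you mentioned, you could compute the ratio of the number of commas to the length of the string. In a CSL this ratio tends to be high; in a paragraph it tends to be low. You could set some threshold and classify entries by whether their ratio is high enough. Entries whose ratio is close to the threshold could be flagged as error-prone and then checked by a moderator.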
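As a rough sketch of this suggestion, the function below returns one of three labels. The name `classify_by_comma_ratio` and the default threshold and margin values are hypothetical, untuned guesses; in practice they would need to be calibrated against a sample of the real data.

```php
<?php
// Hypothetical helper illustrating the comma-to-length ratio idea.
// The default threshold and margin are untuned assumptions.
function classify_by_comma_ratio(string $text, float $threshold = 0.04, float $margin = 0.01): string
{
    // strlen() counts bytes; that is close enough for mostly ASCII input.
    $length = strlen(trim($text));
    if ($length === 0) {
        return 'paragraph'; // Nothing to split, so never treat it as a list
    }

    $ratio = substr_count($text, ',') / $length;

    // Entries near the threshold are error-prone: flag them for a human.
    if (abs($ratio - $threshold) <= $margin) {
        return 'review';
    }

    return $ratio > $threshold ? 'list' : 'paragraph';
}
```

Returning a separate `'review'` label instead of a boolean keeps the borderline cases out of the automated path, so a moderator can check only those entries.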