Question

I have a form that takes the following inputs from the user:

Blog title
Blog Description
Permalink to access blog

I convert the blog title to lower case, replace white space with a dash (-), and store the result as the permalink used to access the blog.
Below is the code that handles this operation:

setlocale(LC_ALL, 'en_US.UTF8');

function toAscii($str, $replace=array(), $delimiter='-') {
    if (!empty($replace)) {
        $str = str_replace((array)$replace, ' ', $str);
    }
    $clean = iconv('UTF-8', 'ASCII//TRANSLIT', $str);
    $clean = preg_replace("/[^a-zA-Z0-9\/_|+ -]/", '', $clean);
    $clean = strtolower(trim($clean, '-'));
    $clean = preg_replace("/[\/_|+ -]+/", $delimiter, $clean);
    return $clean;
}

$prmlkn = toAscii($blog_headline, $replace=array(), $delimiter='-');

This code works fine as long as the blog headline is in English. But if the user types it in Hindi, I get only - as the permalink, which means the Hindi POST values are not being recognized.

Answer 1

This happens because Hindi is written in the Devanagari script, whose characters lie outside ASCII. You are converting to ASCII, which only covers basic Latin letters, digits, and punctuation, so none of the Hindi text survives:

$str = "नमस्ते";
$clean = iconv('UTF-8', 'ASCII//TRANSLIT', $str); // no Hindi characters survive; after the preg_replace() whitelist, $clean is an empty string ""
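To see where the Devanagari characters are lost, a small check along these lines may help. It is only a sketch: the helper name stripToAsciiWhitelist is hypothetical, and it assumes the source file is saved as UTF-8. It applies just the whitelist regex from the question's toAscii(), which already removes every non-ASCII byte regardless of what iconv() returned.

```php
<?php
// Hypothetical helper: applies only the whitelist step from the question's toAscii().
function stripToAsciiWhitelist(string $str): string
{
    // Without the /u modifier the pattern works byte by byte, so every byte of a
    // multibyte UTF-8 character falls outside the allowed set and is removed.
    return (string) preg_replace('/[^a-zA-Z0-9\/_|+ -]/', '', $str);
}

var_dump(stripToAsciiWhitelist("नमस्ते"));       // string(0) ""
var_dump(stripToAsciiWhitelist("नमस्ते hello")); // string(6) " hello"
```

So even if the conversion step were skipped, this whitelist would still leave a pure-Hindi title empty.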

According to RFC 3986:

Characters

...

The ABNF notation defines its terminal values to be non-negative
integers (codepoints) based on the US-ASCII coded character set
[ASCII]. Because a URI is a sequence of characters, we must invert
that relation in order to understand the URI syntax. Therefore, the
integer values used by the ABNF must be mapped back to their
corresponding characters via US-ASCII in order to complete the
syntax rules.

A URI is composed from a limited set of characters consisting of
digits, letters, and a few graphic symbols. A reserved subset of
those characters may be used to delimit syntax components within a
URI while the remaining characters, including both the unreserved
set and those reserved characters not acting as delimiters, define
each component's identifying data.

You might be better off using urlencode(), but note that this can produce a long and ugly permalink:

$str = "नमस्ते hello";
$clean = urlencode($str);
printf("%s",$clean);

This results in a valid but ugly permalink:

%E0%A4%A8%E0%A4%AE%E0%A4%B8%E0%A5%8D%E0%A4%A4%E0%A5%87+hello
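If the goal is a slug that keeps the Hindi text readable in the database and is only percent-encoded when it is placed in a URL, something like the following might work. This is a sketch, not the method from the question: the function name unicodeSlug is hypothetical, and it assumes the mbstring extension and PCRE with Unicode support are available and that the input is valid UTF-8.

```php
<?php
// Hypothetical helper: builds a slug that keeps non-Latin letters such as Devanagari.
function unicodeSlug(string $title, string $delimiter = '-'): string
{
    // Multibyte-aware lowercasing; Devanagari has no case, so it passes through unchanged.
    $slug = mb_strtolower($title, 'UTF-8');
    // Collapse every run of characters that are not letters (\p{L}),
    // combining marks (\p{M}, needed for Hindi vowel signs) or digits (\p{N}).
    // preg_replace() returns null on invalid UTF-8, so fall back to an empty string.
    $slug = preg_replace('/[^\p{L}\p{M}\p{N}]+/u', $delimiter, $slug) ?? '';
    // Drop delimiters left over at either end.
    return trim($slug, $delimiter);
}

$slug = unicodeSlug("नमस्ते Hello World");
echo $slug, "\n";               // नमस्ते-hello-world
echo rawurlencode($slug), "\n"; // percent-encoded form for use in a URL path
```

Storing the readable slug and encoding it with rawurlencode() only when building the link keeps the stored value short, while the URL itself still consists solely of ASCII characters as RFC 3986 requires.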

