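I have to write a small program that removes the accents from a string given as input. I also have to create a function that replaces a single accented character with the corresponding unaccented character, and my main has a loop that calls this function for each character: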
char func(char c)
{
    string acc = "èé";
    string norm = "ee";
    char ret = c;
    for (int i = 0; i < acc.size(); i++)
    {
        if (c == acc[i])
            ret = acc[i]; // presumably norm[i] was intended here
    }
    return ret;
}
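The problem is that if I give the string "é" as input in main, it is treated as a string of size 2 (see the example below), so the function above is called twice instead of once. In addition, the char passed to the function is not the correct one. I suspect I have the same size problem inside my function. Shouldn't this accented letter be treated as a single character? (I am using UTF-8.)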
string s = "e";
cout << "size:" << s.size() << endl;
s = "è";
cout << "size:" << s.size() << endl;
OUTPUT
size:1
size:2
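I have solved the problem using the wchar_t and wstring types, but I need to insert this function into a more complex program, and I would prefer to avoid changing all of the code to handle wstring.
Do I need to change the file encoding? The current one is: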
text/x-c; charset=utf-8
Is it possible to write such a function using plain string and char?
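Answer 0 (score: 4)
You should not try to do this yourself with a simple loop, especially if the code is security-sensitive. There is often more than one way to represent the same character in Unicode, so you may have a single code point or you may have two. For example: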
const wchar_t text1[] = { L'e', 0x0301, 0 };
const wchar_t text2[] = { 0x0e9, 0 };
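These two strings look the same when printed (both display é), but they are clearly not identical, and a simple == check will fail. You should normalize the strings before searching, or use existing functions that do this for you automatically.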
For this purpose Windows provides NormalizeString and FindStringOrdinal, while ICU provides unorm_compare or usearch_first.
// Assumes <windows.h>, <iostream>, <memory>, <stdexcept>, and ICU's
// <unicode/unorm.h> and <unicode/usearch.h> are included.
const wchar_t text1[] = { L'e', 0x0301, 0 };
const wchar_t text2[] = { 0x0e9, 0 };

// Using Windows APIs, try to normalize the string first
int size = NormalizeString(NormalizationKC, text1, -1, nullptr, 0);
if (size == 0)
    throw std::runtime_error("Can't normalize");
auto text3 = std::make_unique<wchar_t[]>(size);
NormalizeString(NormalizationKC, text1, -1, text3.get(), size);

// Print out the three strings - they all look the same
std::wcout << text1 << std::endl;
std::wcout << text2 << std::endl;
std::wcout << text3.get() << std::endl;

// Verify if they are (or are not) equal
if (CompareStringOrdinal(text1, -1, text2, -1, FALSE) == CSTR_EQUAL)
    std::wcout << L"Original strings are equivalent\r\n";
else
    std::wcout << L"Original strings are not equivalent\r\n";
if (CompareStringOrdinal(text3.get(), -1, text2, -1, FALSE) == CSTR_EQUAL)
    std::wcout << L"Normalized strings are equivalent\r\n";
else
    std::wcout << L"Normalized strings are not equivalent\r\n";

// Verify if the string text2 can be found
if (FindStringOrdinal(FIND_FROMSTART, text1, -1, text2, -1, TRUE) != -1)
    std::wcout << L"Original string contains the searched-for string\r\n";
else
    std::wcout << L"Original string does not contain the searched-for string\r\n";
if (FindStringOrdinal(FIND_FROMSTART, text3.get(), -1, text2, -1, TRUE) != -1)
    std::wcout << L"Normalized string contains the searched-for string\r\n";
else
    std::wcout << L"Normalized string does not contain the searched-for string\r\n";

// Using ICU APIs, try to compare the normalized strings in one go
// (You can also manually normalize, like Windows, if you want to keep the
// normalized form around)
UErrorCode error{ U_ZERO_ERROR };
auto result = unorm_compare(reinterpret_cast<const UChar*>(text1), -1,
    reinterpret_cast<const UChar*>(text2), -1, 0, &error);
if (!U_SUCCESS(error))
    throw std::runtime_error("Can't normalize");
if (result == 0)
    std::wcout << L"[ICU] Normalized strings are equivalent\r\n";
else
    std::wcout << L"[ICU] Normalized strings are NOT equivalent\r\n";

// Try searching; ICU handles the equivalency of (non-)normalized
// characters automatically.
auto search = usearch_open(reinterpret_cast<const UChar*>(text2), -1,
    reinterpret_cast<const UChar*>(text1), -1, "", nullptr, &error);
if (!U_SUCCESS(error))
    throw std::runtime_error("Can't open search");
auto index = usearch_first(search, &error);
if (!U_SUCCESS(error))
    throw std::runtime_error("Can't search");
if (index != USEARCH_DONE)
    std::wcout << L"[ICU] Original string contains the searched-for string\r\n";
else
    std::wcout << L"[ICU] Original string does not contain the searched-for string\r\n";
usearch_close(search);
This produces the following output:
é
é
é
Original strings are not equivalent
Normalized strings are equivalent
Original string does not contain the searched-for string
Normalized string contains the searched-for string
[ICU] Normalized strings are equivalent
[ICU] Original string contains the searched-for string
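For the narrower task in the question, a table-driven pass over the UTF-8 bytes can work with plain std::string. The following is only a minimal sketch, not a substitute for proper normalization: strip_accents and utf8_length are hypothetical helper names, the table covers just a handful of letters, and the input is assumed to be valid UTF-8. It copes with both representations of an accented letter by mapping precomposed letters through the table and dropping combining marks in the range U+0300 to U+036F.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical helper: length in bytes of the UTF-8 sequence that starts
// with lead byte `b` (1 for ASCII and for any unexpected byte).
std::size_t utf8_length(unsigned char b)
{
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    if ((b & 0xF8) == 0xF0) return 4;
    return 1;
}

// Hypothetical helper: removes accents from a UTF-8 encoded std::string.
// Only the letters listed in the table are handled; extend it as needed.
std::string strip_accents(const std::string& in)
{
    // Precomposed letters, spelled as explicit UTF-8 bytes so the result
    // does not depend on the encoding of the source file.
    static const std::unordered_map<std::string, std::string> table = {
        { "\xC3\xA8", "e" }, // U+00E8
        { "\xC3\xA9", "e" }, // U+00E9
        { "\xC3\xA0", "a" }, // U+00E0
        { "\xC3\xB9", "u" }, // U+00F9
    };

    std::string out;
    for (std::size_t i = 0; i < in.size(); )
    {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        std::size_t len = utf8_length(b0);
        if (i + len > in.size())
            len = in.size() - i; // truncated sequence: copy what is left

        // Combining marks U+0300..U+036F encode as 0xCC 0x80..0xBF and
        // 0xCD 0x80..0xAF; skip them so "e" + U+0301 becomes plain "e".
        if (len == 2)
        {
            const unsigned char b1 = static_cast<unsigned char>(in[i + 1]);
            if ((b0 == 0xCC && b1 >= 0x80 && b1 <= 0xBF) ||
                (b0 == 0xCD && b1 >= 0x80 && b1 <= 0xAF))
            {
                i += len;
                continue;
            }
        }

        // Replace the sequence if it is in the table, otherwise keep it.
        const std::string ch = in.substr(i, len);
        const auto it = table.find(ch);
        out += (it != table.end()) ? it->second : ch;
        i += len;
    }
    return out;
}
```

Called on either representation of é, strip_accents returns "e"; sequences missing from the table pass through unchanged, so coverage depends entirely on how complete the table is.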
Answer 1 (score: 1)
Store the character in a wchar_t, like this:
wchar_t text = L'é';
You can also store special characters in a wstring:
wstring text = L"étoile";
If you still need to compare potentially special characters held in a wchar_t (or wstring) with a char (or string), this thread explains well how to do it.
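The linked thread is not reproduced here. As one possible approach for comparing against text held in a std::string, assuming that string is UTF-8, the bytes can be decoded into code points first so that é becomes a single value. A rough sketch follows; decode_utf8 is a hypothetical helper, and it performs no validation of malformed input.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decodes a UTF-8 std::string into code points.
// Assumes well-formed input; unexpected bytes are passed through as-is.
std::u32string decode_utf8(const std::string& in)
{
    std::u32string out;
    for (std::size_t i = 0; i < in.size(); )
    {
        const unsigned char b = static_cast<unsigned char>(in[i]);
        char32_t cp = b;       // ASCII (or unexpected byte): use as-is
        std::size_t extra = 0; // number of continuation bytes to read

        if ((b & 0xE0) == 0xC0)      { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        ++i;

        // Each continuation byte contributes its low six bits.
        for (; extra > 0 && i < in.size(); --extra, ++i)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i]) & 0x3F);

        out.push_back(cp);
    }
    return out;
}
```

Each element of the result is one code point, so the two-byte UTF-8 form of é decodes to the single value 0xE9. For characters in the Basic Multilingual Plane, that value can be compared with a wide character holding the same code point.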