C++ & Boost: encode/decode UTF-8

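I'm trying to do a very simple task: take a Unicode-aware wstring and convert it to a string encoded as UTF-8 bytes, and then the reverse: take a string containing UTF-8 bytes and convert it to a Unicode-aware wstring.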

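The problem is that I need it to be cross-platform and I need it to work with Boost... and I can't seem to figure out a way to make it work. I've been toying with some sample code, trying to convert it to use stringstream / wstringstream instead of files, but nothing seems to work.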

For example, in Python it would look like this:

>>> u"שלום"
u'\u05e9\u05dc\u05d5\u05dd'
>>> u"שלום".encode("utf8")
'\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d'
>>> '\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d'.decode("utf8")
u'\u05e9\u05dc\u05d5\u05dd'

What I'm ultimately after is this:

wchar_t uchars[] = {0x5e9, 0x5dc, 0x5d5, 0x5dd, 0};
wstring ws(uchars);
string s = encode_utf8(ws); 
// s now holds "\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d"
wstring ws2 = decode_utf8(s);
// ws2 now holds {0x5e9, 0x5dc, 0x5d5, 0x5dd}

I really don't want to add another dependency on ICU or something similar... and as far as I understand, this should be possible with Boost.

Some sample code would be greatly appreciated! Thanks.

4 Answers

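Thanks everyone, but ultimately I went with http://utfcpp.sourceforge.net/, which is very lightweight and easy to use. I'm sharing some demo code here in case anyone finds it useful: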

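#include <iterator>   // std::back_inserter
#include <string>
#include "utf8.h"     // utfcpp

// Decode UTF-8 bytes into a wide string (assumes a 32-bit wchar_t, as on Linux).
inline void decode_utf8(const std::string& bytes, std::wstring& wstr)
{
    utf8::utf8to32(bytes.begin(), bytes.end(), std::back_inserter(wstr));
}

// Encode a wide string as UTF-8 bytes; with a 16-bit wchar_t (Windows), utf8::utf16to8 is the matching call.
inline void encode_utf8(const std::wstring& wstr, std::string& bytes)
{
    utf8::utf32to8(wstr.begin(), wstr.end(), std::back_inserter(bytes));
}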

Usage:

wstring ws(L"\u05e9\u05dc\u05d5\u05dd");
string s;
encode_utf8(ws, s);

Answered on 2024-05-05T20:43:41+08:00

There's already a Boost link in the comments, but in the almost-standard C++0x there is wstring_convert, which does this:

#include <iostream>
#include <string>
#include <locale>
#include <codecvt>
int main()
{
    wchar_t uchars[] = {0x5e9, 0x5dc, 0x5d5, 0x5dd, 0};
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    std::string s = conv.to_bytes(uchars);
    std::wstring ws2 = conv.from_bytes(s);
    std::cout << std::boolalpha
              << (s == "\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d" ) << '\n'
              << (ws2 == uchars ) << '\n';
}

Output when compiled with MS Visual Studio 2010 EE SP1 or with Clang 2.9:

true 
true
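
The same converter can be wrapped in the two functions the question asks for. This is only a sketch: the names encode_utf8 and decode_utf8 are taken from the question and are not part of the standard library, and it assumes a toolchain that still ships <codecvt> (the header and std::wstring_convert have been deprecated since C++17). Where wchar_t is 16 bits wide, as on Windows, std::codecvt_utf8<wchar_t> only covers the UCS-2 range.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helpers named after the functions in the question.
// Both throw std::range_error if the input cannot be converted.

// Encode a wide string as UTF-8 bytes.
inline std::string encode_utf8(const std::wstring& ws)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    return conv.to_bytes(ws);
}

// Decode UTF-8 bytes into a wide string.
inline std::wstring decode_utf8(const std::string& s)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    return conv.from_bytes(s);
}
```

With the string from the question, encode_utf8(ws) should give the bytes "\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d", and decode_utf8 on those bytes should give back the original four code points.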

Answered on 2024-05-05T20:43:41+08:00

Boost.Locale was released in Boost 1.48 (November 15th, 2011), making it easier to convert to and from UTF-8/16.

Here are some convenient examples from the docs:
```
string utf8_string = to_utf<char>(latin1_string,"Latin1");
wstring wide_string = to_utf<wchar_t>(latin1_string,"Latin1");
string latin1_string = from_utf(wide_string,"Latin1");
string utf8_string2 = utf_to_utf<char>(wide_string);
```
Almost as easy as Python encoding/decoding :)

Please note that Boost.Locale is not a header-only library.
Answered on 2024-05-05T20:43:41+08:00

For a drop-in replacement for std::string / std::wstring that handles UTF-8, see TINYUTF8.

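In combination with <codecvt>, you can convert from/to pretty much every encoding to/from UTF-8, which you then handle through the above library.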
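
As one instance of the <codecvt> step, here is a rough sketch that converts between UTF-16 and UTF-8 using only the standard library. The function names utf16_to_utf8 and utf8_to_utf16 are made up for illustration and belong to neither <codecvt> nor TINYUTF8; the sketch also assumes the deprecated (since C++17) <codecvt> header is still available in your toolchain.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper names; not part of <codecvt> or TINYUTF8.
// Both throw std::range_error on input that cannot be converted.

// Convert UTF-16 code units (including surrogate pairs) to UTF-8 bytes.
inline std::string utf16_to_utf8(const std::u16string& in)
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.to_bytes(in);
}

// Convert UTF-8 bytes back to UTF-16 code units.
inline std::u16string utf8_to_utf16(const std::string& in)
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.from_bytes(in);
}
```

The resulting UTF-8 std::string can then be handed to whichever UTF-8 string class you use.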

Answered on 2024-05-05T20:43:41+08:00
