Printing a Unicode char from a string to the terminal in C++

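When I print a string that contains Unicode characters to the terminal, the characters display correctly. But when I try to isolate a single Unicode char into its own string and print that, it prints as "?". How can I extract a Unicode char from a string and put it into a new string without losing its Unicode encoding?

text is a global std::string.

This is how I pull the Unicode char out: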

stringstream ss;
string ret = "";
ss << text[index];
ss >> ret;

Also, I can't use wchar, wstring, or any of the Unicode-related std libraries.

2 Answers

```
ss << text[index];
```
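My guess is that text is a C string or something else that actually uses bytes (called char in C and C++) as its storage. So your [] indexing operation doesn't give you the whole Unicode code point, only one byte of it.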
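
To make this concrete, here is a minimal sketch, assuming text holds UTF-8 (the question doesn't say which encoding is in use). The helper name byte_as_hex is made up for illustration; it only formats the one byte that text[index] returns.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: format the single byte at text[index] as two hex digits.
// Assumes index < text.size().
std::string byte_as_hex(std::string const& text, std::size_t index)
{
    char buf[3];
    std::snprintf(buf, sizeof buf, "%02X",
                  static_cast<unsigned>(static_cast<unsigned char>(text[index])));
    return buf;
}
```

With std::string text = "\xC3\xB1" (the two UTF-8 bytes of ñ, U+00F1), text.size() is 2, byte_as_hex(text, 0) gives "C3" and byte_as_hex(text, 1) gives "B1". Neither byte is a valid UTF-8 sequence on its own, which is why a terminal will typically render it as "?" or a replacement character.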

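Edit: you've since added

"I can't use ... any of the Unicode-related std libraries"

That's a nonsensical requirement. It means you have to reimplement Unicode functionality yourself, which is a) a huge undertaking and b) a source of bugs. So, for the sake of doing things properly: you're already using std::stringstream, so you might as well use wide characters and the like.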
Answered 2024-04-28T11:19:47+08:00

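Assuming you are using UTF-8, the problem is that a single UTF-8 character can occupy 1 to 4 bytes (up to 6 in the original design, before RFC 3629 limited it to 4).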
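
As a rough illustration of where the 1 to 4 byte range comes from, the sketch below maps a code point to its UTF-8 length using the ranges from RFC 3629. The function name utf8_encoded_length is an assumption made for this example, not something from a library.

```cpp
#include <cstddef>

// Number of bytes UTF-8 uses for a code point, per RFC 3629 (at most 4).
// Returns 0 for values that are not encodable (surrogates, out of range).
std::size_t utf8_encoded_length(char32_t cp)
{
    if(cp <= 0x7F)                   return 1; // ASCII
    if(cp <= 0x7FF)                  return 2; // e.g. U+00F1, U+03BA
    if(cp >= 0xD800 && cp <= 0xDFFF) return 0; // UTF-16 surrogates
    if(cp <= 0xFFFF)                 return 3; // e.g. U+5EFA
    if(cp <= 0x10FFFF)               return 4; // e.g. U+1F600
    return 0;                                  // beyond the Unicode range
}
```

For instance, utf8_encoded_length(0x5EFA) is 3, so indexing a std::string at any one of those three positions returns only a fragment of the character.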

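To iterate over them, you need to work out the size of each character. The following code uses a simple lookup table indexed by the lead byte, but you could also get creative with bit operations: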

#include <string>
#include <vector>
#include <iostream>

// return individual utf-8 chars as a vector of strings
std::vector<std::string> utf8_split_chars(std::string const& s)
{
    // table to get the size of a utf-8 character
    static const char u8char_size[] =
    {
          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
        , 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
        , 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
        , 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
        , 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
        , 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
        , 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
        , 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
        , 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
        , 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
        , 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
        , 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
        , 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2
        , 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2
        , 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3
        , 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 0, 0
    };

    std::vector<std::string> utf8_chars;

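    // advance the index i by the size of each utf-8 char
    for(std::size_t i = 0; i < s.size();)
    {
        std::size_t n = u8char_size[(unsigned char)s[i]];

        // stray continuation bytes and invalid lead bytes have size 0 in
        // the table: treat them as one byte so the loop always advances
        if(n == 0)
            n = 1;

        // substr clamps the length, so a truncated final sequence
        // cannot read past the end of the string
        utf8_chars.push_back(s.substr(i, n));
        i += n;
    }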

    return utf8_chars;
}

int main()
{
    // note: u8"" literals are char8_t in C++20, so build this as C++11/14/17
    std::string s = u8"建造 otoño κάτω";

    std::cout << "s: " << s <<" " << s.size() << " bytes" << '\n';

    auto chars = utf8_split_chars(s);

    for(auto const& c: chars)
        std::cout << "c: " << c << '\n';
}

Output:

s: 建造 otoño κάτω 22 bytes
c: 建
c: 造
c:  
c: o
c: t
c: o
c: ñ
c: o
c:  
c: κ
c: ά
c: τ
c: ω
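
The bit-operation variant mentioned above could look roughly like the sketch below, which also covers the original question of pulling out the one character at a given index. Both function names are hypothetical, and the code assumes well-formed UTF-8: it does not reject overlong or otherwise invalid sequences.

```cpp
#include <cstddef>
#include <string>

// Length of a UTF-8 sequence judged from its lead byte, using bit tests
// instead of a table. Returns 0 for continuation or invalid bytes.
std::size_t utf8_char_size(unsigned char lead)
{
    if((lead & 0x80) == 0x00) return 1; // 0xxxxxxx
    if((lead & 0xE0) == 0xC0) return 2; // 110xxxxx
    if((lead & 0xF0) == 0xE0) return 3; // 1110xxxx
    if((lead & 0xF8) == 0xF0) return 4; // 11110xxx
    return 0;
}

// Hypothetical helper: return the whole UTF-8 character that contains the
// byte at `index`, or an empty string if index is out of range.
std::string utf8_char_at(std::string const& text, std::size_t index)
{
    if(index >= text.size())
        return std::string();

    // step back over continuation bytes (10xxxxxx) to reach the lead byte
    while(index > 0 && (static_cast<unsigned char>(text[index]) & 0xC0) == 0x80)
        --index;

    std::size_t n = utf8_char_size(static_cast<unsigned char>(text[index]));
    if(n == 0)
        n = 1; // malformed input: fall back to a single byte

    return text.substr(index, n); // substr clamps at the end of the string
}
```

In the question's snippet, ss << text[index] would then become ret = utf8_char_at(text, index), which uses only std::string and no wide-character types. Note that index is still a byte offset, not a character count.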

Answered 2024-04-28T11:19:47+08:00
