首页 文章

为什么在C中拆分字符串比Python慢?

提问于
浏览
85

我正在尝试将一些代码从Python转换为C,以便获得一点速度并提高我生锈的C技能 . 昨天我感到很震惊,因为在Python中从stdin读取行的天真实现比C快得多(参见this) . 今天,我终于弄明白了如何使用合并分隔符在C中拆分字符串(类似于python _846603的教训) .

Python Code:

#!/usr/bin/env python
from __future__ import print_function                                            
import time
import sys

count = 0
start_time = time.time()
dummy = None

for line in sys.stdin:
    dummy = line.split()
    count += 1

delta_sec = int(time.time() - start_time)
print("Python: Saw {0} lines in {1} seconds. ".format(count, delta_sec), end='')
if delta_sec > 0:
    lps = int(count/delta_sec)
    print("  Crunch Speed: {0}".format(lps))
else:
    print('')

C++ Code:

#include <iostream>                                                              
#include <string>
#include <sstream>
#include <time.h>
#include <vector>

using namespace std;

void split1(vector<string> &tokens, const string &str,
        const string &delimiters = " ") {
    // Skip delimiters at beginning
    string::size_type lastPos = str.find_first_not_of(delimiters, 0);

    // Find first non-delimiter
    string::size_type pos = str.find_first_of(delimiters, lastPos);

    while (string::npos != pos || string::npos != lastPos) {
        // Found a token, add it to the vector
        tokens.push_back(str.substr(lastPos, pos - lastPos));
        // Skip delimiters
        lastPos = str.find_first_not_of(delimiters, pos);
        // Find next non-delimiter
        pos = str.find_first_of(delimiters, lastPos);
    }
}

void split2(vector<string> &tokens, const string &str, char delim=' ') {
    stringstream ss(str); //convert string to stream
    string item;
    while(getline(ss, item, delim)) {
        tokens.push_back(item); //add token to vector
    }
}

int main() {
    string input_line;
    vector<string> spline;
    long count = 0;
    int sec, lps;
    time_t start = time(NULL);

    cin.sync_with_stdio(false); //disable synchronous IO

    while(cin) {
        getline(cin, input_line);
        spline.clear(); //empty the vector for the next line to parse

        //I'm trying one of the two implementations, per compilation, obviously:
//        split1(spline, input_line);  
        split2(spline, input_line);

        count++;
    };

    count--; //subtract for final over-read
    sec = (int) time(NULL) - start;
    cerr << "C++   : Saw " << count << " lines in " << sec << " seconds." ;
    if (sec > 0) {
        lps = count / sec;
        cerr << "  Crunch speed: " << lps << endl;
    } else
        cerr << endl;
    return 0;

//compiled with: g++ -Wall -O3 -o split1 split_1.cpp

请注意,我尝试了两种不同的拆分实现 . 一个(split1)使用字符串方法来搜索令牌,并且能够合并多个令牌以及处理多个令牌(它来自here) . 第二个(split2)使用getline将字符串作为流读取,不合并分隔符,并且仅支持单个deimeter字符(一个是由几个StackOverflow用户在字符串拆分问题的答案中发布的) .

我以各种顺序多次运行这个 . 我的测试机是Macbook Pro(2011,8GB,四核),并不重要 . 我正在测试一个带有三个空格分隔列的20M行文本文件,每个列看起来类似于:“foo.bar 127.0.0.1 home.foo.bar”

Results:

$ /usr/bin/time cat test_lines_double | ./split.py
       15.61 real         0.01 user         0.38 sys
Python: Saw 20000000 lines in 15 seconds.   Crunch Speed: 1333333
$ /usr/bin/time cat test_lines_double | ./split1
       23.50 real         0.01 user         0.46 sys
C++   : Saw 20000000 lines in 23 seconds.  Crunch speed: 869565
$ /usr/bin/time cat test_lines_double | ./split2
       44.69 real         0.02 user         0.62 sys
C++   : Saw 20000000 lines in 45 seconds.  Crunch speed: 444444

我究竟做错了什么?有没有更好的方法在C中进行字符串拆分,不依赖于外部库(即没有提升),支持合并分隔符序列(如python的拆分),是线程安全的(所以没有strtok),并且其性能至少是与python相提并论?

Edit 1 / Partial Solution?:

我尝试通过让python重置虚拟列表并每次附加到它来进行更公平的比较,就像C一样 . 这仍然不是C代码正在做的事情,但它更接近 . 基本上,循环现在是:

for line in sys.stdin:
    dummy = []
    dummy += line.split()
    count += 1

python的性能现在与split1 C实现大致相同 .

/usr/bin/time cat test_lines_double | ./split5.py
       22.61 real         0.01 user         0.40 sys
Python: Saw 20000000 lines in 22 seconds.   Crunch Speed: 909090

我仍然感到惊讶的是,即使Python如此优化字符串处理(如Matt Joiner建议),这些C实现也不会更快 . 如果有人想知道如何使用C以更优化的方式执行此操作,请分享您的代码 . (我认为我的下一步将尝试在纯C中实现这一点,虽然我不打算用程序员的 生产环境 力来重新实现我在C中的整个项目,所以这只是一个字符串拆分速度的实验 . )

感谢大家的帮助 .

Final Edit/Solution:

请参阅Alf的接受答案 . 由于python严格通过引用处理字符串,并且经常复制STL字符串,因此使用vanilla python实现时性能会更好 . 为了比较,我通过Alf的代码编译和运行我的数据,这是与所有其他运行在同一台机器上的性能,基本上与天真的python实现相同(虽然比重置/附加列表的python实现更快,如如上所述编辑):

$ /usr/bin/time cat test_lines_double | ./split6
       15.09 real         0.01 user         0.45 sys
C++   : Saw 20000000 lines in 15 seconds.  Crunch speed: 1333333

我唯一剩下的小抱怨是关于在这种情况下让C执行所需的代码量 .

这个问题的一个教训和昨天的stdin读行问题(上面链接的)是一个应该总是基准而不是对语言的相对“默认”性能做出天真的假设 . 我很欣赏这种教育 .

再次感谢所有人的建议!

8 回答

  • 2

    作为猜测,Python字符串是引用计数的不可变字符串,因此在Python代码中不会复制任何字符串,而C std::string 是可变值类型,并以最小的机会复制 .

    如果目标是快速拆分,则可以使用常量时间子串操作,这意味着仅引用原始字符串的部分,如Python(以及Java和C#...) .

    C std::string 类有一个兑换功能:它是标准的,因此它可以用来安全地和可移植地传递字符串,效率不是主要考虑因素 . 但足够的聊天 . 代码 - 在我的机器上,这当然比Python快,因为Python的字符串处理是在C中实现的,C是C的一个子集(他是):

    #include <iostream>                                                              
    #include <string>
    #include <sstream>
    #include <time.h>
    #include <vector>
    
    using namespace std;
    
    class StringRef
    {
    private:
        char const*     begin_;
        int             size_;
    
    public:
        int size() const { return size_; }
        char const* begin() const { return begin_; }
        char const* end() const { return begin_ + size_; }
    
        StringRef( char const* const begin, int const size )
            : begin_( begin )
            , size_( size )
        {}
    };
    
    vector<StringRef> split3( string const& str, char delimiter = ' ' )
    {
        vector<StringRef>   result;
    
        enum State { inSpace, inToken };
    
        State state = inSpace;
        char const*     pTokenBegin = 0;    // Init to satisfy compiler.
        for( auto it = str.begin(); it != str.end(); ++it )
        {
            State const newState = (*it == delimiter? inSpace : inToken);
            if( newState != state )
            {
                switch( newState )
                {
                case inSpace:
                    result.push_back( StringRef( pTokenBegin, &*it - pTokenBegin ) );
                    break;
                case inToken:
                    pTokenBegin = &*it;
                }
            }
            state = newState;
        }
        if( state == inToken )
        {
            result.push_back( StringRef( pTokenBegin, &*str.end() - pTokenBegin ) );
        }
        return result;
    }
    
    int main() {
        string input_line;
        vector<string> spline;
        long count = 0;
        int sec, lps;
        time_t start = time(NULL);
    
        cin.sync_with_stdio(false); //disable synchronous IO
    
        while(cin) {
            getline(cin, input_line);
            //spline.clear(); //empty the vector for the next line to parse
    
            //I'm trying one of the two implementations, per compilation, obviously:
    //        split1(spline, input_line);  
            //split2(spline, input_line);
    
            vector<StringRef> const v = split3( input_line );
            count++;
        };
    
        count--; //subtract for final over-read
        sec = (int) time(NULL) - start;
        cerr << "C++   : Saw " << count << " lines in " << sec << " seconds." ;
        if (sec > 0) {
            lps = count / sec;
            cerr << "  Crunch speed: " << lps << endl;
        } else
            cerr << endl;
        return 0;
    }
    
    //compiled with: g++ -Wall -O3 -o split1 split_1.cpp -std=c++0x
    

    免责声明:我希望没有任何错误 . 我没有测试过功能,但只检查了速度 . 但我认为,即使存在一两个错误,纠正也不会显着影响速度 .

  • 0

    我没有提供任何更好的解决方案(至少在性能方面),但一些额外的数据可能很有趣 .

    使用 strtok_rstrtok 的可重入变体):

    void splitc1(vector<string> &tokens, const string &str,
            const string &delimiters = " ") {
        char *saveptr;
        char *cpy, *token;
    
        cpy = (char*)malloc(str.size() + 1);
        strcpy(cpy, str.c_str());
    
        for(token = strtok_r(cpy, delimiters.c_str(), &saveptr);
            token != NULL;
            token = strtok_r(NULL, delimiters.c_str(), &saveptr)) {
            tokens.push_back(string(token));
        }
    
        free(cpy);
    }
    

    另外使用参数的字符串,输入 fgets

    void splitc2(vector<string> &tokens, const char *str,
            const char *delimiters) {
        char *saveptr;
        char *cpy, *token;
    
        cpy = (char*)malloc(strlen(str) + 1);
        strcpy(cpy, str);
    
        for(token = strtok_r(cpy, delimiters, &saveptr);
            token != NULL;
            token = strtok_r(NULL, delimiters, &saveptr)) {
            tokens.push_back(string(token));
        }
    
        free(cpy);
    }
    

    并且,在某些情况下,可以接受销毁输入字符串:

    void splitc3(vector<string> &tokens, char *str,
            const char *delimiters) {
        char *saveptr;
        char *token;
    
        for(token = strtok_r(str, delimiters, &saveptr);
            token != NULL;
            token = strtok_r(NULL, delimiters, &saveptr)) {
            tokens.push_back(string(token));
        }
    }
    

    这些的时间安排如下(包括我对问题中其他变体的结果和接受的答案):

    split1.cpp:  C++   : Saw 20000000 lines in 31 seconds.  Crunch speed: 645161
    split2.cpp:  C++   : Saw 20000000 lines in 45 seconds.  Crunch speed: 444444
    split.py:    Python: Saw 20000000 lines in 33 seconds.  Crunch Speed: 606060
    split5.py:   Python: Saw 20000000 lines in 35 seconds.  Crunch Speed: 571428
    split6.cpp:  C++   : Saw 20000000 lines in 18 seconds.  Crunch speed: 1111111
    
    splitc1.cpp: C++   : Saw 20000000 lines in 27 seconds.  Crunch speed: 740740
    splitc2.cpp: C++   : Saw 20000000 lines in 22 seconds.  Crunch speed: 909090
    splitc3.cpp: C++   : Saw 20000000 lines in 20 seconds.  Crunch speed: 1000000
    

    我们可以看到,接受答案的解决方案仍然是最快的 .

    对于任何想要进行进一步测试的人,我还提出了一个Github回购,其中包含问题中的所有程序,接受的答案,这个答案以及另外一个Makefile以及生成测试数据的脚本:https://github.com/tobbez/string-splitting .

  • 9

    我怀疑这是因为 std::vector 在push_back()函数调用过程中调整大小的方式 . 如果您尝试使用 std::liststd::vector::reserve() 为句子保留足够的空间,则应该可以获得更好的性能 . 或者您可以将下面两者的组合用于split1():

    void split1(vector<string> &tokens, const string &str,
            const string &delimiters = " ") {
        // Skip delimiters at beginning
        string::size_type lastPos = str.find_first_not_of(delimiters, 0);
    
        // Find first non-delimiter
        string::size_type pos = str.find_first_of(delimiters, lastPos);
        list<string> token_list;
    
        while (string::npos != pos || string::npos != lastPos) {
            // Found a token, add it to the list
            token_list.push_back(str.substr(lastPos, pos - lastPos));
            // Skip delimiters
            lastPos = str.find_first_not_of(delimiters, pos);
            // Find next non-delimiter
            pos = str.find_first_of(delimiters, lastPos);
        }
        tokens.assign(token_list.begin(), token_list.end());
    }
    

    编辑:我看到的另一个显而易见的事情是Python变量 dummy 每次都被分配但没有被修改 . 所以这不是与C的公平比较 . 您应该尝试将Python代码修改为 dummy = [] 以初始化它,然后执行 dummy += line.split() . 你可以在此之后报告运行时吗?

    EDIT2 :为了使它更公平,你可以修改C代码中的while循环:

    while(cin) {
            getline(cin, input_line);
            std::vector<string> spline; // create a new vector
    
            //I'm trying one of the two implementations, per compilation, obviously:
    //        split1(spline, input_line);  
            split2(spline, input_line);
    
            count++;
        };
    
  • 2

    我认为以下代码更好,使用一些C 17和C 14功能:

    // These codes are un-tested when I write this post, but I'll test it
    // When I'm free, and I sincerely welcome others to test and modify this
    // code.
    
    // C++17
    #include <istream>     // For std::istream.
    #include <string_view> // new feature in C++17, sizeof(std::string_view) == 16 in libc++ on my x86-64 debian 9.4 computer.
    #include <string>
    #include <utility>     // C++14 feature std::move.
    
    template <template <class...> class Container, class Allocator>
    void split1(Container<std::string_view, Allocator> &tokens, 
                std::string_view str,
                std::string_view delimiter = " ") 
    {
        /* 
         * The model of the input string:
         *
         * (optional) delimiter | content | delimiter | content | delimiter| 
         * ... | delimiter | content 
         *
         * Using std::string::find_first_not_of or 
         * std::string_view::find_first_not_of is a bad idea, because it 
         * actually does the following thing:
         * 
         *     Finds the first character not equal to any of the characters 
         *     in the given character sequence.
         * 
         * Which means it does not treeat your delimiters as a whole, but as
         * a group of characters.
         * 
         * This has 2 effects:
         *
         *  1. When your delimiters is not a single character, this function
         *  won't behave as you predicted.
         *
         *  2. When your delimiters is just a single character, the function
         *  may have an additional overhead due to the fact that it has to 
         *  check every character with a range of characters, although 
         * there's only one, but in order to assure the correctness, it still 
         * has an inner loop, which adds to the overhead.
         *
         * So, as a solution, I wrote the following code.
         *
         * The code below will skip the first delimiter prefix.
         * However, if there's nothing between 2 delimiter, this code'll 
         * still treat as if there's sth. there.
         *
         * Note: 
         * Here I use C++ std version of substring search algorithm, but u
         * can change it to Boyer-Moore, KMP(takes additional memory), 
         * Rabin-Karp and other algorithm to speed your code.
         * 
         */
    
        // Establish the loop invariant 1.
        typename std::string_view::size_type 
            next, 
            delimiter_size = delimiter.size(),  
            pos = str.find(delimiter) ? 0 : delimiter_size;
    
        // The loop invariant:
        //  1. At pos, it is the content that should be saved.
        //  2. The next pos of delimiter is stored in next, which could be 0
        //  or std::string_view::npos.
    
        do {
            // Find the next delimiter, maintain loop invariant 2.
            next = str.find(delimiter, pos);
    
            // Found a token, add it to the vector
            tokens.push_back(str.substr(pos, next));
    
            // Skip delimiters, maintain the loop invariant 1.
            //
            // @ next is the size of the just pushed token.
            // Because when next == std::string_view::npos, the loop will
            // terminate, so it doesn't matter even if the following 
            // expression have undefined behavior due to the overflow of 
            // argument.
            pos = next + delimiter_size;
        } while(next != std::string_view::npos);
    }   
    
    template <template <class...> class Container, class traits, class Allocator2, class Allocator>
    void split2(Container<std::basic_string<char, traits, Allocator2>, Allocator> &tokens, 
                std::istream &stream,
                char delimiter = ' ')
    {
        std::string<char, traits, Allocator2> item;
    
        // Unfortunately, std::getline can only accept a single-character 
        // delimiter.
        while(std::getline(stream, item, delimiter))
            // Move item into token. I haven't checked whether item can be 
            // reused after being moved.
            tokens.push_back(std::move(item));
    }
    

    容器的选择:

    • std::vector .

    假设分配的内部数组的初始大小为1,并且最终大小为N,则将为log2(N)次分配和释放,并且您将复制(2 ^(log2(N)1) - 1)=( 2N - 1)次 . 正如在Is the poor performance of std::vector due to not calling realloc a logarithmic number of times?中所指出的,当向量的大小不可预测并且可能非常大时,这可能具有差的性能 . 但是,如果你可以估计它的大小,这将不再是一个问题 .

    • std::list .

    对于每个push_back,它消耗的时间是一个常量,但它可能比单独的push_back上的std :: vector花费更多的时间 . 使用每线程内存池和自定义分配器可以缓解此问题 .

    • std::forward_list .

    与std :: list相同,但每个元素占用的内存较少 . 由于缺少API push_back,需要使用包装器类 .

    • std::array .

    如果您可以知道增长的限制,那么您可以使用std :: array . 当然,你不能直接使用它,因为它没有API push_back . 但是你可以定义一个包装器,我认为这是最快的方法,如果你的估计非常准确,可以节省一些内存 .

    • std::deque .

    此选项允许您交换内存以提高性能 . 元素的复制没有(2 ^(N 1) - 1)次,只有N次分配,没有解除分配 . 此外,您将拥有恒定的随机访问时间,并且能够在两端添加新元素 .

    根据std::deque-cppreference

    另一方面,deques通常具有较大的最小内存成本;只保存一个元素的双端队列必须分配其完整的内部数组(例如,64位libstdc上的对象大小的8倍;对象大小的16倍或64位libc中的4096字节,以较大者为准)

    或者您可以使用以下组合:

    • std::vector< std::array<T, 2 ^ M> >

    这类似于std :: deque,区别在于这个容器不支持在前面添加元素 . 但它的性能仍然更快,因为它不会复制底层的std :: array(2 ^(N 1) - 1)次,它只是复制指针数组(2 ^( N - M 1) - 1)次,并且仅在当前已满并且不需要解除分配任何内容时分配新数组 . 顺便说一句,您可以获得恒定的随机访问时间 .

    • std::list< std::array<T, ...> >

    大大缓解了内存碎片的压力 . 它只会在当前已满时分配新数组,并且不需要复制任何内容 . 您仍然需要支付与组合1相比的额外指针的价格 .

    • std::forward_list< std::array<T, ...> >

    与2相同,但与组合1相同的内存 .

  • 4

    你're making the mistaken assumption that your chosen C++ implementation is necessarily faster than Python' s . Python中的字符串处理是高度优化的 . 有关更多信息,请参阅此问题:Why do std::string operations perform poorly?

  • 53

    如果你采用split1实现并将签名更改为与split2更接近,则通过更改:

    void split1(vector<string> &tokens, const string &str, const string &delimiters = " ")
    

    对此:

    void split1(vector<string> &tokens, const string &str, const char delimiters = ' ')
    

    您在split1和split2之间获得了更大的差异,并且更公平的比较:

    split1  C++   : Saw 10000000 lines in 41 seconds.  Crunch speed: 243902
    split2  C++   : Saw 10000000 lines in 144 seconds.  Crunch speed: 69444
    split1' C++   : Saw 10000000 lines in 33 seconds.  Crunch speed: 303030
    
  • 1
    void split5(vector<string> &tokens, const string &str, char delim=' ') {
    
        enum { do_token, do_delim } state = do_delim;
        int idx = 0, tok_start = 0;
        for (string::const_iterator it = str.begin() ; ; ++it, ++idx) {
            switch (state) {
                case do_token:
                    if (it == str.end()) {
                        tokens.push_back (str.substr(tok_start, idx-tok_start));
                        return;
                    }
                    else if (*it == delim) {
                        state = do_delim;
                        tokens.push_back (str.substr(tok_start, idx-tok_start));
                    }
                    break;
    
                case do_delim:
                    if (it == str.end()) {
                        return;
                    }
                    if (*it != delim) {
                        state = do_token;
                        tok_start = idx;
                    }
                    break;
            }
        }
    }
    
  • 3

    我怀疑这与Python中的sys.stdin上的缓冲有关,但是在C实现中没有缓冲 .

    有关如何更改缓冲区大小的详细信息,请参阅此帖子,然后再次尝试比较:Setting smaller buffer size for sys.stdin?

相关问题