Optimizing parallel loop performance

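I have a scenario where I need to iterate over roughly 64 million combinations and, for each combination, run the same processing logic over 64,000 data items.

I noticed that, depending on how I structure the loops, performance can get better or worse, even when the loops run in parallel.

Here are three cases:

Common data: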

int numberofSets = 3;

int set1ElementCount = 5840;
int set2ElementCount = 5840;
int set3ElementCount = 2;

int combinationsCount = 68211200; // = 5840 * 5840 * 2
int dataCount = 64000;

Parallel execution

Code:

int[,] combinations = new int[combinationsCount, numberofSets];
// combinations = generator.Generate(); // generate combinations
/* generated format is:
[0,0,0]
[0,0,1]
[1,0,0]
...
[5839, 5839, 1]
*/
// iterate over the combinations
Parallel.For(0, combinationsCount, (idx, state) =>
{
    int idx1 = combinations[idx, 0]; // a bit of hardcoding here since we have 3 sets of data
    int idx2 = combinations[idx, 1];
    int idx3 = combinations[idx, 2]; // third column (valid column indices are 0..2)

    // process the data set for each combination
    for (int i = 0; i < dataCount; i++) {
        // do something
    }
});
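In case 1 the combinations array holds 68,211,200 x 3 ints, which is roughly 800 MB, and every iteration has to read from it. A minimal sketch of an alternative, assuming the combinations are simply the full Cartesian product of the three index ranges with the last index varying fastest (an assumption, the generator's real ordering may differ): decode the flat loop index arithmetically instead of looking it up. The class and method names here are hypothetical.

```csharp
using System;
using System.Threading.Tasks;

public static class CombinationIndex
{
    // Decodes a flat index into (idx1, idx2, idx3).
    // Assumption: idx3 varies fastest, then idx2, then idx1.
    public static (int idx1, int idx2, int idx3) Decode(int flat, int set2Count, int set3Count)
    {
        int idx3 = flat % set3Count;
        int rest = flat / set3Count;
        int idx2 = rest % set2Count;
        int idx1 = rest / set2Count;
        return (idx1, idx2, idx3);
    }

    public static void Main()
    {
        const int set1Count = 5840, set2Count = 5840, set3Count = 2;
        int combinationsCount = set1Count * set2Count * set3Count; // 68,211,200 fits in an int

        Parallel.For(0, combinationsCount, flat =>
        {
            var (idx1, idx2, idx3) = Decode(flat, set2Count, set3Count);
            // process the data set for this combination here
        });
    }
}
```

Whether this actually helps depends on what the inner loop does; it only removes the large lookup table, not the per-iteration delegate call.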

Result:

~15 minutes to process 1% of the data

Parallel execution 2

Code:

// iterate set 1 in parallel
Parallel.For(0, set1ElementCount, (idx1, state) =>
{
    // iterate set 2
    for (int idx2 = 0; idx2 < set2ElementCount; idx2++)
    {
        // iterate set 3
        for (int idx3 = 0; idx3 < set3ElementCount; idx3++)
        {
            // process the data set for each combination
            for (int i = 0; i < dataCount; i++)
            {
                // do something
            }
        }
    }
});

Result:

~10 minutes to process 1% of the data

Single-threaded execution

Code:

// iterate set 1
for (int idx1 = 0; idx1 < set1ElementCount; idx1++)
{
    // iterate set 2
    for (int idx2 = 0; idx2 < set2ElementCount; idx2++)
    {
        // iterate set 3
        for (int idx3 = 0; idx3 < set3ElementCount; idx3++)
        {
            // process the data set for each combination
            for (int i = 0; i < dataCount; i++)
            {
                // do something
            }
        }
    }
}

Result:

More than 25 minutes to process 1% of the data

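The single-threaded approach is the slowest, which is understandable. What I mainly care about is multithreaded performance and optimization.

I think I can explain the difference between cases 1 and 2 by the time it takes to spawn a new thread. So even if the data itself is processed quickly, creating and tearing down threads may still be inefficient.

My second conclusion is that sometimes it makes sense to put a heavier CPU load on each thread (as in case 2), so that no unnecessary time is spent creating new threads.

This brings me to my question.

To get the best performance, how do I find this limit, that is, the amount of data each thread should process? Is there a way to determine how much work a single thread should ideally do?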
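One way to experiment with the amount of work per worker is to hand out contiguous index ranges of a chosen size rather than single indices. The sketch below uses `Partitioner.Create(fromInclusive, toExclusive, rangeSize)` from `System.Collections.Concurrent` together with `Parallel.ForEach`; the helper name `ForEachRange` and the range size used in `Main` are illustrative assumptions, not recommended values. A suitable size would have to be found by measuring.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

public static class RangePartitionSketch
{
    // Calls `body` once per index in [0, count), giving each worker contiguous
    // ranges of `rangeSize` indices. Returns the number of ranges processed.
    public static int ForEachRange(int count, int rangeSize, Action<int> body)
    {
        int ranges = 0;
        var partitioner = Partitioner.Create(0, count, rangeSize);
        Parallel.ForEach(partitioner, range =>
        {
            Interlocked.Increment(ref ranges);
            // range.Item1 is inclusive, range.Item2 is exclusive
            for (int idx = range.Item1; idx < range.Item2; idx++)
                body(idx);
        });
        return ranges;
    }

    public static void Main()
    {
        long processed = 0;
        int ranges = ForEachRange(1_000_000, 10_000,
            idx => Interlocked.Increment(ref processed));
        Console.WriteLine($"{processed} items in {ranges} ranges");
    }
}
```

Timing the same workload with a few different `rangeSize` values (for example with `System.Diagnostics.Stopwatch`) gives an empirical answer for a particular machine and workload.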

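For example, in the sample above I have only 3 sets, two of which are identical. There could be 10 sets, each with different data. In that case there would be 100 combinations of how the loops could be configured:

Set 1 in parallel, the rest single-threaded
Set 2 in parallel, the rest single-threaded
Sets 1 and 2 in parallel, the rest single-threaded
...

Does it make sense to use some other approach to get the best loop performance? Thread queues? Tasks?

Edit: As @arekzyla suggested, I tried splitting the data into chunks. To do that I had to change the combinations array type from multidimensional (int[,]) to jagged (int[][]) and then add the chunking logic: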

// needs: using System.Linq; (Skip, Take, ToArray) and using System.Collections.Generic; (List<T>)
var combinationsCount = combinations.Length; // combinations is int[][] here, so Length is the row count
int coreCount = 4;
int chunkSize = combinationsCount / coreCount;
List<int[][]> chunked = new List<int[][]>();
for (int i = 0; i < coreCount; i++)
{
    int skip = i * chunkSize;
    int take = chunkSize;

    // the last chunk also takes the remainder left by the integer division
    int diff = (combinationsCount - skip) - take;
    if (diff < chunkSize)
        take = take + diff;

    var sub = combinations.Skip(skip).Take(take).ToArray();

    chunked.Add(sub);
}

// iterate chunks, at most coreCount of them at a time
Parallel.For(0, coreCount, new ParallelOptions() { MaxDegreeOfParallelism = coreCount }, (chunkIndex, state) =>
{
    var chunk = chunked[chunkIndex];
    int chunkLength = chunk.Length;

    // iterate combinations per chunk
    for (int idx = 0; idx < chunkLength; idx++)
    {
        // iterate data here
    }
});
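The chunking above copies every chunk into a new array via Skip/Take/ToArray. A minimal sketch of the same idea without copying, assuming each worker can simply be given a [start, end) index range into the original array; the `ChunkBounds` helper is hypothetical and spreads the remainder over the first chunks instead of adding it all to the last one.

```csharp
using System;
using System.Threading.Tasks;

public static class ChunkBounds
{
    // Returns the half-open range [start, end) for chunk `chunkIndex` of `chunkCount`,
    // giving the first (total % chunkCount) chunks one extra element each.
    public static (int start, int end) Get(int total, int chunkCount, int chunkIndex)
    {
        int baseSize = total / chunkCount;
        int remainder = total % chunkCount;
        int start = chunkIndex * baseSize + Math.Min(chunkIndex, remainder);
        int end = start + baseSize + (chunkIndex < remainder ? 1 : 0);
        return (start, end);
    }

    public static void Main()
    {
        int total = 68211200;                        // combinationsCount from the question
        int chunkCount = Environment.ProcessorCount; // assumption: one chunk per logical core

        Parallel.For(0, chunkCount, chunkIndex =>
        {
            var (start, end) = Get(total, chunkCount, chunkIndex);
            for (int idx = start; idx < end; idx++)
            {
                // process combination idx here
            }
        });
    }
}
```

With equal-sized static chunks the total time is bounded by the slowest chunk, so this only pays off when every combination costs about the same.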

Result:

~9 minutes

Well, a small improvement is still an improvement, although I was hoping for more.
