使用CUDA减少矩阵行-Java 学习之路

Windows 7, NVidia GeForce 425M.

我写了一个简单的CUDA代码来计算矩阵的行和 . 矩阵具有单维表示（指向浮点的指针） .

代码的串行版本如下（它具有 2 循环，如预期的那样）：

void serial_rowSum (float* m, float* output, int nrow, int ncol) {
    float sum;
    for (int i = 0 ; i < nrow ; i++) {
        sum = 0;
        for (int j = 0 ; j < ncol ; j++)
            sum += m[i*ncol+j];
        output[i] = sum;
    }
}

在CUDA代码中，我调用内核函数按行扫描矩阵 . 下面是内核调用片段：

dim3 threadsPerBlock((unsigned int) nThreadsPerBlock); // has to be multiple of 32
dim3 blocksPerGrid((unsigned int) ceil(nrow/(float) nThreadsPerBlock)); 

kernel_rowSum<<<blocksPerGrid, threadsPerBlock>>>(d_m, d_output, nrow, ncol);

以及执行行的并行求和的内核函数（仍然有 1 循环）：

__global__ void kernel_rowSum(float *m, float *s, int nrow, int ncol) {

    int rowIdx = threadIdx.x + blockIdx.x * blockDim.x;

    if (rowIdx < nrow) {
        float sum=0;
        for (int k = 0 ; k < ncol ; k++)
            sum+=m[rowIdx*ncol+k];
        s[rowIdx] = sum;            
    }

}

到现在为止还挺好 . 串行和并行（CUDA）结果相同 .

重点是CUDA版本花费的时间几乎是串口计算的两倍，即使我更改了 nThreadsPerBlock 参数：我测试 nThreadsPerBlock 从 32 到 1024 （我的卡允许的每个块的最大线程数） .

IMO，矩阵维度足以证明并行化的合理性： 90,000 x 1,000 .

下面，我将报告使用不同 nThreadsPerBlock 的串行和并行版本所用的时间 . 在 msec 中报告的时间平均为 100 样本：

矩阵：nrow = 90000 x ncol = 1000

序列号：每个样本经过的平均时间，单位为毫秒（ 100 样本）： 289.18 .

CUDA（ 32 ThreadsPerBlock）：每个样本经过的平均时间，以毫秒为单位（ 100 样本）： 497.11 .

CUDA（ 1024 ThreadsPerBlock）：每个样本经过的平均时间，以毫秒为单位（ 100 样本）： 699.66 .

以防万一， 32 / 1024 nThreadsPerBlock 的版本是最快/最慢的版本 .

我知道从主机到设备复制时有一种开销，反之亦然，但可能是因为我没有实现最快的代码 .

因为我远不是一个CUDA专家：

Am I coding the fastest version for this task? How could I improve my code? Can I get rid of the loop in the kernel function?

任何想法都赞赏 .

编辑1

虽然我描述了标准 rowSum ，但我对 AND / OR 对具有 (0;1} 值的行的操作感兴趣，例如 rowAND / rowOR . 也就是说，正如一些评论家所建议的那样，它不允许我利用 1 的 COL 列向量技巧 .

编辑2

用户建议其他用户并在此认可：

FORGET ABOUT TRYING TO WRITE YOUR OWN FUNCTIONS ，使用Thrust库而魔法来了 .

3 回答

13
既然你提到你需要除了和之外的一般还原算法 . 我会尝试在这里提供3种方法 . 内核方法可能具有最高性能 . 推力方法最容易实现 . cuBLAS方法仅适用于sum并且具有良好的性能 .

内核方法

Here's a very good doc介绍如何优化标准并行缩减 . 标准减少可分为两个阶段 .
- 多个线程块各减少一部分数据;
- 一个线程块从第1阶段的结果减少到最后的1个元素 .
对于多减少（减少垫子的行）问题，只有阶段1就足够了 . 这个想法是每个线程块减少1行 . 有关每个线程块多行或每个多线程块1行的进一步考虑，可以参考paper provided by @Novak . 这可以更好地改善性能，特别是对于形状不好的矩阵 .

推力方法

一般的多次减少可以在几分钟内通过 thrust::reduction_by_key 完成 . 你可以在这里找到一些讨论Determining the least element and its position in each matrix column with CUDA Thrust .

但是 thrust::reduction_by_key 并不假设每行具有相同的长度，因此您将获得性能损失 . 另一篇文章How to normalize matrix columns in CUDA with max performance?给出了 thrust::reduction_by_key 和行总和的cuBLAS方法之间的分析比较 . 它可以让您对性能有基本的了解 .

cuBLAS方法

矩阵A的行/列的总和可以看作矩阵向量乘法，其中向量的元素都是1 . 它可以用以下matlab代码表示 .
```
y = A * ones(size(A,2),1);
```
其中 y 是A行的总和 .

cuBLAS libary为此操作提供高性能矩阵向量乘法函数cublas<t>gemv() .

时序结果表明，该程序仅比简单读取A的所有元素慢10~50％，这可以看作是该操作性能的理论上限 .
回复于 2024-05-17T00:14:15+08:00
3
如果这是您需要对此数据执行的操作的范围（对行进行求和），我不希望GPU获得相当大的好处 . 每个数据元素只有一个算术运算，为此您需要支付将该数据元素传输到GPU的成本 . 超出一定的问题规模（无论如何保持因为算术强度为O（n），所以你不会从较大的问题规模中获得额外的好处 .

因此，这不是GPU上需要解决的特别激动人心的问题 .

但正如talonmies所指出的那样，你制作它的方式存在一个合并的问题，这会进一步降低速度 . 我们来看一个小例子：
```
C1  C2  C3  C4
R1  11  12  13  14
R2  21  22  23  24
R3  31  32  33  34
R4  41  42  43  44
```
以上是矩阵的一小部分的简单图示示例 . 机器数据存储使得元件（11），（12），（13）和（14）存储在相邻的存储器位置中 .

对于合并访问，我们需要一种访问模式，以便从相同的指令请求相邻的存储器位置，在整个warp上执行 .

我们需要考虑从warp的角度执行代码，即在锁定步骤中执行的32个线程 . 你的代码在做什么？它在每个步骤/指令中检索（要求）哪些元素？我们来看看这行代码：
```
sum+=m[rowIdx*ncol+k];
```
当您创建该变量时，warp中的相邻线程具有 rowIdx 的相邻（即连续）值 . 所以当 k = 0时，当我们尝试检索值 m[rowIdx*ncol+k] 时，每个线程都会询问哪个数据元素？

在块0中，线程0的 rowIdx 为0.线程1的 rowIdx 为1，等等 . 因此，此指令中每个线程要求的值为：
```
Thread:   Memory Location:    Matrix Element:
     0      m[0]                   (11)
     1      m[ncol]                (21)
     2      m[2*ncol]              (31)
     3      m[3*ncol]              (41)
```
但这不是合并访问！元件（11），（21）等在存储器中不相邻 . 对于合并访问，我们希望Matrix Element行如下所示：
```
Thread:   Memory Location:    Matrix Element:
     0      m[?]                   (11)
     1      m[?]                   (12)
     2      m[?]                   (13)
     3      m[?]                   (14)
```
如果您然后向后工作以确定 ? 的值应该是什么，那么您将得到类似这样的指令：
```
sum+=m[k*ncol+rowIdx];
```
这将提供合并访问，但它不会给你正确的答案，因为我们现在总结矩阵列而不是矩阵行 . 我们可以通过将数据存储重新组织为列主要顺序而不是行主顺序来解决此问题 . （你应该可以谷歌那个想法，对吗？）从概念上讲，这相当于转置你的矩阵 m . 这对您来说是否方便，超出了您的问题的范围，正如我所看到的那样，而不是真正的CUDA问题 . 当您在主机上创建矩阵或将矩阵从主机传输到设备时，您可能会做一件简单的事情 . 但总的来说，如果矩阵以行主顺序存储，我不知道如何将矩阵行与100％合并访问相加 . （您可以使用一系列行减少，但这看起来很痛苦 . ）

当我们考虑在GPU上加速代码的方法时，考虑重新组织我们的数据存储以促进GPU，这并不罕见 . 这是一个例子 .

而且，是的，我在这里概述的内容仍然保留了内核中的循环 .

作为补充说明，我建议分别对数据复制部分和内核（计算）部分进行计时 . 我无法从您的问题中判断出您是仅计时内核还是整个（GPU）操作，包括数据副本 . 如果单独计算数据副本的时间，您可能会发现只有数据复制时间超过了CPU时间 . 优化CUDA代码的任何努力都不会影响数据复制时间 . 在您花费大量时间之前，这可能是一个有用的数据点 .
回复于 2024-05-17T00:14:15+08:00

Reducing the rows of a matrix 可以通过三种方式使用 CUDA Thrust 来解决（它们可能不是唯一的方法，但解决这一问题超出了范围） . 同样的OP也认识到，使用CUDA Thrust对于这种问题是优选的 . 此外，使用 cuBLAS 的方法也是可行的 .

APPROACH #1 - reduce_by_key

这是Thrust example page建议的方法 . 它包含使用 make_discard_iterator 的变体 .

APPROACH #2 - transform

这是Robert Crovella在CUDA Thrust: reduce_by_key on only some values in an array, based off values in a “key” array提出的方法 .

APPROACH #3 - inclusive_scan_by_key

这是Eric在How to normalize matrix columns in CUDA with max performance?建议的方法 .

APPROACH #4 - cublas<t>gemv

它使用 cuBLAS gemv 将相关矩阵乘以 1 的列 .

THE FULL CODE

这是缩小两种方法的代码 . Utilities.cu 和 Utilities.cuh 文件被保留here并在此处省略 . TimingGPU.cu 和 TimingGPU.cuh 保持here并且也被省略 .

#include <cublas_v2.h>

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <thrust/random.h>
#include <thrust/sequence.h>

#include <stdio.h>
#include <iostream>

#include "Utilities.cuh"
#include "TimingGPU.cuh"

// --- Required for approach #2
__device__ float *vals;

/**************************************************************/
/* CONVERT LINEAR INDEX TO ROW INDEX - NEEDED FOR APPROACH #1 */
/**************************************************************/
template <typename T>
struct linear_index_to_row_index : public thrust::unary_function<T,T> {

    T Ncols; // --- Number of columns

    __host__ __device__ linear_index_to_row_index(T Ncols) : Ncols(Ncols) {}

    __host__ __device__ T operator()(T i) { return i / Ncols; }
};

/******************************************/
/* ROW_REDUCTION - NEEDED FOR APPROACH #2 */
/******************************************/
struct row_reduction {

    const int Ncols;    // --- Number of columns

    row_reduction(int _Ncols) : Ncols(_Ncols) {}

    __device__ float operator()(float& x, int& y ) {
        float temp = 0.f;
        for (int i = 0; i<Ncols; i++)
            temp += vals[i + (y*Ncols)];
        return temp;
    }
};

/**************************/
/* NEEDED FOR APPROACH #3 */
/**************************/
template<typename T>
struct MulC: public thrust::unary_function<T, T>
{
    T C;
    __host__ __device__ MulC(T c) : C(c) { }
    __host__ __device__ T operator()(T x) { return x * C; }
};

/********/
/* MAIN */
/********/
int main()
{
    const int Nrows = 5;     // --- Number of rows
    const int Ncols = 8;     // --- Number of columns

    // --- Random uniform integer distribution between 10 and 99
    thrust::default_random_engine rng;
    thrust::uniform_int_distribution<int> dist(10, 99);

    // --- Matrix allocation and initialization
    thrust::device_vector<float> d_matrix(Nrows * Ncols);
    for (size_t i = 0; i < d_matrix.size(); i++) d_matrix[i] = (float)dist(rng);

    TimingGPU timerGPU;

    /***************/
    /* APPROACH #1 */
    /***************/
    timerGPU.StartCounter();
    // --- Allocate space for row sums and indices
    thrust::device_vector<float> d_row_sums(Nrows);
    thrust::device_vector<int> d_row_indices(Nrows);

    // --- Compute row sums by summing values with equal row indices
    //thrust::reduce_by_key(thrust::make_transform_iterator(thrust::counting_iterator<int>(0), linear_index_to_row_index<int>(Ncols)),
    //                    thrust::make_transform_iterator(thrust::counting_iterator<int>(0), linear_index_to_row_index<int>(Ncols)) + (Nrows*Ncols),
    //                    d_matrix.begin(),
    //                    d_row_indices.begin(),
    //                    d_row_sums.begin(),
    //                    thrust::equal_to<int>(),
    //                    thrust::plus<float>());

    thrust::reduce_by_key(
                thrust::make_transform_iterator(thrust::make_counting_iterator(0), linear_index_to_row_index<int>(Ncols)),
                thrust::make_transform_iterator(thrust::make_counting_iterator(0), linear_index_to_row_index<int>(Ncols)) + (Nrows*Ncols),
                d_matrix.begin(),
                thrust::make_discard_iterator(),
                d_row_sums.begin());

    printf("Timing for approach #1 = %f\n", timerGPU.GetCounter());

    // --- Print result
    for(int i = 0; i < Nrows; i++) {
        std::cout << "[ ";
        for(int j = 0; j < Ncols; j++)
            std::cout << d_matrix[i * Ncols + j] << " ";
        std::cout << "] = " << d_row_sums[i] << "\n";
    }

    /***************/
    /* APPROACH #2 */
    /***************/
    timerGPU.StartCounter();
    thrust::device_vector<float> d_row_sums_2(Nrows, 0);
    float *s_vals = thrust::raw_pointer_cast(&d_matrix[0]);
    gpuErrchk(cudaMemcpyToSymbol(vals, &s_vals, sizeof(float *)));
    thrust::transform(d_row_sums_2.begin(), d_row_sums_2.end(), thrust::counting_iterator<int>(0),  d_row_sums_2.begin(), row_reduction(Ncols));

    printf("Timing for approach #2 = %f\n", timerGPU.GetCounter());

    for(int i = 0; i < Nrows; i++) {
        std::cout << "[ ";
        for(int j = 0; j < Ncols; j++)
            std::cout << d_matrix[i * Ncols + j] << " ";
        std::cout << "] = " << d_row_sums_2[i] << "\n";
    }

    /***************/
    /* APPROACH #3 */
    /***************/

    timerGPU.StartCounter();
    thrust::device_vector<float> d_row_sums_3(Nrows, 0);
    thrust::device_vector<float> d_temp(Nrows * Ncols);
    thrust::inclusive_scan_by_key(
                thrust::make_transform_iterator(thrust::make_counting_iterator(0), linear_index_to_row_index<int>(Ncols)),
                thrust::make_transform_iterator(thrust::make_counting_iterator(0), linear_index_to_row_index<int>(Ncols)) + (Nrows*Ncols),
                d_matrix.begin(),
                d_temp.begin());
    thrust::copy(
                thrust::make_permutation_iterator(
                        d_temp.begin() + Ncols - 1,
                        thrust::make_transform_iterator(thrust::make_counting_iterator(0), MulC<int>(Ncols))),
    thrust::make_permutation_iterator(
                        d_temp.begin() + Ncols - 1,
                        thrust::make_transform_iterator(thrust::make_counting_iterator(0), MulC<int>(Ncols))) + Nrows,
                d_row_sums_3.begin());

    printf("Timing for approach #3 = %f\n", timerGPU.GetCounter());

    for(int i = 0; i < Nrows; i++) {
        std::cout << "[ ";
        for(int j = 0; j < Ncols; j++)
            std::cout << d_matrix[i * Ncols + j] << " ";
        std::cout << "] = " << d_row_sums_3[i] << "\n";
    }

    /***************/
    /* APPROACH #4 */
    /***************/
    cublasHandle_t handle;

    timerGPU.StartCounter();
    cublasSafeCall(cublasCreate(&handle));

    thrust::device_vector<float> d_row_sums_4(Nrows);
    thrust::device_vector<float> d_ones(Ncols, 1.f);

    float alpha = 1.f;
    float beta  = 0.f;
    cublasSafeCall(cublasSgemv(handle, CUBLAS_OP_T, Ncols, Nrows, &alpha, thrust::raw_pointer_cast(d_matrix.data()), Ncols, 
                               thrust::raw_pointer_cast(d_ones.data()), 1, &beta, thrust::raw_pointer_cast(d_row_sums_4.data()), 1));

    printf("Timing for approach #4 = %f\n", timerGPU.GetCounter());

    for(int i = 0; i < Nrows; i++) {
        std::cout << "[ ";
        for(int j = 0; j < Ncols; j++)
            std::cout << d_matrix[i * Ncols + j] << " ";
        std::cout << "] = " << d_row_sums_4[i] << "\n";
    }

    return 0;
}

TIMING RESULTS （在开普勒K20c上测试过）

Matrix size       #1     #1-v2     #2     #3     #4     #4 (no plan)
100  x 100        0.63   1.00     0.10    0.18   139.4  0.098
1000 x 1000       1.25   1.12     3.25    1.04   101.3  0.12
5000 x 5000       8.38   15.3     16.05   13.8   111.3  1.14

 100 x 5000       1.25   1.52     2.92    1.75   101.2  0.40    

5000 x 100        1.35   1.99     0.37    1.74   139.2  0.14

除少数列的情况外，方法＃1和＃3似乎优于方法＃2 . 然而，最好的方法是方法＃4，它比其他方法更方便，只要创建时间所需的时间计划可以在计算期间摊销 .

回复于 2024-05-17T00:14:15+08:00

使用CUDA减少矩阵行

编辑1

编辑2

3 回答

内核方法

推力方法

cuBLAS方法

相关问题