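I am currently computing a small quantity for every value of a large matrix (millions of rows, fewer than 1000 columns), treating each row independently.
More precisely, for each value M(i,j) in row i and column j of this matrix, the quantity is simply [M(i,j) - mean(i,s)] / std(i,s), where s is the subset M(i,:) - j. In other words, s is the set of all values of row i except the value in column j.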
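Written out, and taking the conventions from the listings in this post (a sample standard deviation over the remaining values, plus an extra factor in the denominator), the quantity can be sketched as follows. The symbols n (number of columns), T, the bar notation, and s with subscript (i,-j) are notation introduced here for illustration only.

```latex
\bar{M}_{i,-j} = \frac{1}{n-1} \sum_{k \neq j} M_{ik}, \qquad
s_{i,-j} = \sqrt{\frac{1}{n-2} \sum_{k \neq j} \left( M_{ik} - \bar{M}_{i,-j} \right)^{2}}, \qquad
T_{ij} = \frac{M_{ij} - \bar{M}_{i,-j}}{\sqrt{1 + \frac{1}{n-1}} \; s_{i,-j}}
```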
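I compared two implementations, one with plain C-style arrays and one with Armadillo, and the Armadillo version takes roughly twice as long to run. I would expect a similar or slightly slower execution time, but plain C arrays seem to improve performance dramatically.
Is there a particular reason for this, or something I am missing? Here is an example, compiled with: -O2 -lstdc++ -DARMA_DONT_USE_WRAPPER -lopenblas -llapack -lm. I also tried defining ARMA_NO_DEBUG, without success.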
#include <string>
#include <vector>
#include <iostream>
#include <fstream>
#include <algorithm>
#include <cmath>    // std::sqrt
#include <cstdlib>  // malloc, free
#include <armadillo>
#include <chrono>
using namespace std::chrono;
/***************************
* main()
***************************/
int main( int argc, char *argv[] )
{
    unsigned nrows = 2000000; //number of rows
    unsigned ncols = 100; //number of cols
    const arma::mat huge_mat = arma::randn(nrows, ncols); //create huge matrix
    const arma::uvec vec = arma::linspace<arma::uvec>( 0, huge_mat.n_cols-1, huge_mat.n_cols); //create a vector of [0,...,n]
    arma::rowvec inds = arma::zeros<arma::rowvec>( huge_mat.n_cols-1 ); //-1 since we remove only one value at each step.
    arma::colvec simuT = arma::zeros<arma::colvec>( ncols ); //let's store the results in this simuT vector.
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    //compute some normalization over each value of line of this huge matrix:
    for(unsigned i=0; i < nrows; i++) {
        const arma::rowvec current_line = huge_mat.row(i); //extract current line
        //for each observation in current_line:
        for(unsigned j=0; j < ncols; j++) {
            //Take care of side effects first:
            if( j == 0 ) {
                inds = current_line(arma::span(1, ncols-1));
            } else if( j == 1 ) {
                inds(0) = current_line(0);
                inds(arma::span(1, ncols-2)) = current_line( arma::span(2, ncols-1) );
            } else {
                inds(arma::span(0, j-1)) = current_line( arma::span(0, j-1) );
            }
            //Let's do some computation: huge_mat(i,j) - mean[huge_mat(i,:)] / std([huge_mat(i,:)])
            //can compute the mean and std first... for each line.
            simuT(j) = (current_line(j) - arma::mean(inds)) / ( std::sqrt( 1+1/((double) ncols-1) ) * arma::stddev(inds) );
        }
    }
    high_resolution_clock::time_point t2 = high_resolution_clock::now();
    auto duration = duration_cast<seconds>( t2 - t1 ).count();
    std::cout << "ARMADILLO: " << duration << " secs\n";
    //------------------PLAIN C Array
    double *Mat_full;
    double *output;
    unsigned int i,j,k;
    double mean=0, stdd=0;
    double sq_diff_sum = 0, sum=0;
    double diff = 0;
    Mat_full = (double *) malloc(ncols * nrows * sizeof(double));
    output = (double *) malloc(nrows * ncols * sizeof(double));
    std::vector< std::vector<double> > V(huge_mat.n_rows);
    //Some UGLY copy from arma::mat to double* using a vector:
    for (size_t i = 0; i < huge_mat.n_rows; ++i)
        V[i] = arma::conv_to< std::vector<double> >::from(huge_mat.row(i));
    //then dump to Mat_full array:
    for (i=0; i < V.size(); i++)
        for (j=0; j < V[i].size(); j++)
            Mat_full[i + huge_mat.n_rows * j] = V[i][j];
    t1 = high_resolution_clock::now();
    for(i=0; i < nrows; i++)
        for(j=0; j < ncols; j++)
        {
            //compute mean of subset-------------------
            sum = 0;
            for(k = 0; k < ncols; k++)
                if(k!=j)
                {
                    sum = sum + Mat_full[i+k*nrows];
                }
            mean = sum / (ncols-1);
            //compute standard deviation of subset-----
            sq_diff_sum = 0;
            for(k = 0; k < ncols; k++)
                if(k!=j)
                {
                    diff = Mat_full[i+k*nrows] - mean;
                    sq_diff_sum += diff * diff;
                }
            stdd = std::sqrt(sq_diff_sum / (ncols-2));
            //export to plain C array:
            output[i*ncols+j] = (Mat_full[i+j*nrows] - mean) / (std::sqrt(1+1/(((double) ncols)-1))*stdd);
        }
    t2 = high_resolution_clock::now();
    duration = duration_cast<seconds>( t2 - t1 ).count();
    std::cout << "C ARRAY: " << duration << " secs\n";
    //release the plain C buffers:
    free(Mat_full);
    free(output);
}
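In particular, the calls to arma::mean and arma::stddev seem to perform poorly when comparing execution times. I have not done any in-depth analysis of how the matrix dimensions affect performance, but for small values of nrows plain C tends to be (much) faster. For a simple test with this setup, I get the following execution times:
ARMADILLO: 111 secs
C ARRAY: 79 secs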
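The comment in the listing ("can compute the mean and std first... for each line") hints at one way to cut the work per row: compute the mean and the centered sum of squares of the full row once, then adjust them for each removed value. What follows is only a minimal sketch of that idea in plain C++, not the code that was timed. The function name leave_one_out_scores is hypothetical, and the sample standard deviation (divisor n-2) and the sqrt(1 + 1/(n-1)) factor are assumed to match the listing.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of the original post or of Armadillo.
// For one row x of length n, returns for each j:
//   (x[j] - mean_without_j) / (sqrt(1 + 1/(n-1)) * std_without_j)
// where std_without_j is the sample standard deviation (divisor n-2)
// of the n-1 values that remain after dropping x[j].
std::vector<double> leave_one_out_scores(const std::vector<double>& x)
{
    const std::size_t n = x.size();
    std::vector<double> out(n, 0.0);
    if (n < 3) return out;  // fewer than two remaining values: std undefined

    // Pass 1: mean of the full row.
    double sum = 0.0;
    for (double v : x) sum += v;
    const double mean_all = sum / static_cast<double>(n);

    // Pass 2: centered sum of squares of the full row.
    double css_all = 0.0;
    for (double v : x) {
        const double d = v - mean_all;
        css_all += d * d;
    }

    const double m = static_cast<double>(n - 1);  // size of each subset
    const double scale = std::sqrt(1.0 + 1.0 / m);

    // Pass 3: downdate the full-row statistics for each removed element.
    for (std::size_t j = 0; j < n; ++j) {
        const double d = x[j] - mean_all;
        const double mean_j = mean_all - d / m;
        // Removing x[j] lowers the centered sum of squares by d*d*n/(n-1);
        // clamp at zero to absorb rounding error.
        const double css_j =
            std::max(0.0, css_all - d * d * static_cast<double>(n) / m);
        const double std_j = std::sqrt(css_j / (m - 1.0));
        out[j] = (x[j] - mean_j) / (scale * std_j);  // inf/NaN if std_j == 0
    }
    return out;
}
```

With this approach each row costs three linear passes instead of a full mean and standard deviation per element. Whether it changes the Armadillo versus plain C comparison would have to be measured.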
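EDIT: Here is a modified version that works column-wise instead of row-wise and handles each column independently, as suggested by @rubenvb and @mtall. The resulting execution time decreases slightly (now ARMADILLO: 104 secs), which shows some improvement over the row-wise version: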
#include <string>
#include <vector>
#include <iostream>
#include <fstream>
#include <algorithm>
#include <cmath>    // std::sqrt
#include <armadillo>
#include <chrono>
using namespace std::chrono;
/***************************
* main()
***************************/
int main( int argc, char *argv[] )
{
    unsigned nrows = 100; //number of rows
    unsigned ncols = 2000000; //number of cols
    const arma::mat huge_mat = arma::randn(nrows, ncols); //create huge matrix
    const arma::uvec vec = arma::linspace<arma::uvec>( 0, huge_mat.n_rows-1, huge_mat.n_rows); //create a vector of [0,...,n]
    arma::colvec inds = arma::zeros<arma::colvec>( huge_mat.n_rows-1 ); //-1 since we remove only one value at each step.
    arma::rowvec simuT = arma::zeros<arma::rowvec>( nrows ); //let's store the results in this simuT vector.
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    //compute some normalization over each value of each column of this huge matrix:
    for(unsigned i=0; i < ncols; i++) {
        const arma::colvec current_line = huge_mat.col(i); //extract current column
        //for each observation in current_line:
        for(unsigned j=0; j < nrows; j++) {
            //Take care of side effects first:
            if( j == 0 ) {
                inds = current_line(arma::span(1, nrows-1));
            } else if( j == 1 ) {
                inds(0) = current_line(0);
                inds(arma::span(1, nrows-2)) = current_line( arma::span(2, nrows-1) );
            } else {
                inds(arma::span(0, j-1)) = current_line( arma::span(0, j-1) );
            }
            //Let's do some computation: huge_mat(j,i) - mean[huge_mat(:,i)] / std([huge_mat(:,i)])
            //can compute the mean and std first... for each column.
            simuT(j) = (current_line(j) - arma::mean(inds)) / ( std::sqrt( 1+1/((double) nrows-1) ) * arma::stddev(inds) );
        }
    }
    high_resolution_clock::time_point t2 = high_resolution_clock::now();
    auto duration = duration_cast<seconds>( t2 - t1 ).count();
    std::cout << "ARMADILLO: " << duration << " secs\n";
}
1 Answer
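The reason is that Armadillo uses column-major ordering in mat, whereas your C array uses row-major ordering. This matters a lot, because your processor can use instruction vectorization to process several elements at once, and that requires contiguous blocks of memory.
To verify that this is the cause, do the same computation over columns instead of rows and check the difference.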
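To make the layout argument concrete, here is a small sketch of how the two orderings map an element (i, j) to an offset in a flat buffer, and what that means for walking a row versus a column. All function names are hypothetical and exist only for illustration. The one external fact relied on is that arma::mat stores its elements column by column.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helpers for illustration only.

// Offset of element (i, j) when columns are stored one after another
// (column-major, the layout used by arma::mat).
inline std::size_t col_major_index(std::size_t i, std::size_t j, std::size_t nrows)
{
    return i + j * nrows;
}

// Offset of element (i, j) when rows are stored one after another (row-major).
inline std::size_t row_major_index(std::size_t i, std::size_t j, std::size_t ncols)
{
    return i * ncols + j;
}

// Sum of row i in a column-major buffer:
// consecutive reads are nrows elements apart (strided access).
double sum_row_col_major(const std::vector<double>& a,
                         std::size_t nrows, std::size_t ncols, std::size_t i)
{
    double s = 0.0;
    for (std::size_t j = 0; j < ncols; ++j) s += a[col_major_index(i, j, nrows)];
    return s;
}

// Sum of column j in a column-major buffer:
// consecutive reads are adjacent in memory (contiguous access).
double sum_col_col_major(const std::vector<double>& a,
                         std::size_t nrows, std::size_t j)
{
    double s = 0.0;
    for (std::size_t i = 0; i < nrows; ++i) s += a[col_major_index(i, j, nrows)];
    return s;
}
```

For comparison with the question's listing, Mat_full[i + k*nrows] has the same form as col_major_index, and output[i*ncols + j] has the same form as row_major_index.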