
Julia vs MATLAB: why is my Julia code so slow?

I just started using Julia and translated my MATLAB code to Julia (basically line by line). I noticed that the Julia code is much slower (like 50x). The original problem is a dynamic programming problem in which I interpolate the value function, and that interpolation is where the code spends most of its time. So I tried to build a minimal example that shows the performance difference. The important things to note are that it is a spline approximation for the interpolation, and that the grid is preferably irregular, i.e. not equally spaced. MATLAB code:

tic
spacing=1.5;
Nxx = 300; 
Naa = 350;
Nalal = 200; 
sigma = 10;
NoIter = 500;

xx=NaN(Nxx,1);
xmin = 0.01;
xmax = 400;
xx(1) = xmin;
for i=2:Nxx
    xx(i) = xx(i-1) + (xmax-xx(i-1))/((Nxx-i+1)^spacing);
end

f_U = @(c) c.^(1-sigma)/(1-sigma);  
W=NaN(Nxx,1);
W(:,1) = f_U(xx);

xprime = ones(Nalal,Naa);
for i=1:NoIter
     W_temp = interp1(xx,W(:,1),xprime,'spline');
end
toc

Elapsed time is 0.242288 seconds.

Julia code:

using Dierckx
function performance()

const spacing=1.5 
const Nxx = 300 
const Naa = 350
const Nalal = 200 
const sigma = 10 
const NoIter = 500 

xx=Array(Float64,Nxx)
xmin = 0.01
xmax = 400
xx[1] = xmin
for i=2:Nxx
    xx[i] = xx[i-1] + (xmax-xx[i-1])/((Nxx-i+1)^spacing)
end

f_U(c) =  c.^(1-sigma)/(1-sigma)
W=Array(Float64,Nxx,1)
W[:,1] = f_U(xx)

W_temp = Array(Float64,Nalal,Naa)

interp_spline = Spline1D(xx,W[:,1])
xprime = ones(Nalal,Naa)
for i=1:NoIter
    for j=1:Naa
        for k=1:Nalal
            W_temp[k,j] = evaluate(interp_spline, xprime[k,j])
        end
    end
end

end

@time(performance())
4.200732 seconds (70.02 M allocations: 5.217 GB, 4.35% gc time)

I made W a two-dimensional array because in the original problem it is a matrix. I did some research on the different interpolation packages, but there are not many options for irregular grids combined with splines. An equivalent of MATLAB's interp1 is apparently not available (one possible alternative is sketched just below).
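
One package I came across that does accept an irregular 1-D grid is Interpolations.jl, although as far as I can tell its Gridded mode only supports constant/linear order, not cubic splines, so it is not a drop-in replacement. A rough sketch on a recent Julia version, reusing xx, W and xprime as defined above (the exact call should be checked against the package documentation):

using Interpolations

# Linear (not spline) interpolation on the irregular grid xx; itp is callable
# and can be broadcast over the whole xprime matrix.
itp = interpolate((xx,), W[:,1], Gridded(Linear()))
W_temp_lin = itp.(xprime)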

My issue is that I assumed that if I write Julia code and it gives the same results as MATLAB, then Julia should be faster. But apparently that is not the case, so you have to pay attention to how you code. I am not a programmer; of course I try to do my best, but I would like to know whether I am making some obvious mistake here that can easily be fixed, or whether it will happen (too) often that I have to pay close attention to my Julia coding, because then it might not be worth it for me to learn it. By the same token, if I can make Julia faster here (and I am pretty sure I can, e.g. the allocations look rather large), I could probably also make MATLAB faster. My hope for Julia was that, for similar code, it would run faster than MATLAB.

After some comments about the timing, I also ran this code:

using Dierckx

tic()
const spacing=1.5 
const Nxx = 300 
const Naa = 350
const Nalal = 200 
const sigma = 10 
const NoIter = 500 

xx=Array(Float64,Nxx)
xmin = 0.01
xmax = 400
xx[1] = xmin
for i=2:Nxx
     xx[i] = xx[i-1] + (xmax-xx[i-1])/((Nxx-i+1)^spacing)
end

f_U(c) =  c.^(1-sigma)/(1-sigma)
W=Array(Float64,Nxx,1)
W[:,1] = f_U(xx)

W_temp = Array(Float64,Nalal,Naa)

interp_spline = Spline1D(xx,W[:,1])
xprime = ones(Nalal,Naa)
for i=1:NoIter
    for j=1:Naa
        for k=1:Nalal
            W_temp[k,j] = evaluate(interp_spline, xprime[k,j])
        end
    end
end
toc()

elapsed time: 
7.336371495 seconds

Even slower, hmm...
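
A likely explanation for this: with tic()/toc() everything runs in global scope, so every access to the untyped globals is slow, and the measured time also includes compiling Spline1D, evaluate, etc. on their first use. Timing the function version again after a warm-up call separates compilation from the run itself; a minimal sketch:

performance()        # first call: includes JIT compilation of performance() and the Dierckx calls
@time performance()  # second call: measures just the run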

Another edit: in this case, eliminating one loop actually makes it faster, but it is still not comparable to MATLAB. Code:

function performance2()

const spacing=1.5 
const Nxx = 300 
const Naa = 350
const Nalal = 200 
const sigma = 10 
const NoIter = 500 

xx=Array(Float64,Nxx)
xmin = 0.01
xmax = 400
xx[1] = xmin
for i=2:Nxx
    xx[i] = xx[i-1] + (xmax-xx[i-1])/((Nxx-i+1)^spacing)
end

f_U(c) =  c.^(1-sigma)/(1-sigma)
W=Array(Float64,Nxx,1)
W[:,1] = f_U(xx)

W_temp = Array(Float64,Nalal,Naa)

interp_spline = Spline1D(xx,W[:,1])
xprime = ones(Nalal,Naa)
for i=1:NoIter
    for j=1:Naa
        W_temp[:,j] = evaluate(interp_spline, xprime[:,j])
    end
end

end

@time(performance2())
1.573347 seconds (700.04 k allocations: 620.643 MB, 1.08% gc time)

Another edit, now going through the loop the same number of times:

function performance3()

const spacing=1.5 
const Nxx = 300 
const Naa = 350
const Nalal = 200 
const sigma = 10 
const NoIter = 500 

xx=Array(Float64,Nxx)
xmin = 0.01
xmax = 400
xx[1] = xmin
for i=2:Nxx
    xx[i] = xx[i-1] + (xmax-xx[i-1])/((Nxx-i+1)^spacing)
end

f_U(c) =  c.^(1-sigma)/(1-sigma)
W=Array(Float64,Nxx,1)
W[:,1] = f_U(xx)

W_temp = Array(Float64,Nalal,Naa)
W_tempr = Array(Float64, Nalal*Naa)

interp_spline = Spline1D(xx,W[:,1])
xprime = ones(Nalal,Naa)
xprimer = reshape(xprime, Nalal*Naa)

for i=1:NoIter
    W_tempr = evaluate(interp_spline, xprimer)
end

W_temp = reshape(W_tempr, Nalal, Naa)
end

tic()
performance3()
toc()

elapsed time: 
1.480349334 seconds

Still nowhere near MATLAB. By the way, in my actual problem I easily run the loop 50k times and I access bigger xprime matrices, although I am not sure whether that part makes a difference.

1 Answer

    Since I am also learning Julia, I have tried to speed up the OP's code (as an exercise for myself!). It seems that the bottleneck is essentially the underlying Fortran code. To verify this, I first rewrote the OP's code into a minimal form, as follows:

    using Dierckx
    
    function perf()
    
        Nx = 300 
    
        xinp = Float64[ 2pi * i / Nx for i = 1:Nx ]
        yinp = sin( xinp )
    
        interp = Spline1D( xinp, yinp )
    
        Nsample = 200 * 350
    
        x = ones( Nsample ) * pi / 6
        y = zeros( Nsample )
    
        for i = 1:500
            y[:] = evaluate( interp, x )
        end
    
        @show y[1:3]  # The result should be 0.5 (= sin(pi/6))
    end
    
    @time perf()
    @time perf()
    @time perf()
    

    where the size of the problem is kept the same while the input x and y coordinates are changed so that the result is trivial (0.5). On my machine the result is

    y[1:3] = [0.49999999999999994,0.49999999999999994,0.49999999999999994]
      1.886956 seconds (174.20 k allocations: 277.414 MB, 3.55% gc time)
    y[1:3] = [0.49999999999999994,0.49999999999999994,0.49999999999999994]
      1.659290 seconds (1.56 k allocations: 269.295 MB, 0.39% gc time)
    y[1:3] = [0.49999999999999994,0.49999999999999994,0.49999999999999994]
      1.648145 seconds (1.56 k allocations: 269.295 MB, 0.28% gc time)
    

    From now on, for brevity, I will omit the y[1:3] line (I have confirmed that the y[1:3] obtained is correct in all cases). If we replace evaluate() with copy!(y,x) (the modified loop is sketched right after the timings below), the result becomes

    0.206723 seconds (168.26 k allocations: 10.137 MB, 10.27% gc time)
      0.023068 seconds (60 allocations: 2.198 MB)
      0.023013 seconds (60 allocations: 2.198 MB)
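
    The modification is just the one line inside the timing loop; a sketch to make the baseline explicit:

    # Baseline measurement: replace the spline evaluation with a plain copy,
    # so that only the loop overhead and the copying are timed.
    for i = 1:500
        copy!( y, x )        # instead of y[:] = evaluate( interp, x )
    end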
    

    So essentially all of the time is spent in evaluate(). Now looking at the original code of this routine, we see that it calls splev() in Fortran, which in turn calls fpbspl() (both originally from Netlib). These routines are quite old (written around 1990) and do not seem well optimized for current computers (for example, there are many IF branches, and vectorization may be difficult...). Since the code is not exactly simple, I tried parallelizing it with OpenMP. The modified splev() looks like this, where the input points are simply divided among threads:

    subroutine splev(t,n,c,k,x,y,m,e,ier)
    c  subroutine splev evaluates in a number of points x(i),i=1,2,...,m
    c  a spline s(x) of degree k, given in its b-spline representation.
    (... same here ...)
    
    c  main loop for the different points.
    c$omp parallel do default(shared)
    c$omp.firstprivate(arg,ier,l1,l,ll,sp,h) private(i,j)
          do i = 1, m
    
    c  fetch a new x-value arg.
            arg = x(i)
    c  check if arg is in the support
            if (arg .lt. tb .or. arg .gt. te) then
                if (e .eq. 0) then
                    goto 35
                else if (e .eq. 1) then
                    y(i) = 0
                    goto 80
                else if (e .eq. 2) then
                    ier = 1
                    ! goto 100        !! I skipped this error handling for simplicity.
                else if (e .eq. 3) then
                    if (arg .lt. tb) then
                        arg = tb
                    else
                        arg = te
                    endif
                endif
            endif
    
    c  search for knot interval t(l) <= arg < t(l+1)
     35     if ( t(l) <= arg .or. l1 == k2 ) go to 40
            l1 = l
            l = l - 1
            go to 35
      40    if ( arg < t(l1) .or. l == nk1 ) go to 50
            l = l1
            l1 = l + 1
            go to 40
    
    c  evaluate the non-zero b-splines at arg.
      50    call fpbspl(t, n, k, arg, l, h)
    
    c  find the value of s(x) at x=arg.
            sp = 0.0d0
            ll = l - k1
    
            do 60 j = 1, k1
              ll = ll + 1
              sp = sp + c(ll)*h(j)
      60    continue
            y(i) = sp
    
     80     continue
    
          enddo
    c$omp end parallel do
     100  return
          end
    

    Now rebuilding the package with gfortran -fopenmp and calling perf() above gives

    $ OMP_NUM_THREADS=1 julia interp.jl
      1.911112 seconds (174.20 k allocations: 277.414 MB, 3.49% gc time)
      1.599154 seconds (1.56 k allocations: 269.295 MB, 0.41% gc time)
      1.671308 seconds (1.56 k allocations: 269.295 MB, 0.28% gc time)
    
    $ OMP_NUM_THREADS=2 julia interp.jl
      1.308713 seconds (174.20 k allocations: 277.414 MB, 5.14% gc time)
      0.874616 seconds (1.56 k allocations: 269.295 MB, 0.46% gc time)
      0.897505 seconds (1.56 k allocations: 269.295 MB, 0.21% gc time)
    
    $ OMP_NUM_THREADS=4 julia interp.jl
      0.749203 seconds (174.20 k allocations: 277.414 MB, 9.31% gc time)
      0.446702 seconds (1.56 k allocations: 269.295 MB, 0.93% gc time)
      0.439522 seconds (1.56 k allocations: 269.295 MB, 0.43% gc time)
    
    $ OMP_NUM_THREADS=8 julia interp.jl
      0.478504 seconds (174.20 k allocations: 277.414 MB, 14.66% gc time)
      0.243258 seconds (1.56 k allocations: 269.295 MB, 1.81% gc time)
      0.233157 seconds (1.56 k allocations: 269.295 MB, 0.89% gc time)
    
    $ OMP_NUM_THREADS=16 julia interp.jl
      0.379243 seconds (174.20 k allocations: 277.414 MB, 19.02% gc time)
      0.129145 seconds (1.56 k allocations: 269.295 MB, 3.49% gc time)
      0.124497 seconds (1.56 k allocations: 269.295 MB, 1.80% gc time)
    
    # Julia: v0.4.0, Machine: Linux x86_64 (2.6GHz, Xeon2.60GHz, 16 cores)
    

    So the scaling looks fairly trivial (but please let me know if I am making a big mistake in using OpenMP this way...). If the above result is correct, it means that 8 threads are needed with this Fortran code to match the performance of interp1() on the OP's machine. But the good news is that there is probably room for improvement of the Fortran code (even for a serial run). Anyway, the speed of the OP's program (in its final form) seems to come down to comparing the performance of the underlying interpolation routines, i.e. interp1() in MATLAB vs splev() in Fortran.
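
    An untested alternative that stays on the Julia side would be to split the evaluation points across Julia threads instead of patching the Fortran source. This assumes that splev() can safely be called from several threads at once (not verified here) and uses the threading API of newer Julia versions rather than v0.4:

    using Dierckx
    using Base.Threads: @threads, nthreads

    # Untested sketch: evaluate the spline on chunks of x in parallel,
    # assuming the underlying Fortran routine tolerates concurrent calls.
    function threaded_evaluate( spl::Spline1D, x::Vector{Float64} )
        y = similar( x )
        chunks = collect( Iterators.partition( eachindex(x), cld( length(x), nthreads() ) ) )
        @threads for r in chunks
            y[r] = evaluate( spl, x[r] )   # each task evaluates its own slice
        end
        return y
    end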
