与Project Euler进行速度比较：C vs Python vs Erlang vs Haskell-Java 学习之路

611

我从Project Euler获取Problem #12作为编程练习，并比较我在C，Python，Erlang和Haskell中的（当然不是最优的）实现 . 为了获得更高的执行时间，我搜索了第一个三角形数字，其中有超过1000个除数而不是原始问题中所述的500 .

结果如下：

lorenzo@enzo:~/erlang$ gcc -lm -o euler12.bin euler12.c
lorenzo@enzo:~/erlang$ time ./euler12.bin
842161320

real    0m11.074s
user    0m11.070s
sys 0m0.000s

Python:

lorenzo@enzo:~/erlang$ time ./euler12.py 
842161320

real    1m16.632s
user    1m16.370s
sys 0m0.250s

Python with PyPy:

lorenzo@enzo:~/Downloads/pypy-c-jit-43780-b590cf6de419-linux64/bin$ time ./pypy /home/lorenzo/erlang/euler12.py 
842161320

real    0m13.082s
user    0m13.050s
sys 0m0.020s

Erlang:

lorenzo@enzo:~/erlang$ erlc euler12.erl 
lorenzo@enzo:~/erlang$ time erl -s euler12 solve
Erlang R13B03 (erts-5.7.4) [source] [64-bit] [smp:4:4] [rq:4] [async-threads:0] [hipe] [kernel-poll:false]

Eshell V5.7.4  (abort with ^G)
1> 842161320

real    0m48.259s
user    0m48.070s
sys 0m0.020s

Haskell:

lorenzo@enzo:~/erlang$ ghc euler12.hs -o euler12.hsx
[1 of 1] Compiling Main             ( euler12.hs, euler12.o )
Linking euler12.hsx ...
lorenzo@enzo:~/erlang$ time ./euler12.hsx 
842161320

real    2m37.326s
user    2m37.240s
sys 0m0.080s

Summary:

C：100％
Python：692％（使用PyPy时为118％）
Erlang：436％（135％归功于RichardC）
Haskell：1421％

我认为C有一个很大的优势，因为它使用long来计算，而不是任意长度整数作为其他三个 . 此外，它不需要首先加载运行时（其他人？） .

Question 1: Erlang，Python和Haskell由于使用任意长度整数而失去速度，或者只要值小于 MAXINT 就不会失败吗？

Question 2: 为什么Haskell这么慢？是否有编译器标志关闭刹车或是我的实施？（后者非常可能，因为Haskell是一本带有七个印章的书 . ）

Question 3: 您能否提供一些提示，如何优化这些实施而不改变我确定因素的方式？以任何方式进行优化：更好，更快，更多"native"语言 .

EDIT:

Question 4: 我的功能实现是否允许LCO（最后一次调用优化，a.k.a尾递归消除），从而避免在调用堆栈中添加不必要的帧？

我真的试图在四种语言中尽可能地实现相同的算法，尽管我不得不承认我的Haskell和Erlang知识非常有限 .

使用的源代码：

#include <stdio.h>
#include <math.h>

int factorCount (long n)
{
    double square = sqrt (n);
    int isquare = (int) square;
    int count = isquare == square ? -1 : 0;
    long candidate;
    for (candidate = 1; candidate <= isquare; candidate ++)
        if (0 == n % candidate) count += 2;
    return count;
}

int main ()
{
    long triangle = 1;
    int index = 1;
    while (factorCount (triangle) < 1001)
    {
        index ++;
        triangle += index;
    }
    printf ("%ld\n", triangle);
}

#! /usr/bin/env python3.2

import math

def factorCount (n):
    square = math.sqrt (n)
    isquare = int (square)
    count = -1 if isquare == square else 0
    for candidate in range (1, isquare + 1):
        if not n % candidate: count += 2
    return count

triangle = 1
index = 1
while factorCount (triangle) < 1001:
    index += 1
    triangle += index

print (triangle)

-module (euler12).
-compile (export_all).

factorCount (Number) -> factorCount (Number, math:sqrt (Number), 1, 0).

factorCount (_, Sqrt, Candidate, Count) when Candidate > Sqrt -> Count;

factorCount (_, Sqrt, Candidate, Count) when Candidate == Sqrt -> Count + 1;

factorCount (Number, Sqrt, Candidate, Count) ->
    case Number rem Candidate of
        0 -> factorCount (Number, Sqrt, Candidate + 1, Count + 2);
        _ -> factorCount (Number, Sqrt, Candidate + 1, Count)
    end.

nextTriangle (Index, Triangle) ->
    Count = factorCount (Triangle),
    if
        Count > 1000 -> Triangle;
        true -> nextTriangle (Index + 1, Triangle + Index + 1)  
    end.

solve () ->
    io:format ("~p~n", [nextTriangle (1, 1) ] ),
    halt (0).

factorCount number = factorCount' number isquare 1 0 - (fromEnum $ square == fromIntegral isquare)
    where square = sqrt $ fromIntegral number
          isquare = floor square

factorCount' number sqrt candidate count
    | fromIntegral candidate > sqrt = count
    | number `mod` candidate == 0 = factorCount' number sqrt (candidate + 1) (count + 2)
    | otherwise = factorCount' number sqrt (candidate + 1) count

nextTriangle index triangle
    | factorCount triangle > 1000 = triangle
    | otherwise = nextTriangle (index + 1) (triangle + index + 1)

main = print $ nextTriangle 1 1

18 回答

尝试GO：

package main

import "fmt"
import "math"

func main() {
    var n, m, c int
    for i := 1; ; i++ {
        n, m, c = i * (i + 1) / 2, int(math.Sqrt(float64(n))), 0
        for f := 1; f < m; f++ {
            if n % f == 0 { c++ }
    }
    c *= 2
    if m * m == n { c ++ }
    if c > 1001 {
        fmt.Println(n)
        break
        }
    }
}

我明白了：

原版c版本：9.1690 100%
去：8.2520 111%

但使用：

package main

import (
    "math"
    "fmt"
 )

// Sieve of Eratosthenes
func PrimesBelow(limit int) []int {
    switch {
        case limit < 2:
            return []int{}
        case limit == 2:
            return []int{2}
    }
    sievebound := (limit - 1) / 2
    sieve := make([]bool, sievebound+1)
    crosslimit := int(math.Sqrt(float64(limit))-1) / 2
    for i := 1; i <= crosslimit; i++ {
        if !sieve[i] {
            for j := 2 * i * (i + 1); j <= sievebound; j += 2*i + 1 {
                sieve[j] = true
            }
        }
    }
    plimit := int(1.3*float64(limit)) / int(math.Log(float64(limit)))
    primes := make([]int, plimit)
    p := 1
    primes[0] = 2
    for i := 1; i <= sievebound; i++ {
        if !sieve[i] {
            primes[p] = 2*i + 1
            p++
            if p >= plimit {
                break
            }
        }
    }
    last := len(primes) - 1
    for i := last; i > 0; i-- {
        if primes[i] != 0 {
            break
        }
        last = i
    }
    return primes[0:last]
}



func main() {
    fmt.Println(p12())
}
// Requires PrimesBelow from utils.go
func p12() int {
    n, dn, cnt := 3, 2, 0
    primearray := PrimesBelow(1000000)
    for cnt <= 1001 {
        n++
        n1 := n
        if n1%2 == 0 {
            n1 /= 2
        }
        dn1 := 1
        for i := 0; i < len(primearray); i++ {
            if primearray[i]*primearray[i] > n1 {
                dn1 *= 2
                break
            }
            exponent := 1
            for n1%primearray[i] == 0 {
                exponent++
                n1 /= primearray[i]
            }
            if exponent > 1 {
                dn1 *= exponent
            }
            if n1 == 1 {
                break
            }
        }
        cnt = dn * dn1
        dn = dn1
    }
    return n * (n - 1) / 2
}

我明白了：

原版c版本：9.1690 100%
thaumkid的c版本：0.1060 8650%
先去版本：8.2520 111%
第二版：0.0230 39865%

我也试过Python3.6和pypy3.3-5.5-alpha：

原c版：8.629 100%
thaumkid的c版本：0.109 7916%
Python3.6：54.795 16%
pypy3.3-5.5-alpha：13.291 65%

然后用我得到的以下代码：

原c版：8.629 100%
thaumkid的c版本：0.109 8650%
Python3.6：1.489 580%
pypy3.3-5.5-alpha：0.582 1483%

def D(N):
    if N == 1: return 1
    sqrtN = int(N ** 0.5)
    nf = 1
    for d in range(2, sqrtN + 1):
        if N % d == 0:
            nf = nf + 1
    return 2 * nf - (1 if sqrtN**2 == N else 0)

L = 1000
Dt, n = 0, 0

while Dt <= L:
    t = n * (n + 1) // 2
    Dt = D(n/2)*D(n+1) if n%2 == 0 else D(n)*D((n+1)/2)
    n = n + 1

print (t)

回复于 2024-04-25T22:45:12+08:00

13
使用Haskell，您实际上不需要明确地考虑递归 .
```
factorCount number = foldr factorCount' 0 [1..isquare] -
                     (fromEnum $ square == fromIntegral isquare)
    where
      square = sqrt $ fromIntegral number
      isquare = floor square
      factorCount' candidate
        | number `rem` candidate == 0 = (2 +)
        | otherwise = id

triangles :: [Int]
triangles = scanl1 (+) [1,2..]

main = print . head $ dropWhile ((< 1001) . factorCount) triangles
```
在上面的代码中，我已经用@Thomas的回答用常见的列表操作替换了显式递归 . 代码仍然完全相同，没有我们担心尾递归 . 它运行（~ 7.49s ）约 6% 慢于@Thomas'答案（~ 7.04s ）中使用GHC 7.6.2的机器上的版本，而@Raedwulf的C版运行~ 3.15s . 似乎GHC在过去一年中有所改善 .

PS . 我知道这是一个老问题，我从谷歌搜索中偶然发现它（我忘了我在搜索，现在......） . 只是想对LCO的问题发表评论，并表达我对Haskell的总体感受 . 我想对最佳答案发表评论，但评论不允许代码块 .
回复于 2024-04-25T22:45:12+08:00
30
C版本的更多数字和解释 . 这些年来，显然没有人这样做过 . 请记住，请回答这个问题，以便每个人都能看到和学习 .

第一步：作者计划的基准

笔记本规格：
- CPU i3 M380（931 MHz - 最大电池节省模式）
- 4GB内存
- Win7 64位
- Microsoft Visual Studio 2012 Ultimate
- Cygwin与gcc 4.9.3
- Python 2.7.10
命令：
```
compiling on VS x64 command prompt > `for /f %f in ('dir /b *.c') do cl /O2 /Ot /Ox %f -o %f_x64_vs2012.exe`
compiling on cygwin with gcc x64   > `for f in ./*.c; do gcc -m64 -O3 $f -o ${f}_x64_gcc.exe ; done`
time (unix tools) using cygwin > `for f in ./*.exe; do  echo "----------"; echo $f ; time $f ; done`
```
.
```
----------
$ time python ./original.py

real    2m17.748s
user    2m15.783s
sys     0m0.093s
----------
$ time ./original_x86_vs2012.exe

real    0m8.377s
user    0m0.015s
sys     0m0.000s
----------
$ time ./original_x64_vs2012.exe

real    0m8.408s
user    0m0.000s
sys     0m0.015s
----------
$ time ./original_x64_gcc.exe

real    0m20.951s
user    0m20.732s
sys     0m0.030s
```
文件名是： integertype_architecture_compiler.exe
- integertype 与现在的原始程序相同（稍后会详细介绍）
- architecture 是x86或x64，具体取决于编译器设置
- compiler 是gcc或vs2012
第二步：再次调查，改进和基准

VS比gcc快250％ . 两个编译器应该提供类似的速度 . 显然，代码或编译器选项有问题 . 我们来调查吧！

第一个兴趣点是整数类型 . 转换可能很昂贵，而且一致性对于更好的代码生成和优化非常重要 . 所有整数应该是相同的类型 .

现在是 int 和 long 的混乱 . 我们're going to improve that. What type to use? The fastest. Gotta benchmark them'全部！
```
----------
$ time ./int_x86_vs2012.exe

real    0m8.440s
user    0m0.016s
sys     0m0.015s
----------
$ time ./int_x64_vs2012.exe

real    0m8.408s
user    0m0.016s
sys     0m0.015s
----------
$ time ./int32_x86_vs2012.exe

real    0m8.408s
user    0m0.000s
sys     0m0.015s
----------
$ time ./int32_x64_vs2012.exe

real    0m8.362s
user    0m0.000s
sys     0m0.015s
----------
$ time ./int64_x86_vs2012.exe

real    0m18.112s
user    0m0.000s
sys     0m0.015s
----------
$ time ./int64_x64_vs2012.exe

real    0m18.611s
user    0m0.000s
sys     0m0.015s
----------
$ time ./long_x86_vs2012.exe

real    0m8.393s
user    0m0.015s
sys     0m0.000s
----------
$ time ./long_x64_vs2012.exe

real    0m8.440s
user    0m0.000s
sys     0m0.015s
----------
$ time ./uint32_x86_vs2012.exe

real    0m8.362s
user    0m0.000s
sys     0m0.015s
----------
$ time ./uint32_x64_vs2012.exe

real    0m8.393s
user    0m0.015s
sys     0m0.015s
----------
$ time ./uint64_x86_vs2012.exe

real    0m15.428s
user    0m0.000s
sys     0m0.015s
----------
$ time ./uint64_x64_vs2012.exe

real    0m15.725s
user    0m0.015s
sys     0m0.015s
----------
$ time ./int_x64_gcc.exe

real    0m8.531s
user    0m8.329s
sys     0m0.015s
----------
$ time ./int32_x64_gcc.exe

real    0m8.471s
user    0m8.345s
sys     0m0.000s
----------
$ time ./int64_x64_gcc.exe

real    0m20.264s
user    0m20.186s
sys     0m0.015s
----------
$ time ./long_x64_gcc.exe

real    0m20.935s
user    0m20.809s
sys     0m0.015s
----------
$ time ./uint32_x64_gcc.exe

real    0m8.393s
user    0m8.346s
sys     0m0.015s
----------
$ time ./uint64_x64_gcc.exe

real    0m16.973s
user    0m16.879s
sys     0m0.030s
```
整数类型是 int long int32_t uint32_t int64_t 和 uint64_t 来自 #include <stdint.h>

C中有很多整数类型，还有一些有符号/无符号可供使用，还有编译为x86或x64的选择（不要与实际的整数大小混淆） . 这是很多版本来编译和运行^^

第三步：理解数字

明确的结论：
- 32 bits integers are ~200% faster than 64 bits equivalents
- unsigned 64 bits 整数比25％快 signed 64 bits （不幸的是，我没有解释）
特技提问："What are the sizes of int and long in C?"
正确答案是： The size of int and long in C are not well-defined!

From the C spec:

int至少32位长至少是一个int

From the gcc man page (-m32 and -m64 flags):

32位环境设置int，long和指向32位的指针，并生成在任何i386系统上运行的代码 . 64位环境将int设置为32位且长，指向64位，并为AMD的x86-64架构生成代码 .

From MSDN documentation (Data Type Ranges) https://msdn.microsoft.com/en-us/library/s3f49ktz%28v=vs.110%29.aspx :

int，4个字节，也知道为signed long，4个字节，也知道long int和signed long int

总结：经验教训
- 32位整数比64位整数快 .
- 标准整数类型在C和C中没有很好地定义，它们根据编译器和体系结构而有所不同 . 当您需要一致性和可预测性时，请使用 #include <stdint.h> 中的 uint32_t 整数族 .
- 速度问题已解决 . 所有其他语言都落后了百分之百，C＆C再次获胜！他们总是那样做 . 下一个改进将是使用OpenMP的多线程：D
回复于 2024-04-25T22:45:12+08:00

通过使用Haskell包中的一些函数，可以大大加快Haskell的实现 . 在这种情况下，我使用了primes，它只是安装了'cabal install primes';）

import Data.Numbers.Primes
import Data.List

triangleNumbers = scanl1 (+) [1..]
nDivisors n = product $ map ((+1) . length) (group (primeFactors n))
answer = head $ filter ((> 500) . nDivisors) triangleNumbers

main :: IO ()
main = putStrLn $ "First triangle number to have over 500 divisors: " ++ (show answer)

时序：

你原来的节目：

PS> measure-command { bin\012_slow.exe }

TotalSeconds      : 16.3807409
TotalMilliseconds : 16380.7409

改进实施

PS> measure-command { bin\012.exe }

TotalSeconds      : 0.0383436
TotalMilliseconds : 38.3436

正如你所看到的，这个在同一台机器上运行38毫秒，你的运行时间为16秒:)

编译命令：

ghc -O2 012.hs -o bin\012.exe
ghc -O2 012_slow.hs -o bin\012_slow.exe

回复于 2024-04-25T22:45:12+08:00

我假设如果涉及的数字有很多小因素，那么因子的数量才会很大 . 所以我使用了thaumkid的优秀算法，但是首先使用的因子计数的近似值永远不会太小 . 这很简单：检查最高为29的素数因子，然后检查剩余数量并计算因子数的上限 . 使用此值计算因子数的上限，如果该数字足够高，则计算确切的因子数 .

下面的代码不需要这个假设的正确性，但要快 . 它似乎工作;只有大约100,000个数字中的一个给出了足够高的估计值，需要进行全面检查 .

这是代码：

// Return at least the number of factors of n.
static uint64_t approxfactorcount (uint64_t n)
{
    uint64_t count = 1, add;

#define CHECK(d)                            \
    do {                                    \
        if (n % d == 0) {                   \
            add = count;                    \
            do { n /= d; count += add; }    \
            while (n % d == 0);             \
        }                                   \
    } while (0)

    CHECK ( 2); CHECK ( 3); CHECK ( 5); CHECK ( 7); CHECK (11); CHECK (13);
    CHECK (17); CHECK (19); CHECK (23); CHECK (29);
    if (n == 1) return count;
    if (n < 1ull * 31 * 31) return count * 2;
    if (n < 1ull * 31 * 31 * 37) return count * 4;
    if (n < 1ull * 31 * 31 * 37 * 37) return count * 8;
    if (n < 1ull * 31 * 31 * 37 * 37 * 41) return count * 16;
    if (n < 1ull * 31 * 31 * 37 * 37 * 41 * 43) return count * 32;
    if (n < 1ull * 31 * 31 * 37 * 37 * 41 * 43 * 47) return count * 64;
    if (n < 1ull * 31 * 31 * 37 * 37 * 41 * 43 * 47 * 53) return count * 128;
    if (n < 1ull * 31 * 31 * 37 * 37 * 41 * 43 * 47 * 53 * 59) return count * 256;
    if (n < 1ull * 31 * 31 * 37 * 37 * 41 * 43 * 47 * 53 * 59 * 61) return count * 512;
    if (n < 1ull * 31 * 31 * 37 * 37 * 41 * 43 * 47 * 53 * 59 * 61 * 67) return count * 1024;
    if (n < 1ull * 31 * 31 * 37 * 37 * 41 * 43 * 47 * 53 * 59 * 61 * 67 * 71) return count * 2048;
    if (n < 1ull * 31 * 31 * 37 * 37 * 41 * 43 * 47 * 53 * 59 * 61 * 67 * 71 * 73) return count * 4096;
    return count * 1000000;
}

// Return the number of factors of n.
static uint64_t factorcount (uint64_t n)
{
    uint64_t count = 1, add;

    CHECK (2); CHECK (3);

    uint64_t d = 5, inc = 2;
    for (; d*d <= n; d += inc, inc = (6 - inc))
        CHECK (d);

    if (n > 1) count *= 2; // n must be a prime number
    return count;
}

// Prints triangular numbers with record numbers of factors.
static void printrecordnumbers (uint64_t limit)
{
    uint64_t record = 30000;

    uint64_t count1, factor1;
    uint64_t count2 = 1, factor2 = 1;

    for (uint64_t n = 1; n <= limit; ++n)
    {
        factor1 = factor2;
        count1 = count2;

        factor2 = n + 1; if (factor2 % 2 == 0) factor2 /= 2;
        count2 = approxfactorcount (factor2);

        if (count1 * count2 > record)
        {
            uint64_t factors = factorcount (factor1) * factorcount (factor2);
            if (factors > record)
            {
                printf ("%lluth triangular number = %llu has %llu factors\n", n, factor1 * factor2, factors);
                record = factors;
            }
        }
    }
}

这在大约0.7秒内找到了14,753,024个三角形，13824个因子，在34秒内找到了879,207,615个三角形数字，61,440个因子，在10分5秒内找到了12,524,486,975个三角形数字，138,240个因子，以及含有172,032个因子的26,467,792,064个三角形数字21分25秒（2.4GHz Core2 Duo），所以这段代码平均每个数字只需要116个处理器周期 . 最后一个三角形数字本身大于2 ^ 68，所以

回复于 2024-04-25T22:45:12+08:00

C++11, < 20ms for me - Run it here

我知道你需要提示来帮助提高你的语言知识，但是由于这里有很好的介绍，我想我会为那些可能已经查看了mathematica对你的问题的评论等的人添加一些背景，并想知道为什么这个代码太慢了 .

This answer is mainly to provide context to hopefully help people evaluate the code in your question / other answers more easily.

此代码仅使用几个（丑陋的）优化，与所使用的语言无关，基于：

每个traingle数的形式为n（n 1）/ 2
n和n 1是互质的
除数的数量是乘法函数

#include <iostream>
#include <cmath>
#include <tuple>
#include <chrono>

using namespace std;

// Calculates the divisors of an integer by determining its prime factorisation.

int get_divisors(long long n)
{
    int divisors_count = 1;

    for(long long i = 2;
        i <= sqrt(n);
        /* empty */)
    {
        int divisions = 0;
        while(n % i == 0)
        {
            n /= i;
            divisions++;
        }

        divisors_count *= (divisions + 1);

        //here, we try to iterate more efficiently by skipping
        //obvious non-primes like 4, 6, etc
        if(i == 2)
            i++;
        else
            i += 2;
    }

    if(n != 1) //n is a prime
        return divisors_count * 2;
    else
        return divisors_count;
}

long long euler12()
{
    //n and n + 1
    long long n, n_p_1;

    n = 1; n_p_1 = 2;

    // divisors_x will store either the divisors of x or x/2
    // (the later iff x is divisible by two)
    long long divisors_n = 1;
    long long divisors_n_p_1 = 2;

    for(;;)
    {
        /* This loop has been unwound, so two iterations are completed at a time
         * n and n + 1 have no prime factors in common and therefore we can
         * calculate their divisors separately
         */

        long long total_divisors;                 //the divisors of the triangle number
                                                  // n(n+1)/2

        //the first (unwound) iteration

        divisors_n_p_1 = get_divisors(n_p_1 / 2); //here n+1 is even and we

        total_divisors =
                  divisors_n
                * divisors_n_p_1;

        if(total_divisors > 1000)
            break;

        //move n and n+1 forward
        n = n_p_1;
        n_p_1 = n + 1;

        //fix the divisors
        divisors_n = divisors_n_p_1;
        divisors_n_p_1 = get_divisors(n_p_1);   //n_p_1 is now odd!

        //now the second (unwound) iteration

        total_divisors =
                  divisors_n
                * divisors_n_p_1;

        if(total_divisors > 1000)
            break;

        //move n and n+1 forward
        n = n_p_1;
        n_p_1 = n + 1;

        //fix the divisors
        divisors_n = divisors_n_p_1;
        divisors_n_p_1 = get_divisors(n_p_1 / 2);   //n_p_1 is now even!
    }

    return (n * n_p_1) / 2;
}

int main()
{
    for(int i = 0; i < 1000; i++)
    {
        using namespace std::chrono;
        auto start = high_resolution_clock::now();
        auto result = euler12();
        auto end = high_resolution_clock::now();

        double time_elapsed = duration_cast<milliseconds>(end - start).count();

        cout << result << " " << time_elapsed << '\n';
    }
    return 0;
}

我的桌面平均需要19毫秒，笔记本电脑需要80毫秒，这与我在这里看到的大多数其他代码相差甚远 . 毫无疑问，仍有许多优化措施可供使用 .

回复于 2024-04-25T22:45:12+08:00

742
问题1：由于使用任意长度整数，erlang，python和haskell的速度是否松动，或者只要值小于MAXINT，它们是否为空？

这不太可能 . 我不能多说Erlang和Haskell（好吧，也许有点关于下面的Haskell）但我可以指出Python中的许多其他瓶颈 . 每次程序尝试使用Python中的某些值执行操作时，它应该验证值是否来自正确的类型，并且它花费了一些时间 . 你的 factorCount 函数只是分配一个 range (1, isquare + 1) 不同时间的列表，并且运行时， malloc -styled内存分配比使用计数器的范围迭代慢得多，就像在C中一样 . 值得注意的是， factorCount() 被多次调用，所以分配了很多列表另外，我们不要忘记解释Python并且CPython解释器没有很好地专注于优化 .

EDIT ：哦，好吧，我注意到你使用的是Python 3，所以 range() 不返回列表，而是返回一个生成器 . 在这种情况下，我关于分配列表的观点是错误的：函数只是分配 range 对象，但这些对象效率低，但不如分配包含大量项目的列表效率低 .

问题2：为什么haskell这么慢？是否有编译器标志关闭刹车或是我的实施？（后者很有可能因为haskell是一本带有七个印章的书 . ）

你在用Hugs吗？拥抱是一个相当慢的翻译 . 如果你正在使用它，也许你可以用GHC获得更好的时间 - 但我只是在思考hypotesis，一个好的Haskell编译器做的东西是非常迷人的，超出了我的理解:)

问题3：您能否提供一些提示，如何优化这些实施而不改变我确定因素的方式？以任何方式进行优化：更好，更快，更“本地”的语言 .

我会说你正在玩一个不完整的游戏 . 了解各种语言的最好的部分是尽可能以最不同的方式使用它们:)但我离题了，我对这一点没有任何建议 . 对不起，我希望有人能在这种情况下帮助你:)

问题4：我的功能实现是否允许LCO，从而避免在调用堆栈中添加不必要的帧？

据我记忆，你只需要确保你的递归调用是返回值之前的最后一个命令 . 换句话说，像下面这样的函数可以使用这样的优化：
```
def factorial(n, acc=1):
    if n > 1:
        acc = acc * n
        n = n - 1
        return factorial(n, acc)
    else:
        return acc
```
但是，如果您的函数如下所示，则不会有这样的优化，因为在递归调用之后有一个操作（乘法）：
```
def factorial2(n):
    if n > 1:
        f = factorial2(n-1)
        return f*n
    else:
        return 1
```
我将一些局部变量中的操作分开，以便清楚执行哪些操作 . 但是，最常见的是看到如下所示的这些功能，但它们与我所提出的点相同：
```
def factorial(n, acc=1):
    if n > 1:
        return factorial(n-1, acc*n)
    else:
        return acc

def factorial2(n):
    if n > 1:
        return n*factorial(n-1)
    else:
        return 1
```
请注意，由编译器/解释器决定是否进行尾递归 . 例如，如果我记得很清楚，Python解释器就不会这样做（我在我的例子中使用Python只是因为它的语法流畅） . 无论如何，如果你发现奇怪的东西，如带有两个参数的阶乘函数（其中一个参数的名称如 acc ， accumulator 等）现在你知道为什么人们这样做:)
回复于 2024-04-25T22:45:12+08:00
5

看看你的Erlang实现 . 时机包括启动整个虚拟机，运行程序和暂停虚拟机 . 我很确定设置和停止erlang vm需要一些时间 .

如果时间是在erlang虚拟机本身内完成的，那么结果会有所不同，因为在这种情况下，我们只有相关程序的实际时间 . 否则，我相信启动和加载Erlang Vm的过程所花费的总时间加上停止它的过程（当你把它放在你的程序中时）都包含在你用来计时的方法的总时间内 . 程序正在输出 . 考虑使用我们在虚拟机本身计算时间时使用的erlang计时本身timer:tc/1 or timer:tc/2 or timer:tc/3 . 这样，来自erlang的结果将排除启动和停止/终止/暂停虚拟机所花费的时间 . 这是我在那里的推理，想一想，然后再次尝试你的替补标记 .

我实际上建议我们尝试在这些语言的运行时间内为程序（对于具有运行时的语言）计时，以获得精确的值 . 例如，C和Erlang，Python和Haskell一样没有开始和关闭运行时系统的开销（98％肯定这个 - 我站在校正中） . 所以（根据这个推理）我最后说，这个基准测试对运行在运行时系统之上的语言来说不够精确/公平 . 让我们再次尝试这些变化 .

编辑：除非所有语言都有运行时系统，否则启动每个语言并暂停它的开销会有所不同 . 所以我建议我们从运行时系统内部开始计时（对于适用的语言） . 众所周知，Erlang VM在启动时会有相当大的开销！

回复于 2024-04-25T22:45:12+08:00
150
Erlang实现存在一些问题 . 作为以下内容的基线，我测量的未修改Erlang程序的执行时间为47.6秒，而C代码为12.7秒 .

如果要运行计算密集型Erlang代码，首先应该使用本机代码 . 使用 erlc +native euler12 进行编译时间缩短至41.3秒 . 然而，这种代码的本机编译速度要低得多（仅为15％），问题在于你使用了 -compile(export_all) . 这对于实验很有用，但是所有函数都可以从外部访问的事实导致本机编译器非常保守 . （正常的BEAM模拟器受影响不大 . ）用 -export([solve/0]). 替换此声明可以提供更好的加速：31.5秒（距离基线几乎35％） .

但是代码本身存在一个问题：对于factorCount循环中的每次迭代，您都执行此测试：
```
factorCount (_, Sqrt, Candidate, Count) when Candidate == Sqrt -> Count + 1;
```
C代码不会这样做 . 一般来说，在相同代码的不同实现之间进行公平比较可能会很棘手，特别是如果算法是数字的，因为您需要确保它们实际上在做同样的事情 . 由于某些类型转换导致一个实现中的轻微舍入错误可能导致它比另一个执行更多迭代，即使两者最终都达到相同的结果 .

消除这种可能性错误源（并在每次迭代中摆脱额外的测试），我重写了factorCount函数如下，紧密模仿C代码：
```
factorCount (N) ->
    Sqrt = math:sqrt (N),
    ISqrt = trunc(Sqrt),
    if ISqrt == Sqrt -> factorCount (N, ISqrt, 1, -1);
       true          -> factorCount (N, ISqrt, 1, 0)
    end.

factorCount (_N, ISqrt, Candidate, Count) when Candidate > ISqrt -> Count;
factorCount ( N, ISqrt, Candidate, Count) ->
    case N rem Candidate of
        0 -> factorCount (N, ISqrt, Candidate + 1, Count + 2);
        _ -> factorCount (N, ISqrt, Candidate + 1, Count)
    end.
```
这个重写，没有 export_all 和本机编译，给了我以下运行时间：
```
$ erlc +native euler12.erl
$ time erl -noshell -s euler12 solve
842161320

real    0m19.468s
user    0m19.450s
sys 0m0.010s
```
这与C代码相比并不算太糟糕：
```
$ time ./a.out 
842161320

real    0m12.755s
user    0m12.730s
sys 0m0.020s
```
考虑到Erlang完全不是为了编写数字代码，在像这样的程序上只比C慢50％是非常好的 .

最后，关于你的问题：

Question 1: Do erlang, python and haskell loose speed due to using arbitrary length integers or don't they as long as the values are less than MAXINT?

是的，有点 . 在Erlang中，没有办法说"use 32/64-bit arithmetic with wrap-around"，所以除非编译器可以证明你的整数有一些边界（而且它通常不能），它必须检查所有的计算，看看它们是否适合单个标记的单词或者它是否适合必须将它们变成堆分配的bignums . 即使在运行中没有经常使用bignums，也必须执行这些检查 . 另一方面，这意味着您知道算法永远不会因为意外的整数环绕而失败，如果您突然给它提供比以前更大的输入 .

Question 4: Do my functional implementations permit LCO and hence avoid adding unnecessary frames onto the call stack?

是的，关于上次呼叫优化，您的Erlang代码是正确的 .
回复于 2024-04-25T22:45:12+08:00

#include <stdio.h>
#include <math.h>

int factorCount (long n)
{
    double square = sqrt (n);
    int isquare = (int) square+1;
    long candidate = 2;
    int count = 1;
    while(candidate <= isquare && candidate<=n){
        int c = 1;
        while (n % candidate == 0) {
           c++;
           n /= candidate;
        }
        count *= c;
        candidate++;
    }
    return count;
}

int main ()
{
    long triangle = 1;
    int index = 1;
    while (factorCount (triangle) < 1001)
    {
        index ++;
        triangle += index;
    }
    printf ("%ld\n", triangle);
}

gcc -lm -Ofast euler.c

时间./a.out

2.79s用户0.00s系统99％cpu 2.794总计

回复于 2024-04-25T22:45:12+08:00

0

看看this blog . 在过去一年左右的时间里，他通常发现Haskell要快得多 . 我认为在这些语言之间它更多地与你的流畅性和编码风格有关 .

说到Python速度，你使用了错误的实现！试试PyPy，对于这样的事情，你会发现它要快得多 .

回复于 2024-04-25T22:45:12+08:00

关于Python优化，除了使用PyPy（对于代码无变化的非常令人印象深刻的加速），您可以使用PyPy的translation toolchain来编译兼容RPython的版本，或者Cython来构建扩展模块，两者都是使用Cython模块 nearly twice as fast ，在测试中比C版本更快 . 作为参考，我还包括C和PyPy基准测试结果：

C（用 gcc -O3 -lm 编译）

% time ./euler12-c 
842161320

./euler12-c  11.95s 
 user 0.00s 
 system 99% 
 cpu 11.959 total

PyPy 1.5

% time pypy euler12.py
842161320
pypy euler12.py  
16.44s user 
0.01s system 
99% cpu 16.449 total

RPython（使用最新的PyPy修订版， c2f583445aee ）

% time ./euler12-rpython-c
842161320
./euler12-rpy-c  
10.54s user 0.00s 
system 99% 
cpu 10.540 total

Cython 0.15

% time python euler12-cython.py
842161320
python euler12-cython.py  
6.27s user 0.00s 
system 99% 
cpu 6.274 total

RPython版本有几个关键的变化 . 要转换为独立程序，您需要定义 target ，在本例中为 main 函数 . 它应该接受 sys.argv 作为它唯一的参数，并且需要返回一个int . 您可以使用translate.py， % translate.py euler12-rpython.py 进行翻译，该翻译为C并为您编译 .

# euler12-rpython.py

import math, sys

def factorCount(n):
    square = math.sqrt(n)
    isquare = int(square)
    count = -1 if isquare == square else 0
    for candidate in xrange(1, isquare + 1):
        if not n % candidate: count += 2
    return count

def main(argv):
    triangle = 1
    index = 1
    while factorCount(triangle) < 1001:
        index += 1
        triangle += index
    print triangle
    return 0

if __name__ == '__main__':
    main(sys.argv)

def target(*args):
    return main, None

Cython版本被重写为扩展模块 _euler12.pyx ，我从普通的python文件导入和调用 . _euler12.pyx 与您的版本基本相同，还有一些额外的静态类型声明 . setup.py具有使用 python setup.py build_ext --inplace 构建扩展的常规样板 .

# _euler12.pyx
from libc.math cimport sqrt

cdef int factorCount(int n):
    cdef int candidate, isquare, count
    cdef double square
    square = sqrt(n)
    isquare = int(square)
    count = -1 if isquare == square else 0
    for candidate in range(1, isquare + 1):
        if not n % candidate: count += 2
    return count

cpdef main():
    cdef int triangle = 1, index = 1
    while factorCount(triangle) < 1001:
        index += 1
        triangle += index
    print triangle

# euler12-cython.py
import _euler12
_euler12.main()

# setup.py
from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

ext_modules = [Extension("_euler12", ["_euler12.pyx"])]

setup(
  name = 'Euler12-Cython',
  cmdclass = {'build_ext': build_ext},
  ext_modules = ext_modules
)

老实说，我对RPython或Cython的经验很少，并对结果感到惊喜 . 如果您正在使用CPython，那么在Cython扩展模块中编写CPU密集型代码似乎是优化程序的一种非常简单的方法 .

回复于 2024-04-25T22:45:12+08:00

8
问题3：您能否提供一些提示，如何优化这些实施而不改变我确定因素的方式？以任何方式进行优化：更好，更快，更“本地”的语言 .

C实现不是最理想的（正如Thomas M. DuBuisson所暗示的），该版本使用64位整数（即 long 数据类型） . 我'll investigate the assembly listing later, but with an educated guess, there are some memory accesses going on in the compiled code, which make using 64-bit integers significantly slower. It' s那个或生成的代码（事实上，你可以在SSE寄存器中使用更少的64位整数或将双精度数转换为64位整数更慢） .

这是修改后的代码（简单地用int替换long，我明确地内联了factorCount，尽管我不认为这对gcc -O3是必要的）：
```
#include <stdio.h>
#include <math.h>

static inline int factorCount(int n)
{
    double square = sqrt (n);
    int isquare = (int)square;
    int count = isquare == square ? -1 : 0;
    int candidate;
    for (candidate = 1; candidate <= isquare; candidate ++)
        if (0 == n % candidate) count += 2;
    return count;
}

int main ()
{
    int triangle = 1;
    int index = 1;
    while (factorCount (triangle) < 1001)
    {
        index++;
        triangle += index;
    }
    printf ("%d\n", triangle);
}
```
运行时间给出：
```
$ gcc -O3 -lm -o euler12 euler12.c; time ./euler12
842161320
./euler12  2.95s user 0.00s system 99% cpu 2.956 total
```
作为参考，托马斯在早期答案中的haskell实现给出了：
```
$ ghc -O2 -fllvm -fforce-recomp euler12.hs; time ./euler12                                                                                      [9:40]
[1 of 1] Compiling Main             ( euler12.hs, euler12.o )
Linking euler12 ...
842161320
./euler12  9.43s user 0.13s system 99% cpu 9.602 total
```
结论：没有什么可以远离ghc，它是一个很好的编译器，但是gcc通常会生成更快的代码 .
回复于 2024-04-25T22:45:12+08:00
54

变化： case (divisor(T,round(math:sqrt(T))) > 500) of

致： case (divisor(T,round(math:sqrt(T))) > 1000) of

这将为Erlang多进程示例生成正确答案 .

回复于 2024-04-25T22:45:12+08:00

我将“Jannich Brendle”版本修改为1000而不是500.并列出了euler12.bin，euler12.erl，p12dist.erl的结果 . 两个erl代码都使用“native”进行编译 .

zhengs-MacBook-Pro:workspace zhengzhibin$ time erl -noshell -s p12dist start
The result is: 842161320.

real    0m3.879s
user    0m14.553s
sys     0m0.314s
zhengs-MacBook-Pro:workspace zhengzhibin$ time erl -noshell -s euler12 solve
842161320

real    0m10.125s
user    0m10.078s
sys     0m0.046s
zhengs-MacBook-Pro:workspace zhengzhibin$ time ./euler12.bin 
842161320

real    0m5.370s
user    0m5.328s
sys     0m0.004s
zhengs-MacBook-Pro:workspace zhengzhibin$

回复于 2024-04-25T22:45:12+08:00

211

纯娱乐 . 以下是更“原生”的Haskell实现：

import Control.Applicative
import Control.Monad
import Data.Either
import Math.NumberTheory.Powers.Squares

isInt :: RealFrac c => c -> Bool
isInt = (==) <$> id <*> fromInteger . round

intSqrt :: (Integral a) => a -> Int
--intSqrt = fromIntegral . floor . sqrt . fromIntegral
intSqrt = fromIntegral . integerSquareRoot'

factorize :: Int -> [Int]
factorize 1 = []
factorize n = first : factorize (quot n first)
  where first = (!! 0) $ [a | a <- [2..intSqrt n], rem n a == 0] ++ [n]

factorize2 :: Int -> [(Int,Int)]
factorize2 = foldl (\ls@((val,freq):xs) y -> if val == y then (val,freq+1):xs else (y,1):ls) [(0,0)] . factorize

numDivisors :: Int -> Int
numDivisors = foldl (\acc (_,y) -> acc * (y+1)) 1 <$> factorize2

nextTriangleNumber :: (Int,Int) -> (Int,Int)
nextTriangleNumber (n,acc) = (n+1,acc+n+1)

forward :: Int -> (Int, Int) -> Either (Int, Int) (Int, Int)
forward k val@(n,acc) = if numDivisors acc > k then Left val else Right (nextTriangleNumber val)

problem12 :: Int -> (Int, Int)
problem12 n = (!!0) . lefts . scanl (>>=) (forward n (1,1)) . repeat . forward $ n

main = do
  let (n,val) = problem12 1000
  print val

使用 ghc -O3 ，这在我的机器（1.73GHz Core i7）上持续运行0.55-0.58秒 .

C版本更有效的factorCount函数：

int factorCount (int n)
{
  int count = 1;
  int candidate,tmpCount;
  while (n % 2 == 0) {
    count++;
    n /= 2;
  }
    for (candidate = 3; candidate < n && candidate * candidate < n; candidate += 2)
    if (n % candidate == 0) {
      tmpCount = 1;
      do {
        tmpCount++;
        n /= candidate;
      } while (n % candidate == 0);
       count*=tmpCount;
      }
  if (n > 1)
    count *= 2;
  return count;
}

使用 gcc -O3 -lm 将long更改为in中的int，这始终在0.31-0.35秒内运行 .

如果你利用第n个三角形数= n *（n 1）/ 2，并且n和（n 1）具有完全不同的主要因子分解这一事实，两者都可以更快地运行，因此每个因子的数量可以乘以一半来查找整体因子的数量 . 下列：

int main ()
{
  int triangle = 0,count1,count2 = 1;
  do {
    count1 = count2;
    count2 = ++triangle % 2 == 0 ? factorCount(triangle+1) : factorCount((triangle+1)/2);
  } while (count1*count2 < 1001);
  printf ("%lld\n", ((long long)triangle)*(triangle+1)/2);
}

会减少c代码运行时间为0.17-0.19秒，它可以处理更大的搜索 - 在我的机器上大于10000个因子大约需要43秒 . 我给感兴趣的读者留下了类似的haskell加速 .

回复于 2024-04-25T22:45:12+08:00

问题1：Erlang，Python和Haskell是否由于使用任意长度整数而失去速度，或者只要值小于MAXINT就不会失败？

问题一可以回答Erlang的否定 . 最后一个问题是通过适当使用Erlang来回答的，如：

http://bredsaal.dk/learning-erlang-using-projecteuler-net

由于它比你最初的C例子快，我猜它会有很多问题，因为其他人已经详细介绍了 .

这个Erlang模块在大约5秒钟内在便宜的上网本上执行......它使用erlang中的网络线程模型，并且演示如何利用事件模型 . 它可以分布在许多节点上 . 它很快 . 不是我的代码 .

-module(p12dist).  
-author("Jannich Brendle, jannich@bredsaal.dk, http://blog.bredsaal.dk").  
-compile(export_all).

server() ->  
  server(1).

server(Number) ->  
  receive {getwork, Worker_PID} -> Worker_PID ! {work,Number,Number+100},  
  server(Number+101);  
  {result,T} -> io:format("The result is: \~w.\~n", [T]);  
  _ -> server(Number)  
  end.

worker(Server_PID) ->  
  Server_PID ! {getwork, self()},  
  receive {work,Start,End} -> solve(Start,End,Server_PID)  
  end,  
  worker(Server_PID).

start() ->  
  Server_PID = spawn(p12dist, server, []),  
  spawn(p12dist, worker, [Server_PID]),  
  spawn(p12dist, worker, [Server_PID]),  
  spawn(p12dist, worker, [Server_PID]),  
  spawn(p12dist, worker, [Server_PID]).

solve(N,End,_) when N =:= End -> no_solution;

solve(N,End,Server_PID) ->  
  T=round(N*(N+1)/2),
  case (divisor(T,round(math:sqrt(T))) > 500) of  
    true ->  
      Server_PID ! {result,T};  
    false ->  
      solve(N+1,End,Server_PID)  
  end.

divisors(N) ->  
  divisor(N,round(math:sqrt(N))).

divisor(_,0) -> 1;  
divisor(N,I) ->  
  case (N rem I) =:= 0 of  
  true ->  
    2+divisor(N,I-1);  
  false ->  
    divisor(N,I-1)  
  end.

下面的测试发生在：Intel（R）Atom（TM）CPU N270 @ 1.60GHz

~$ time erl -noshell -s p12dist start

The result is: 76576500.

^C

BREAK: (a)bort (c)ontinue (p)roc info (i)nfo (l)oaded
       (v)ersion (k)ill (D)b-tables (d)istribution
a

real    0m5.510s
user    0m5.836s
sys 0m0.152s

回复于 2024-04-25T22:45:12+08:00

66
在x86_64 Core2 Duo（2.5GHz）机器上使用 GHC 7.0.3 ， gcc 4.4.6 ， Linux 2.6.29 ，使用 ghc -O2 -fllvm -fforce-recomp 进行Haskell编译，使用 gcc -O3 -lm 进行C编译 .
- 您的C例程在8.4秒内运行（比运行速度快，可能是因为 -O3 ）
- Haskell解决方案在36秒内运行（由于 -O2 标志）
- 你的 factorCount' 代码没有明确输入并且默认为 Integer （感谢Daniel在这里纠正我的误诊！） . 使用 Int 给出显式类型签名（无论如何都是标准做法），时间更改为 11.1 seconds
  在 factorCount'
- 你不必要地叫 fromIntegral . 修复导致没有变化（编译器很聪明，很幸运） .
- 您使用 mod ，其中 rem 更快更充足 . 这会将时间更改为 8.5 seconds .
- factorCount' 不断应用两个永不改变的额外参数（ number ， sqrt ） . Worker /包装器转换为我们提供：
```
$ time ./so
 842161320  

 real    0m7.954s  
 user    0m7.944s  
 sys     0m0.004s
```
那是对的， 7.95 seconds . 始终如一 half a second faster than the C solution . 如果没有 -fllvm 标志，我仍然会得到 8.182 seconds ，所以NCG后端在这种情况下表现也很好 .

结论：Haskell非常棒 .

Resulting Code
```
factorCount number = factorCount' number isquare 1 0 - (fromEnum $ square == fromIntegral isquare)
    where square = sqrt $ fromIntegral number
          isquare = floor square

factorCount' :: Int -> Int -> Int -> Int -> Int
factorCount' number sqrt candidate0 count0 = go candidate0 count0
  where
  go candidate count
    | candidate > sqrt = count
    | number `rem` candidate == 0 = go (candidate + 1) (count + 2)
    | otherwise = go (candidate + 1) count

nextTriangle index triangle
    | factorCount triangle > 1000 = triangle
    | otherwise = nextTriangle (index + 1) (triangle + index + 1)

main = print $ nextTriangle 1 1
```
编辑：所以现在我们已经探索过，让我们解决问题

问题1：erlang，python和haskell是否因为使用任意长度整数而失去速度，或者只要值小于MAXINT就不会失去速度？

在Haskell中，使用 Integer 比 Int 慢，但慢多少取决于执行的计算 . 幸运的是（对于64位机器） Int 就足够了 . 为了便于携带，您应该重写我的代码以使用 Int64 或 Word64 （C不是唯一具有 long 的语言） .

问题2：为什么haskell这么慢？是否有编译器标志关闭刹车或是我的实施？（后者非常可能，因为haskell是一本带有七个印章的书 . ）问题3：你能否提供一些提示如何优化这些实现而不改变我确定因素的方式？以任何方式进行优化：更好，更快，更“本地”的语言 .

这就是我上面回答的问题 . 答案是
- 0）通过 -O2 使用优化
- 1）尽可能使用快速（特别是：unbox-able）类型
- 2） rem not mod （经常被遗忘的优化）和
- 3）worker / wrapper转换（也许是最常见的优化） .
问题4：我的功能实现是否允许LCO，从而避免在调用堆栈中添加不必要的帧？

是的，那不是问题 . 干得好，很高兴你考虑过这个 .
回复于 2024-04-25T22:45:12+08:00

与Project Euler进行速度比较：C vs Python vs Erlang vs Haskell

18 回答

第一步：作者计划的基准

第二步：再次调查，改进和基准

第三步：理解数字

总结：经验教训

相关问题