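I have a project in which I'm building a decision tree in Haskell. The generated tree will have multiple branches that are independent of each other, so I thought they could be built in parallel.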

The DecisionTree data type is defined as follows:

data DecisionTree =
    Question Filter DecisionTree DecisionTree |    
    Answer DecisionTreeResult

instance NFData DecisionTree where
    rnf (Answer dtr)            = rnf dtr
    rnf (Question fil dt1 dt2)  = rnf fil `seq` rnf dt1 `seq` rnf dt2

This is the part of the algorithm that constructs the tree:

constructTree :: TrainingParameters -> [Map String Value] -> Filter -> Either String DecisionTree    
constructTree trainingParameters trainingData fil =    
    if informationGain trainingData (parseFilter fil) < entropyLimit trainingParameters    
    then constructAnswer (targetVariable trainingParameters) trainingData    
    else
        Question fil <$> affirmativeTree <*> negativeTree `using` evalTraversable parEvalTree    
        where   affirmativeTree   = trainModel trainingParameters passedTData    
                negativeTree      = trainModel trainingParameters failedTData    
                passedTData       = filter (parseFilter fil) trainingData    
                failedTData       = filter (not . parseFilter fil) trainingData

parEvalTree :: Strategy DecisionTree    
parEvalTree (Question f dt1 dt2) = do    
    dt1' <- rparWith rdeepseq dt1    
    dt2' <- rparWith rdeepseq dt2    
    return $ Question f dt1' dt2'
parEvalTree ans = return ans

trainModel recursively calls constructTree. The line relevant to parallelism is:

Question fil <$> affirmativeTree <*> negativeTree `using` evalTraversable parEvalTree

I build it with the GHC flags -threaded -O2 -rtsopts -eventlog and run it with stack exec -- performance-test +RTS -A200M -N -s -l (I'm on a 2-core machine).

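But it does not seem to run in parallel. Elapsed time is almost the same as CPU time, and 104 of the 164 sparks fizzled: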

SPARKS: 164 (60 converted, 0 overflowed, 0 dud, 0 GC'd, 104 fizzled)

INIT    time    0.000s  (  0.009s elapsed)
MUT     time   29.041s  ( 29.249s elapsed)
GC      time    0.048s  (  0.015s elapsed)
EXIT    time    0.001s  (  0.006s elapsed)
Total   time   29.091s  ( 29.279s elapsed)

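I suspect there may be some problem with the recursive calls combined with rdeepseq and the parallel strategy. If some experienced Haskellers would chime in, it would really make my day :)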
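
One way to probe that suspicion, offered as a minimal sketch and not a confirmed diagnosis: `using` has the lowest operator precedence, so in the line above the strategy appears to apply to the whole `Either String DecisionTree` value. `evalTraversable` can only reach the tree after the `Either` is known to be `Right`, which in turn seems to require both recursive `trainModel` results to have been computed already. Sparks created at that point would find little left to do, which would be consistent with the fizzled count. The sketch below sparks the two `Either`-valued subcomputations before they are combined. `Tree`, `build`, and `size` are hypothetical stand-ins for `DecisionTree` and `constructTree`, not the real code, and the example assumes the `parallel` and `deepseq` packages are available.

```haskell
import Control.DeepSeq (NFData (..))
import Control.Parallel.Strategies (rdeepseq, rparWith, runEval)

-- Hypothetical, simplified stand-in for the question's DecisionTree.
data Tree = Leaf Int | Node Tree Tree

instance NFData Tree where
  rnf (Leaf n)   = rnf n
  rnf (Node l r) = rnf l `seq` rnf r

-- Hypothetical stand-in for constructTree: it can fail with Left,
-- and it recurses into two independent subtrees.
build :: Int -> Either String Tree
build d
  | d < 0     = Left "negative depth"
  | d == 0    = Right (Leaf 1)
  | otherwise = runEval $ do
      -- Spark each Either-valued subcomputation before anything inspects it.
      l <- rparWith rdeepseq (build (d - 1))
      r <- rparWith rdeepseq (build (d - 1))
      -- Combine only afterwards; this is where Left or Right gets decided.
      return (Node <$> l <*> r)

-- Count the leaves, just to force and check the result.
size :: Tree -> Int
size (Leaf _)   = 1
size (Node l r) = size l + size r

main :: IO ()
main = print (size <$> build 10)  -- prints: Right 1024
```

Built with -threaded and run with +RTS -N -s as above, the SPARKS line would show whether more sparks get converted. Sparking at every level of the recursion creates many small sparks, so a depth cutoff might be worth trying in the real code.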