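I have a project in which I'm building a decision tree in Haskell. The resulting tree will have multiple branches that are independent of each other, so I figured they could be built in parallel.
The DecisionTree data type is defined as follows: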
    data DecisionTree =
      Question Filter DecisionTree DecisionTree |
      Answer DecisionTreeResult

    instance NFData DecisionTree where
      rnf (Answer dtr) = rnf dtr
      rnf (Question fil dt1 dt2) = rnf fil `seq` rnf dt1 `seq` rnf dt2
Here is the part of the algorithm that constructs the tree:
    constructTree :: TrainingParameters -> [Map String Value] -> Filter -> Either String DecisionTree
    constructTree trainingParameters trainingData fil =
      if informationGain trainingData (parseFilter fil) < entropyLimit trainingParameters
      then constructAnswer (targetVariable trainingParameters) trainingData
      else
        Question fil <$> affirmativeTree <*> negativeTree `using` evalTraversable parEvalTree
      where affirmativeTree = trainModel trainingParameters passedTData
            negativeTree = trainModel trainingParameters failedTData
            passedTData = filter (parseFilter fil) trainingData
            failedTData = filter (not . parseFilter fil) trainingData

    parEvalTree :: Strategy DecisionTree
    parEvalTree (Question f dt1 dt2) = do
      dt1' <- rparWith rdeepseq dt1
      dt2' <- rparWith rdeepseq dt2
      return $ Question f dt1' dt2'
    parEvalTree ans = return ans
trainModel recursively calls constructTree. The line relevant to the parallelism is:
    Question fil <$> affirmativeTree <*> negativeTree `using` evalTraversable parEvalTree
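In case it helps to reproduce this, the same pattern can be sketched on a toy tree. This is only a sketch under assumptions: Tree, parTree, build and size are hypothetical stand-ins for my real types and functions, with Int replacing both Filter and DecisionTreeResult, and it needs the parallel and deepseq packages.

```haskell
import Control.DeepSeq (NFData (..))
import Control.Parallel.Strategies
  (Strategy, evalTraversable, rdeepseq, rparWith, using)

-- Hypothetical stand-in for DecisionTree (Int replaces Filter and the result type).
data Tree
  = Node Int Tree Tree
  | Leaf Int

instance NFData Tree where
  rnf (Leaf n) = rnf n
  rnf (Node n l r) = rnf n `seq` rnf l `seq` rnf r

-- Same shape as parEvalTree: spark both subtrees and force each one fully.
parTree :: Strategy Tree
parTree (Node n l r) = do
  l' <- rparWith rdeepseq l
  r' <- rparWith rdeepseq r
  return (Node n l' r')
parTree leaf = return leaf

-- Hypothetical stand-in for constructTree; it fails on a negative depth
-- so that the Either wrapper is kept, as in the real code.
build :: Int -> Either String Tree
build d
  | d < 0 = Left "negative depth"
  | d == 0 = Right (Leaf 1)
  | otherwise =
      Node d <$> build (d - 1) <*> build (d - 1) `using` evalTraversable parTree

-- Counts every node and leaf, just to have something to print.
size :: Tree -> Int
size (Leaf _) = 1
size (Node _ l r) = 1 + size l + size r

main :: IO ()
main = print (fmap size (build 10)) -- prints: Right 2047
```

Since `using` is declared infixl 0, the strategy is applied to the whole `Node d <$> ... <*> ...` expression, which is how the line in my real code parses as well.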
I build it with the GHC flags -threaded -O2 -rtsopts -eventlog
and run it with stack exec -- performance-test +RTS -A200M -N -s -l
(I'm on a 2-core machine).
But it doesn't seem to run in parallel:
    SPARKS: 164 (60 converted, 0 overflowed, 0 dud, 0 GC'd, 104 fizzled)

    INIT    time    0.000s  (  0.009s elapsed)
    MUT     time   29.041s  ( 29.249s elapsed)
    GC      time    0.048s  (  0.015s elapsed)
    EXIT    time    0.001s  (  0.006s elapsed)
    Total   time   29.091s  ( 29.279s elapsed)
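I suspect there may be some problem with using rdeepseq together with the recursive calls to the parallel strategy. If some experienced Haskellers could chime in, it would really make my day :)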