没有替换的采样算法？-Java 学习之路

我试图测试特定数据集群偶然发生的可能性 . 一种强有力的方法是蒙特卡罗模拟，其中数据和组之间的关联被随机重新分配很多次（例如10,000），并且使用聚类度量来比较实际数据与模拟以确定ap值 .

我已经完成了大部分工作，使用指针将分组映射到数据元素，因此我计划随机重新分配指向数据的指针 . 问题：在没有替换的情况下采样的快速方法是什么，以便在复制数据集中随机重新分配每个指针？

例如（这些数据只是一个简化的例子）：

数据（n = 12值） - A组：0.1,0.2,0.4 / B组：0.5,0.6,0.8 / C组：0.4,0.5 / D组：0.2,0.2,0.3,0.5

对于每个复制数据集，我将具有相同的簇大小（A = 3，B = 3，C = 2，D = 4）和数据值，但会将值重新分配给簇 .

为此，我可以生成1-12范围内的随机数，分配A组的第一个元素，然后生成1-11范围内的随机数，并分配A组中的第二个元素，依此类推 . 指针重新分配很快，我将预先分配所有数据结构，但没有替换的采样似乎是一个可能已经解决过很多次的问题 .

逻辑或伪代码首选 .

6 回答

5

请参阅我对这个问题的回答Unique (non-repeating) random numbers in O(1)? . 同样的逻辑应该完成你想要做的事情 .

回复于 2024-04-27T04:49:25+08:00

这里有一些基于Knuth的书籍Seminumeric Algorithms的算法3.4.2S的无需替换的代码 .

void SampleWithoutReplacement
(
    int populationSize,    // size of set sampling from
    int sampleSize,        // size of each sample
    vector<int> & samples  // output, zero-offset indicies to selected items
)
{
    // Use Knuth's variable names
    int& n = sampleSize;
    int& N = populationSize;

    int t = 0; // total input records dealt with
    int m = 0; // number of items selected so far
    double u;

    while (m < n)
    {
        u = GetUniform(); // call a uniform(0,1) random number generator

        if ( (N - t)*u >= n - m )
        {
            t++;
        }
        else
        {
            samples[m] = t;
            t++; m++;
        }
    }
}

Jeffrey Scott Vitter在“An Efficient Algorithm for Sequential Random Sampling”，ACM Transactions on Mathematical Software，13（1），1987年3月，58-67中有一种更有效但更复杂的方法 .

回复于 2024-04-27T04:49:25+08:00

基于answer by John D. Cook的C工作代码 .

#include <random>
#include <vector>

double GetUniform()
{
    static std::default_random_engine re;
    static std::uniform_real_distribution<double> Dist(0,1);
    return Dist(re);
}

// John D. Cook, https://stackoverflow.com/a/311716/15485
void SampleWithoutReplacement
(
    int populationSize,    // size of set sampling from
    int sampleSize,        // size of each sample
    std::vector<int> & samples  // output, zero-offset indicies to selected items
)
{
    // Use Knuth's variable names
    int& n = sampleSize;
    int& N = populationSize;

    int t = 0; // total input records dealt with
    int m = 0; // number of items selected so far
    double u;

    while (m < n)
    {
        u = GetUniform(); // call a uniform(0,1) random number generator

        if ( (N - t)*u >= n - m )
        {
            t++;
        }
        else
        {
            samples[m] = t;
            t++; m++;
        }
    }
}

#include <iostream>
int main(int,char**)
{
  const size_t sz = 10;
  std::vector< int > samples(sz);
  SampleWithoutReplacement(10*sz,sz,samples);
  for (size_t i = 0; i < sz; i++ ) {
    std::cout << samples[i] << "\t";
  }

  return 0;
}

回复于 2024-04-27T04:49:25+08:00

受@John D. Cook's answer的启发，我在Nim写了一个实现 . 起初我很难理解它是如何工作的，所以我还广泛评论了一个例子 . 也许这有助于理解这个想法 . 另外，我稍微更改了变量名称 .

iterator uniqueRandomValuesBelow*(N, M: int) =
  ## Returns a total of M unique random values i with 0 <= i < N
  ## These indices can be used to construct e.g. a random sample without replacement
  assert(M <= N)

  var t = 0 # total input records dealt with
  var m = 0 # number of items selected so far

  while (m < M):
    let u = random(1.0) # call a uniform(0,1) random number generator

    # meaning of the following terms:
    # (N - t) is the total number of remaining draws left (initially just N)
    # (M - m) is the number how many of these remaining draw must be positive (initially just M)
    # => Probability for next draw = (M-m) / (N-t)
    #    i.e.: (required positive draws left) / (total draw left)
    #
    # This is implemented by the inequality expression below:
    # - the larger (M-m), the larger the probability of a positive draw
    # - for (N-t) == (M-m), the term on the left is always smaller => we will draw 100%
    # - for (N-t) >> (M-m), we must get a very small u
    #
    # example: (N-t) = 7, (M-m) = 5
    # => we draw the next with prob 5/7
    #    lets assume the draw fails
    # => t += 1 => (N-t) = 6
    # => we draw the next with prob 5/6
    #    lets assume the draw succeeds
    # => t += 1, m += 1 => (N-t) = 5, (M-m) = 4
    # => we draw the next with prob 4/5
    #    lets assume the draw fails
    # => t += 1 => (N-t) = 4
    # => we draw the next with prob 4/4, i.e.,
    #    we will draw with certainty from now on
    #    (in the next steps we get prob 3/3, 2/2, ...)
    if (N - t)*u >= (M - m).toFloat: # this is essentially a draw with P = (M-m) / (N-t)
      # no draw -- happens mainly for (N-t) >> (M-m) and/or high u
      t += 1
    else:
      # draw t -- happens when (M-m) gets large and/or low u
      yield t # this is where we output an index, can be used to sample
      t += 1
      m += 1

# example use
for i in uniqueRandomValuesBelow(100, 5):
  echo i

回复于 2024-04-27T04:49:25+08:00

当种群大小远大于样本大小时，上述算法变得低效，因为它们具有复杂度O（n），n是种群大小 .

当我还是学生时，我写了一些算法，无需替换即可统一采样，平均复杂度为O（s log s），其中s是样本大小 . 以下是二叉树算法的代码，平均复杂度为O（s log s），在R中：

# The Tree growing algorithm for uniform sampling without replacement
# by Pavel Ruzankin 
quicksample = function (n,size)
# n - the number of items to choose from
# size - the sample size
{
  s=as.integer(size)
  if (s>n) {
    stop("Sample size is greater than the number of items to choose from")
  }
  # upv=integer(s) #level up edge is pointing to
  leftv=integer(s) #left edge is poiting to; must be filled with zeros
  rightv=integer(s) #right edge is pointig to; must be filled with zeros
  samp=integer(s) #the sample
  ordn=integer(s) #relative ordinal number

  ordn[1L]=1L #initial value for the root vertex
  samp[1L]=sample(n,1L) 
  if (s > 1L) for (j in 2L:s) {
    curn=sample(n-j+1L,1L) #current number sampled
    curordn=0L #currend ordinal number
    v=1L #current vertice
    from=1L #how have come here: 0 - by left edge, 1 - by right edge
    repeat {
      curordn=curordn+ordn[v]
      if (curn+curordn>samp[v]) { #going down by the right edge
        if (from == 0L) {
          ordn[v]=ordn[v]-1L
        }
        if (rightv[v]!=0L) {
          v=rightv[v]
          from=1L
        } else { #creating a new vertex
          samp[j]=curn+curordn
          ordn[j]=1L
          # upv[j]=v
          rightv[v]=j
          break
        }
      } else { #going down by the left edge
        if (from==1L) {
          ordn[v]=ordn[v]+1L
        }
        if (leftv[v]!=0L) {
          v=leftv[v]
          from=0L
        } else { #creating a new vertex
          samp[j]=curn+curordn-1L
          ordn[j]=-1L
          # upv[j]=v
          leftv[v]=j
          break
        }
      }
    }
  }
  return(samp)  
}

该算法的复杂性在以下讨论：Rouzankin，P . S . ; Voytishek，A . V.关于随机选择算法的成本 . 蒙特卡罗方法Appl . 5（1999），没有 . 1,39-54 . http://dx.doi.org/10.1515/mcma.1999.5.1.39

如果您发现该算法有用，请参考 .

另见：P . Gupta，G . P. Bhattacharjee . （1984）一种无需替换的随机抽样的有效算法 . 国际计算机数学杂志16：4，第201-209页 . DOI：10.1080 / 00207168408803438

Teuhola，J . 和Nevalainen，O . 1982.两种有效的随机抽样算法，无需替换 . / IJCM /，11（2）：127-140 . DOI：10.1080 / 00207168208803304

在上一篇论文中，作者使用哈希表并声称他们的算法具有O（s）复杂性 . 还有一个快速哈希表算法，很快就会在pqR中实现（很快R）：https://stat.ethz.ch/pipermail/r-devel/2017-October/075012.html

回复于 2024-04-27T04:49:25+08:00

1

另一种无需替换的采样算法被描述为here .

它类似于John D. Cook在他的回答和Knuth中描述的那个，但它有不同的假设：种群大小未知，但样本可以适合记忆 . 这个被称为“Knuth算法S” .

引用rosettacode文章：

选择前n个项目作为样本，因为它们可用;对于i> n的第i个项目，有一个随机的n / i保留它的机会 . 如果失败这个机会，样本保持不变 . 如果没有，请随机（1 / n）替换之前选择的n个样本中的一个 . 对任何后续项重复＃2 .

回复于 2024-04-27T04:49:25+08:00

没有替换的采样算法？

6 回答

相关问题