Producer/consumer for a web crawler with a queue of unknown size

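I need to crawl parent web pages and their child pages, and I am following the producer/consumer concept from http://www.albahari.com/threading/part4.aspx#%5FWait%5Fand%5FPulse. I also use 5 threads to enqueue and dequeue links.

Given that the length of the queue is not known in advance, how can I end/join all the threads once they have all finished processing the queue?

Below is a rough idea of how I have coded it.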

static void Main(string[] args)
{
    //enqueue parent links here
    ...
    //then start crawling via threading
    ...
}

public void Crawl()
{
   //dequeue
   //get child links
   //enqueue child links
}

4 Answers

If all threads are idle (i.e. waiting on the queue) and the queue is empty, then you are done.

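A simple way to handle this is to have the threads use a timeout when they try to take from the queue, with something like BlockingCollection.TryTake. Whenever TryTake times out, the thread updates a field that records how long it has been idle: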
```
while (!queue.TryTake(out item, 5000, token))
{
    if (token.IsCancellationRequested)
        break;
    // here, update idle counter
}
```
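You can then have a timer that fires every 15 seconds or so and checks the idle counters of all threads. If every thread has been idle for some period (perhaps a minute), the timer can cancel the cancellation token. That stops all the threads. Your main program can monitor the cancellation token as well.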
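
The idea above could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: the class name, `GetChildLinks` (a stand-in for real page fetching), the toy link tree and the shortened timings (200 ms polls, 1 s idle limit instead of 15 s and a minute) are all hypothetical. Note that the approach is a heuristic: it assumes no single page takes longer to process than the idle limit.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

class IdleTimeoutCrawler
{
    const int NumThreads = 5;
    const int PollMs = 200;        // TryTake timeout (shortened for the demo)
    const int IdleLimitMs = 1000;  // how long every thread must be idle before stopping

    readonly BlockingCollection<string> queue = new BlockingCollection<string>();
    readonly CancellationTokenSource cts = new CancellationTokenSource();
    readonly int[] idleMs = new int[NumThreads]; // per-thread idle time in ms
    int processed;

    void Crawl(int id)
    {
        try
        {
            while (true)
            {
                string url;
                while (!queue.TryTake(out url, PollMs, cts.Token))
                    Interlocked.Add(ref idleMs[id], PollMs);  // timed out: still idle
                Interlocked.Exchange(ref idleMs[id], 0);      // got work: reset the counter
                foreach (string child in GetChildLinks(url))
                    queue.Add(child);
                Interlocked.Increment(ref processed);
            }
        }
        catch (OperationCanceledException)
        {
            // TryTake throws once the token is cancelled: end the thread.
        }
    }

    // Timer callback: cancel only if every thread has been idle long enough.
    void CheckIdle(object state)
    {
        for (int i = 0; i < NumThreads; i++)
            if (Volatile.Read(ref idleMs[i]) < IdleLimitMs)
                return;
        cts.Cancel();
    }

    // Hypothetical stand-in for downloading a page and extracting its links.
    static string[] GetChildLinks(string url)
    {
        return url.Length < 3 ? new[] { url + "a", url + "b" } : new string[0];
    }

    public int Run()
    {
        queue.Add("p"); // enqueue the parent link
        var threads = new Thread[NumThreads];
        for (int i = 0; i < NumThreads; i++)
        {
            int id = i; // copy for the lambda
            threads[i] = new Thread(() => Crawl(id));
            threads[i].Start();
        }
        using (new Timer(CheckIdle, null, PollMs, PollMs))
        {
            foreach (Thread t in threads)
                t.Join();
        }
        return processed;
    }

    static void Main()
    {
        Console.WriteLine(new IdleTimeoutCrawler().Run()); // 7 links in this toy tree
    }
}
```

The main thread simply joins the workers; the timer is the only place that decides the crawl is over.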

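By the way, you can do this without BlockingCollection and cancellation. You'll just have to create your own cancellation signaling mechanism, and if you're using a lock on the queue, you can replace the lock syntax with Monitor.TryEnter, and so on.

There are several other ways to handle this, although they would require some major restructuring of your program.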
Answered 2024-04-26T01:40:47+08:00

You can enqueue a dummy token at the end and have the threads exit when they encounter it. For example:

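static int report = 0; // shared by all threads: how many currently see an empty queue

public void Crawl()
{
   bool reported = false; // true while this thread is counted in report
   while(true)
   {
       if(!(queue.Count == 0))
       {
          if(reported) { Interlocked.Decrement(ref report); reported = false; }
          string token = queue.Dequeue();
          if(token == "TERMINATION")
          {
             queue.Enqueue(token); // put it back so the other threads see it too
             return;
          }
          //get child links of token
          //enqueue child links
       }
       else
       {
          // this thread has found the queue empty
          if(!reported) { Interlocked.Increment(ref report); reported = true; }
          if(report == num_threads) // all threads have signaled empty queue
             queue.Enqueue("TERMINATION");
       }
    }
}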

Of course, I have omitted the locks around the enqueue/dequeue operations.
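
A self-contained variant with the locking filled in might look like the sketch below. It is only one possible reading of the idea: it swaps the busy loop for Monitor.Wait/PulseAll, and the class name, `GetChildLinks` and the toy link tree are hypothetical placeholders. A thread counts itself as idle only while it holds the lock and sees an empty queue, so when the idle count reaches the number of threads, no link can still be in flight.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

class SentinelCrawler
{
    const int NumThreads = 5;
    const string Termination = "TERMINATION";

    readonly object gate = new object();
    readonly Queue<string> queue = new Queue<string>();
    int idle;       // threads currently waiting on an empty queue (guarded by gate)
    int processed;  // guarded by gate

    void Crawl()
    {
        while (true)
        {
            string url;
            lock (gate)
            {
                while (queue.Count == 0)
                {
                    idle++;
                    if (idle == NumThreads)
                    {
                        // Every thread is waiting and nothing is queued:
                        // no new work can appear, so publish the sentinel.
                        queue.Enqueue(Termination);
                        Monitor.PulseAll(gate);
                    }
                    else
                    {
                        Monitor.Wait(gate); // releases the lock while waiting
                    }
                    idle--;
                }
                if (queue.Peek() == Termination)
                    return; // leave the sentinel queued for the other threads
                url = queue.Dequeue();
            }

            string[] children = GetChildLinks(url); // slow work happens outside the lock

            lock (gate)
            {
                processed++;
                foreach (string child in children)
                    queue.Enqueue(child);
                Monitor.PulseAll(gate); // wake threads waiting for work
            }
        }
    }

    // Hypothetical stand-in for downloading a page and extracting its links.
    static string[] GetChildLinks(string url)
    {
        return url.Length < 3 ? new[] { url + "a", url + "b" } : new string[0];
    }

    public int Run()
    {
        queue.Enqueue("p"); // enqueue the parent link
        var threads = new Thread[NumThreads];
        for (int i = 0; i < NumThreads; i++)
        {
            threads[i] = new Thread(Crawl);
            threads[i].Start();
        }
        foreach (Thread t in threads)
            t.Join();
        return processed;
    }

    static void Main()
    {
        Console.WriteLine(new SentinelCrawler().Run()); // 7 links in this toy tree
    }
}
```

Because the sentinel is peeked rather than dequeued, a single token is enough to stop all the threads.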

Answered 2024-04-26T01:40:47+08:00

The threads can signal that they have finished their work, for example by raising an event or calling a delegate.

static void Main(string[] args)
{
    //enqueue parent links here
    ...
    //then start crawling via threading
    ...
}

public void X()
{
    //block the threads until all of them are here
}

public void Crawl(Action x)
{
    //dequeue
    //get child links
    //enqueue child links
    //call x()
}
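
The outline above leaves open how the signal is implemented. One way to make it concrete, offered here as an assumption rather than as what the answer intends, is to count outstanding links with a CountdownEvent: each child is registered before it is enqueued, each link signals when it is done, and the main thread waits for the count to reach zero. The class name, `GetChildLinks` and the toy link tree are hypothetical.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

class CountdownCrawler
{
    const int NumThreads = 5;

    readonly BlockingCollection<string> queue = new BlockingCollection<string>();
    // Counts links that are queued or being processed; starts at 1 for the seed link.
    readonly CountdownEvent pending = new CountdownEvent(1);
    int processed;

    void Crawl()
    {
        // Blocks while the queue is empty; the loop ends after CompleteAdding().
        foreach (string url in queue.GetConsumingEnumerable())
        {
            foreach (string child in GetChildLinks(url))
            {
                pending.AddCount(); // register the child before it becomes visible
                queue.Add(child);
            }
            Interlocked.Increment(ref processed);
            pending.Signal();       // this link is finished
        }
    }

    // Hypothetical stand-in for downloading a page and extracting its links.
    static string[] GetChildLinks(string url)
    {
        return url.Length < 3 ? new[] { url + "a", url + "b" } : new string[0];
    }

    public int Run()
    {
        queue.Add("p"); // enqueue the parent link (matches the initial count of 1)
        var threads = new Thread[NumThreads];
        for (int i = 0; i < NumThreads; i++)
        {
            threads[i] = new Thread(Crawl);
            threads[i].Start();
        }
        pending.Wait();         // returns once every link has been processed
        queue.CompleteAdding(); // lets the worker loops finish
        foreach (Thread t in threads)
            t.Join();
        return processed;
    }

    static void Main()
    {
        Console.WriteLine(new CountdownCrawler().Run()); // 7 links in this toy tree
    }
}
```

Since children are added to the count before their parent signals, the count cannot drop to zero while work remains.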

Answered 2024-04-26T01:40:47+08:00

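If you are willing to use the Task Parallel Library, there is really no need to handle the producer-consumer plumbing by hand. When a task is created with the AttachedToParent option, the child task is linked to its parent, so the parent does not complete until its children have completed.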

using System.Threading.Tasks;

class Program
{
    static void Main(string[] args)
    {
        var task = CrawlAsync("http://stackoverflow.com");
        task.Wait();
    }

    static Task CrawlAsync(string url)
    {
        return Task.Factory.StartNew(
            () =>
            {
                string[] children = ExtractChildren(url);
                foreach (string child in children)
                {
                    CrawlAsync(child);
                }
                ProcessUrl(url);
            }, TaskCreationOptions.AttachedToParent);
    }

    static string[] ExtractChildren(string root)
    {
      // Return all child urls here.
      return new string[0]; // placeholder so the sample compiles
    }

    static void ProcessUrl(string url)
    {
      // Process the url here.
    }
}

You can use Parallel.ForEach to remove some of the explicit task-creation logic.
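
A rough sketch of what that could look like follows. Parallel.ForEach returns only after all of its iterations have finished, so the recursive call on the root returns once the whole tree has been crawled. The class name, the toy `ExtractChildren` and the visited set (added here as an assumption, to avoid crawling a URL twice when links form a cycle) are not part of the answer.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

class ParallelCrawler
{
    // Assumption: remember visited URLs so that cyclic links are crawled once.
    readonly ConcurrentDictionary<string, bool> visited =
        new ConcurrentDictionary<string, bool>();
    int processed;

    void Crawl(string url)
    {
        if (!visited.TryAdd(url, true))
            return; // already crawled by another branch

        // Blocks until every child, and recursively its children, is done.
        Parallel.ForEach(ExtractChildren(url), child => Crawl(child));

        ProcessUrl(url);
    }

    // Hypothetical stand-in for downloading a page and extracting its links.
    static string[] ExtractChildren(string root)
    {
        return root.Length < 3 ? new[] { root + "a", root + "b" } : new string[0];
    }

    void ProcessUrl(string url)
    {
        Interlocked.Increment(ref processed); // placeholder for real processing
    }

    public int Run(string rootUrl)
    {
        Crawl(rootUrl);
        return processed;
    }

    static void Main()
    {
        Console.WriteLine(new ParallelCrawler().Run("p")); // 7 links in this toy tree
    }
}
```

No explicit queue or thread management is needed here; the library schedules the work and the call stack tracks completion.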

Answered 2024-04-26T01:40:47+08:00
