I am trying to implement a generic file system crawler, for example one that can enumerate all subfolders starting from a given root. I would like to do this using the async/await/Task paradigm.

Below is my code. It works, but I suspect it can be improved. In particular, the Task.WaitAll marked with a comment causes unnecessary waiting in deep directory trees, because the loop waits at each level of the tree instead of immediately continuing with the new folders being added to folderQueue.

Somehow I would like the tasks for folders that are newly added to folderQueue to be included in the list of tasks that Task.WaitAll() is already waiting on, while the WaitAll is in progress. Is that even possible?
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

class FileSystemCrawlerSO
{
    static void Main(string[] args)
    {
        FileSystemCrawlerSO crawler = new FileSystemCrawlerSO();

        Stopwatch watch = new Stopwatch();
        watch.Start();

        crawler.CollectFolders(@"d:\www");

        watch.Stop();
        Console.WriteLine($"Collected {crawler.NumFolders:N0} folders in {watch.ElapsedMilliseconds} milliseconds.");

        if (Debugger.IsAttached)
            Console.ReadKey();
    }

    public int NumFolders { get; set; }

    private readonly Queue<DirectoryInfo> folderQueue;

    public FileSystemCrawlerSO()
    {
        folderQueue = new Queue<DirectoryInfo>();
    }

    public void CollectFolders(string path)
    {
        DirectoryInfo directoryInfo = new DirectoryInfo(path);
        lock (folderQueue)
            folderQueue.Enqueue(directoryInfo);

        List<Task> tasks = new List<Task>();
        do
        {
            tasks.Clear();
            lock (folderQueue)
            {
                while (folderQueue.Any())
                {
                    var folder = folderQueue.Dequeue();
                    Task task = Task.Run(() => CrawlFolder(folder));
                    tasks.Add(task);
                }
            }

            if (tasks.Any())
            {
                Console.WriteLine($"Waiting for {tasks.Count} tasks...");
                Task.WaitAll(tasks.ToArray()); //<== NOTE: THIS IS NOT OPTIMAL
            }
        } while (tasks.Any());
    }

    private void CrawlFolder(DirectoryInfo dir)
    {
        try
        {
            DirectoryInfo[] directoryInfos = dir.GetDirectories();
            lock (folderQueue)
                foreach (DirectoryInfo childInfo in directoryInfos)
                    folderQueue.Enqueue(childInfo);

            // Do something with the current folder
            // e.g. Console.WriteLine($"{dir.FullName}");
            NumFolders++;
        }
        catch (Exception ex)
        {
            while (ex != null)
            {
                Console.WriteLine($"{ex.GetType()} {ex.Message}\n{ex.StackTrace}");
                ex = ex.InnerException;
            }
        }
    }
}
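For reference, the effect being asked about (having the wait also cover tasks that are created while it is already in progress) can be approximated by making each folder's task await the tasks of its own subfolders, so a single await at the root transitively covers tasks added later. The following is only a rough sketch of that idea, not code from the question; the names RecursiveCrawlerSketch, CollectFoldersAsync, CrawlFolderAsync and numFolders are illustrative.

using System;
using System.IO;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Sketch only: each folder's task awaits the tasks of its subfolders, so the
// single await at the root transitively waits for tasks created later on.
class RecursiveCrawlerSketch
{
    private int numFolders;   // incremented with Interlocked to avoid races

    public int NumFolders => numFolders;

    public Task CollectFoldersAsync(string path) =>
        CrawlFolderAsync(new DirectoryInfo(path));

    private async Task CrawlFolderAsync(DirectoryInfo dir)
    {
        DirectoryInfo[] children;
        try
        {
            children = dir.GetDirectories();
        }
        catch (Exception ex)
        {
            Console.WriteLine($"{ex.GetType()} {ex.Message}");
            return;
        }

        // Do something with the current folder
        // e.g. Console.WriteLine($"{dir.FullName}");
        Interlocked.Increment(ref numFolders);

        // Start one task per child folder; Task.Run with an async delegate
        // returns a Task that completes when the whole child crawl is done.
        Task[] childTasks = children
            .Select(child => Task.Run(() => CrawlFolderAsync(child)))
            .ToArray();
        await Task.WhenAll(childTasks);
    }
}

It could be driven from a console Main with new RecursiveCrawlerSketch().CollectFoldersAsync(@"d:\www").Wait(); whether spawning a task per folder actually helps is a separate question, as the answers below point out.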
Stephen Cleary:
In theory, async/await should be able to help here. In practice, not so much. That is because Win32 does not expose asynchronous APIs for the directory functions (or for some file functions, such as opening a file).

Furthermore, parallelizing disk access by using multiple threads (Task.Run) tends to be counterproductive, particularly for traditional (non-SSD) disks. Parallel file system access (as opposed to serial access) tends to cause disk thrashing, which lowers overall throughput.

So, in the general case, I recommend just using the blocking directory enumeration methods. For example:
class FileSystemCrawlerSO
{
    static void Main(string[] args)
    {
        var numFolders = 0;

        Stopwatch watch = new Stopwatch();
        watch.Start();

        foreach (var dir in Directory.EnumerateDirectories(@"d:\www", "*", SearchOption.AllDirectories))
        {
            // Do something with the current folder
            // e.g. Console.WriteLine($"{dir.FullName}");
            ++numFolders;
        }

        watch.Stop();
        Console.WriteLine($"Collected {numFolders:N0} folders in {watch.ElapsedMilliseconds} milliseconds.");

        if (Debugger.IsAttached)
            Console.ReadKey();
    }
}
A nice side effect of doing it this way is that there is no longer a race condition on the folder counter variable (NumFolders).

For a console application, this is all you need to do. If you want to put this into a UI application and you don't want to block the UI thread, then a single Task.Run is sufficient, as sketched below.
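As an illustration of that last point, here is a hedged sketch of the "single Task.Run" pattern in a WinForms-style event handler. The handler name, the resultLabel control and the hard-coded path are placeholders, not part of the original answer; System.IO, System.Linq and System.Threading.Tasks are assumed to be imported.

// Hypothetical WinForms button handler: the blocking enumeration runs on one
// background thread, and the await returns to the UI thread to show the result.
private async void CountButton_Click(object sender, EventArgs e)
{
    int numFolders = await Task.Run(() =>
        Directory.EnumerateDirectories(@"d:\www", "*", SearchOption.AllDirectories).Count());

    // Back on the UI thread here, so updating controls is safe.
    resultLabel.Text = $"Collected {numFolders:N0} folders.";
}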
Here is my suggestion. I use the generic Concurrent*<> classes so that I don't have to take care of the locking myself (although this does not automatically improve performance).

I then start a task for each folder and put it into a ConcurrentBag<Task>. After starting the first task, I keep waiting on whatever task can be taken from the bag, and when no task is left to wait on, I am done.
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

public class FileSystemCrawlerSO
{
    public int NumFolders { get; set; }

    private readonly ConcurrentQueue<DirectoryInfo> folderQueue = new ConcurrentQueue<DirectoryInfo>();
    private readonly ConcurrentBag<Task> tasks = new ConcurrentBag<Task>();

    public void CollectFolders(string path)
    {
        DirectoryInfo directoryInfo = new DirectoryInfo(path);
        tasks.Add(Task.Run(() => CrawlFolder(directoryInfo)));

        Task taskToWaitFor;
        while (tasks.TryTake(out taskToWaitFor))
            taskToWaitFor.Wait();
    }

    private void CrawlFolder(DirectoryInfo dir)
    {
        try
        {
            DirectoryInfo[] directoryInfos = dir.GetDirectories();
            foreach (DirectoryInfo childInfo in directoryInfos)
            {
                // here may be dragons using enumeration variable as closure!!
                DirectoryInfo di = childInfo;
                tasks.Add(Task.Run(() => CrawlFolder(di)));
            }

            // Do something with the current folder
            // e.g. Console.WriteLine($"{dir.FullName}");
            NumFolders++;
        }
        catch (Exception ex)
        {
            while (ex != null)
            {
                Console.WriteLine($"{ex.GetType()} {ex.Message}\n{ex.StackTrace}");
                ex = ex.InnerException;
            }
        }
    }
}
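A minimal way to drive this class, assuming the same kind of console entry point as in the question (the Main method below is only a usage sketch, not part of this answer):

using System;
using System.Diagnostics;

class Program
{
    static void Main(string[] args)
    {
        var crawler = new FileSystemCrawlerSO();

        var watch = Stopwatch.StartNew();
        crawler.CollectFolders(@"d:\www");   // blocks until the bag of tasks runs dry
        watch.Stop();

        Console.WriteLine($"Collected {crawler.NumFolders:N0} folders in {watch.ElapsedMilliseconds} milliseconds.");
    }
}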
I haven't measured whether this is faster than your solution. But I think (as Yacoub Massad said) that the bottleneck will be the IO system itself, not the way the tasks are organized.