The Parallel.ForEach
method is intended for parallelizing CPU-bound workloads. Downloading a file is an I/O bound workload, and so the Parallel.ForEach
is not ideal for this case because it needlessly blocks ThreadPool
threads. The correct way to do it is asynchronously, with async/await. The recommended class for making asynchronous web requests is the HttpClient
, and for controlling the level of concurrency an excellent option is the TPL Dataflow library. For this case it is enough to use the simplest component of this library, the ActionBlock
class:
async Task DownloadListAsync(List<string> list)
{
using (var httpClient = new HttpClient())
{
var rest = ExcludeDownloaded(list);
var block = new ActionBlock<string>(async link =>
{
await DownloadFileAsync(httpClient, link);
}, new ExecutionDataflowBlockOptions()
{
MaxDegreeOfParallelism = 10
});
foreach (var link in rest)
{
await block.SendAsync(link);
}
block.Complete();
await block.Completion;
}
}
async Task DownloadFileAsync(HttpClient httpClient, string link)
{
var fileName = Guid.NewGuid().ToString(); // code to generate unique fileName;
var filePath = Path.Combine(SavePath, fileName);
if (File.Exists(filePath)) return;
var response = await httpClient.GetAsync(link);
response.EnsureSuccessStatusCode();
using (var contentStream = await response.Content.ReadAsStreamAsync())
using (var fileStream = new FileStream(filePath, FileMode.Create,
FileAccess.Write, FileShare.None, 32768, FileOptions.Asynchronous))
{
await contentStream.CopyToAsync(fileStream);
}
}
The code for downloading a file with HttpClient
is not as simple as the WebClient.DownloadFile()
, but it’s what you have to do in order to keep the whole process asynchronous (both reading from the web and writing to the disk).
Caveat: Asynchronous filesystem operations are currently not implemented efficiently in .NET. For maximum efficiency it may be preferable to avoid using the FileOptions.Asynchronous
option in the FileStream
constructor.
.NET 6 update: The preferable way for parallelizing asynchronous work is now the Parallel.ForEachAsync
API. A usage example can be found here.