Duplicate File Detective provides a file hash caching subsystem that can improve duplicate search execution performance in many scenarios.
When Duplicate File Detective is configured to compare file contents, it generates content checksum values for the files that require them. From a performance perspective, the creation of file checksums (i.e. "hashing") is the most expensive (i.e. the slowest) operation in the duplicate analysis process chain. File hashing requires that the files being analyzed are read in full, so its performance is driven primarily by disk (and sometimes network) I/O.
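To illustrate why hashing cost grows with file size, here is a minimal Python sketch of content hashing. It is not the program's actual implementation: the function name `hash_file`, the chunk size, and the choice of SHA-256 are assumptions made for the example. The point is that every byte of the file must be read before a checksum is available.

```python
import hashlib


def hash_file(path, chunk_size=1024 * 1024):
    """Return a hex digest of the file's full contents.

    Assumption: SHA-256 is used purely for illustration; the algorithm
    the product actually uses is not stated in this document.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)  # one read per chunk: I/O dominates
            if not chunk:
                break
            digest.update(chunk)
    return digest.hexdigest()
```

Reading in fixed-size chunks keeps memory use flat for large files, but the total amount of I/O is still the full file size.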
Hash caching can help mitigate the performance impact of file hashing operations for subsequent project runs by storing hash results in memory (and optionally on disk) for later re-use. Before a file is hashed during a duplicate search, Duplicate File Detective queries the cache to see if a hash value for the file already exists. If a cache entry already exists and has the same creation date, modification date, and size as the current file, its hash value can be re-used (as opposed to being recomputed). The resulting performance gain can be dramatic, particularly when the files in question are large.
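The lookup described above can be sketched in Python as follows. This is a hypothetical model, not the product's code: the names `HashCache`, `CacheEntry`, and `get_hash` are invented for the example, SHA-256 is an assumed algorithm, and the creation date is approximated with `st_birthtime` where the platform provides it (falling back to `st_ctime`, which on Unix is the metadata change time rather than a true creation time).

```python
import hashlib
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheEntry:
    created: float   # creation timestamp recorded when the hash was computed
    modified: float  # modification timestamp recorded at the same moment
    size: int        # file size in bytes
    digest: str      # cached content hash


class HashCache:
    """Hypothetical in-memory hash cache keyed by absolute file path."""

    def __init__(self):
        self._entries = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _signature(path):
        st = os.stat(path)
        # Assumption: st_birthtime where available, otherwise st_ctime.
        created = getattr(st, "st_birthtime", st.st_ctime)
        return created, st.st_mtime, st.st_size

    @staticmethod
    def _compute(path):
        digest = hashlib.sha256()  # assumed algorithm, for illustration only
        with open(path, "rb") as handle:
            for chunk in iter(lambda: handle.read(1024 * 1024), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def get_hash(self, path):
        key = os.path.abspath(path)
        created, modified, size = self._signature(key)
        entry = self._entries.get(key)
        if entry and (entry.created, entry.modified, entry.size) == (created, modified, size):
            self.hits += 1       # metadata unchanged: re-use the stored hash
            return entry.digest
        self.misses += 1         # no entry, or the file changed: recompute
        digest = self._compute(key)
        self._entries[key] = CacheEntry(created, modified, size, digest)
        return digest
```

A cache hit costs a single metadata query, while a miss costs a full read of the file, which is why the gain is largest for large files.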
Hash caching is most valuable (i.e. provides the greatest performance gain) in scenarios where duplicate search paths are re-used and the files those search paths contain are relatively large. For example, if you regularly search a common set of user directories for duplicate files (by comparing file contents), the file hash cache can dramatically speed up subsequent searches, particularly when those files are semi-static (i.e. change infrequently).
For projects that do not use file content comparison, the hash cache is not used.
The hash cache is limited by size (i.e. a specific number of cache entries) so as to constrain its usage of host system memory. You can adjust the maximum number of hash cache entries to suit your needs - larger values grant the cache more room to grow, while also increasing memory consumption.
You can also specify the minimum cache candidate file size. Files smaller than this value (which is specified in KB) will never have their hash values cached.
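As a rough model of these two settings, the Python sketch below shows how a maximum entry count and a minimum candidate size might gate what enters the cache. The names `HashCacheSettings`, `max_entries`, `min_file_size_kb`, and `is_cache_candidate` are hypothetical, the default values are placeholders rather than the program's real defaults, and 1 KB is assumed to mean 1024 bytes.

```python
from dataclasses import dataclass


@dataclass
class HashCacheSettings:
    max_entries: int = 10000    # placeholder value, not the product default
    min_file_size_kb: int = 64  # placeholder value, not the product default


def is_cache_candidate(size_bytes, settings):
    """Return True if a file of this size may have its hash cached.

    Files smaller than the configured minimum are never cached.
    Assumption: 1 KB = 1024 bytes.
    """
    return size_bytes >= settings.min_file_size_kb * 1024


def has_room(current_entry_count, settings):
    """Return True if the cache can accept another entry without pruning."""
    return current_entry_count < settings.max_entries
```

Raising the minimum size keeps small, cheap-to-hash files from occupying entries that larger files would benefit from more.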
When the hash cache becomes full, it will remove (or prune) a percentage of existing entries in order to make room for new ones. Because the expense of file hashing is directly proportional to a file's size, the hash cache prefers larger files (the preference is proportional to the file size). Therefore, entries will be removed based upon a combination of age and corresponding file size.
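One way to picture pruning by a combination of age and file size is the Python sketch below. It is only an illustration under stated assumptions: the document does not give the actual weighting or the percentage removed, so the scoring formula (size divided by age), the `fraction` parameter, and the function name `prune` are all hypothetical.

```python
def prune(entries, now, fraction=0.25):
    """Remove the lowest-value entries and return their keys.

    entries:  dict mapping path -> (size_bytes, last_used_timestamp)
    now:      current timestamp, in the same units as last_used_timestamp
    fraction: share of entries to remove (assumed; the real value is not documented)
    """
    if not entries:
        return []
    remove_count = max(1, int(len(entries) * fraction))

    def retention_score(item):
        _, (size_bytes, last_used) = item
        age = max(now - last_used, 0.0)
        # Assumed weighting: larger files score higher, older entries score lower.
        return size_bytes / (1.0 + age)

    victims = sorted(entries.items(), key=retention_score)[:remove_count]
    for path, _ in victims:
        del entries[path]
    return [path for path, _ in victims]
```

Under this scoring, a small file that has not been seen for a long time is removed first, while a large, recently used file is the last to go.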
Users have the option of persisting the hash cache on disk between sessions. When this option is engaged, the cache is saved to disk upon program exit and loaded from disk upon program start.
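The save-on-exit and load-on-start behavior can be sketched in Python as shown below. The on-disk format the program really uses is not described here, so the JSON layout, the function names `save_cache` and `load_cache`, and the entry field names in the usage note are assumptions for illustration only.

```python
import json
import os


def save_cache(entries, cache_path):
    """Write cache entries to disk (for example, at program exit).

    Writes to a temporary file first and then renames it, so an
    interrupted save does not leave a half-written cache file behind.
    """
    temp_path = cache_path + ".tmp"
    with open(temp_path, "w", encoding="utf-8") as handle:
        json.dump(entries, handle)
    os.replace(temp_path, cache_path)


def load_cache(cache_path):
    """Read cache entries from disk (for example, at program start).

    A missing or unreadable cache file simply yields an empty cache.
    """
    try:
        with open(cache_path, "r", encoding="utf-8") as handle:
            return json.load(handle)
    except (FileNotFoundError, json.JSONDecodeError):
        return {}
```

In this sketch, an entry might look like `{"size": 1048576, "modified": 1700000000.0, "digest": "ab12"}` keyed by file path. Because each entry is still checked against the file's current dates and size before re-use, a stale persisted cache costs nothing worse than a recomputed hash.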
Users can also clear the hash cache explicitly via the Clear Cache Now button. This might be useful in cases where you know that existing cache entries no longer have value (e.g. previous duplicate file search paths are very unlikely to be used again in the future).
The statistics area of the hash cache shows a range of related metrics, including the number of current entries, smallest and largest entry file size, session hits, etc.
Users who become familiar with how hash caching operates may find these metrics useful in fine-tuning the available settings.