galileofile-reporter
Content Hash Duplicate File Report
Updated: Galileo File Reporter
25.2+ 25.2+
Summary
This report uses the file-content-hash feature. The scan policy provides an option to generate file content hashes for All Files or Files uploaded since the last scan. When enabled, AgentFS generates a SHA256 hash of each file's content and stores it in the database, where it can be compared against other files with matching content.
Details
The report paths are managed in the Report Designer, not in the query.
Line 33 contains the minimum duplicate count for an item to be included in the report.
Line 34 contains the minimum file size (in bytes) to be included in the report.
Code
SQL Server
WITH
q AS (SELECT sd.fullpath,
sd.size,
sd.create_time,
sd.modify_time,
sd.access_time,
sd.name,
COUNT(*) OVER (PARTITION BY sd.content_hash) AS item_count,
Sum(sd.size) OVER (PARTITION BY sd.content_hash) AS total_hash_size,
srs.bytes_to_hex_string(sd.content_hash) as content_hash
FROM srs.scan_data AS sd
INNER JOIN #tmp_cq_fs_paths AS cp ON cp.scan_id = sd.scan_id
AND cp.ns_left <= sd.ns_left
AND cp.ns_right >= sd.ns_right
AND cp.is_filesystem_scan = 'true'
AND cp.is_current = 'true'
WHERE sd.path_type = 1
AND sd.content_hash IS NOT NULL
)
SELECT
q.fullpath,
q.item_count,
srs.byte_string(q.size) AS size_string,
srs.byte_string(CAST(q.total_hash_size AS BIGINT)) AS total_size_string,
(q.total_hash_size - q.size) AS wasted_space,
srs.byte_string(CAST((q.total_hash_size - q.size) AS BIGINT)) AS wasted_space_string,
q.total_hash_size,
q.size,
q.content_hash
FROM
q
WHERE
(q.item_count >= 2) AND
(q.size > 10 * 1024 * 1024)
ORDER BY q.fullpath
PostgreSQL
WITH
q AS (SELECT sd.fullpath,
sd.size,
sd.create_time,
sd.modify_time,
sd.access_time,
sd.name,
COUNT(*) OVER (PARTITION BY sd.content_hash) AS item_count,
Sum(sd.size) OVER (PARTITION BY sd.content_hash) AS total_hash_size,
srs.bytes_to_hex_string(sd.content_hash) as content_hash
FROM srs.scan_data AS sd
INNER JOIN tmp_cq_fs_paths AS cp ON (cp.scan_id = sd.scan_id)
AND (cp.ns_left <= sd.ns_left)
AND (cp.ns_right >= sd.ns_right)
AND (cp.is_filesystem_scan = 'true')
AND (cp.is_current = 'true')
WHERE sd.path_type = 1
AND sd.content_hash IS NOT NULL
)
SELECT
q.fullpath,
q.item_count,
srs.byte_string(q.size) AS size_string,
srs.byte_string(CAST(q.total_hash_size AS BIGINT)) AS total_size_string,
(q.total_hash_size - q.size) AS wasted_space,
srs.byte_string(CAST((q.total_hash_size - q.size) AS BIGINT)) AS wasted_space_string,
q.total_hash_size,
q.size,
q.content_hash
FROM
q
WHERE
(q.item_count >= 2) AND
(q.size > 10 * 1024 * 1024)
ORDER BY q.fullpath Preview Images

Downloads
| Attachment | Size |
|---|---|
| Content Hash Duplicate File Report.zip | 4.67 KB |
Sample Report
| Attachment | Size |
|---|---|
| duplicate file - hash.pdf | 291.53 KB |