MM5.9 Duplicate Detection

 

Demo

chrome_ZqiI00yRcR.mp4

 

Configuration

Duplicate detection has been enhanced with additional detection types.
If duplicate detection is enabled, the chosen duplicate detection types will be used when an asset is uploaded.

In Portal config manager settings:

  1. Click Enable duplicate detection

  2. Choose from the following Duplicate detection types

  • Perceptual hash

  • Sha1 hash

  • Use metadata field

  • Original filename

  3. Depending on the types chosen, customize further options. These will be ignored if the related detection type has not been selected.

    1. Use metadata field > Specify field used for comparison

    2. Original filename > Choose with or without extension

    3. Perceptual hash > Specify threshold for detection

 

Screenshot 2024-01-12 at 11.17.05.png

Detection types

Perceptual hash

Perceptual hash is a hashing algorithm that identifies identical or visually similar images. It is particularly useful for locating duplicates that have been altered, for example through changes in color, minor cropping, subtle retouching, resizing, or skewing.

The perceptual hash threshold determines when an existing image is considered a duplicate of the asset being uploaded. It sets the minimum similarity required for a match. The default is 80%, so only images with a high degree of similarity to the uploaded asset are returned as duplicates. You can adjust the threshold to suit your matching requirements: a higher value returns only closer matches, and a lower value also returns more heavily altered images.
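To illustrate how a similarity threshold can be applied, here is a minimal sketch using a simple "average hash" over small grayscale grids. It is not the algorithm the product uses, which this page does not specify; the function names, the tiny 4x4 sample images, and the choice to treat a similarity equal to the threshold as a match are all assumptions made for the example.

```python
def average_hash(pixels):
    """Return one bit per pixel: 1 if the pixel is brighter than the mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return [1 if value > mean else 0 for value in flat]


def similarity_percent(hash_a, hash_b):
    """Percentage of bit positions on which two equal-length hashes agree."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    differing = sum(a != b for a, b in zip(hash_a, hash_b))
    return 100.0 * (1 - differing / len(hash_a))


def is_duplicate(hash_a, hash_b, threshold=80.0):
    """Assumption: a similarity equal to the threshold counts as a match."""
    return similarity_percent(hash_a, hash_b) >= threshold


# Hypothetical 4x4 grayscale images (0 = black, 255 = white).
original = [
    [10, 10, 200, 200],
    [10, 10, 200, 200],
    [10, 10, 200, 200],
    [10, 10, 200, 200],
]
# Same picture, uniformly brightened.
brightened = [[min(255, value + 20) for value in row] for row in original]
# Same picture with two pixels retouched.
retouched = [row[:] for row in original]
retouched[0][0] = 250
retouched[3][3] = 0
# Inverted picture: every bright pixel becomes dark and vice versa.
inverted = [[255 - value for value in row] for row in original]

base = average_hash(original)
for name, image in [("brightened", brightened),
                    ("retouched", retouched),
                    ("inverted", inverted)]:
    candidate = average_hash(image)
    print(name, similarity_percent(base, candidate), is_duplicate(base, candidate))
```

In this sketch the brightened and retouched copies stay above the default 80% threshold, while the inverted image does not. Raising the threshold to 90 would also exclude the retouched copy.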

Sha1 hash

Sha1 hash uses the SHA-1 hashing algorithm to find files that are exactly identical to the asset being uploaded. Unlike perceptual hash, which identifies visually similar images with slight modifications, Sha1 hash matches only exact copies. Two files with the same SHA-1 value are treated as identical down to the byte, which makes this type suitable when an exact file match is required.
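The comparison described above can be sketched with Python's standard `hashlib` module. This is only an illustration of SHA-1 matching, not the product's implementation; the `sha1_hex` helper and the in-memory sample "files" are assumptions made for the example.

```python
import hashlib
import io


def sha1_hex(stream, chunk_size=65536):
    """Return the SHA-1 hex digest of a binary stream, read in chunks."""
    digest = hashlib.sha1()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


# In-memory stand-ins for an uploaded file and two existing assets.
uploaded_hash = sha1_hex(io.BytesIO(b"abc"))
exact_copy_hash = sha1_hex(io.BytesIO(b"abc"))
edited_copy_hash = sha1_hex(io.BytesIO(b"abd"))  # one byte differs

print(uploaded_hash == exact_copy_hash)   # identical bytes, same hash
print(uploaded_hash == edited_copy_hash)  # a single changed byte, different hash
```

Reading in chunks keeps memory use constant for large assets. A single changed byte produces a different hash, which is why this type does not find resized or re-encoded images.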

Use metadata field

The "Use metadata field" type lets you choose a specific metadata field to use for detecting duplicates.

When you click the "Duplicate detection metadata field" button, you will be presented with the following image:

You can select from a range of metadata fields, all of which are string-based. These fields hold descriptive information about the asset, such as the crop name or the asset's title. The value of the selected field is compared against the corresponding metadata of the asset being uploaded.
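As a rough sketch of this comparison, the example below checks one string field on two assets. The page does not state how values are compared, so exact, case-sensitive equality is an assumption, as are the `metadata_matches` helper, the dictionary layout, and the `title` field name.

```python
def metadata_matches(uploaded, existing, field):
    """True if both assets have the field and the string values are equal.

    Assumption: exact, case-sensitive comparison; a missing field never matches.
    """
    value = uploaded.get(field)
    return value is not None and value == existing.get(field)


# Hypothetical asset metadata.
uploaded_asset = {"title": "Harbour at dawn"}
existing_assets = [
    {"id": 1, "title": "Harbour at dawn"},
    {"id": 2, "title": "Harbour at dusk"},
    {"id": 3},  # no title set
]

duplicate_ids = [asset["id"] for asset in existing_assets
                 if metadata_matches(uploaded_asset, asset, "title")]
print(duplicate_ids)
```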

Original filename

The "Original filename" type identifies duplicates by comparing the filename of the asset being uploaded with the filenames of previously uploaded assets.

The "File extension type" dropdown offers two settings:

  1. Without Extension: The file extension is disregarded during duplicate detection. Assets with a different file extension from the one being uploaded are still considered duplicates if their filenames match.

  2. With Extension: Both the filename and the file extension are taken into account. Only assets with the same filename and the same file extension as the asset being uploaded are returned as matches.

Choose the setting that fits how strictly filenames should match in your workflow.
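The difference between the two settings can be sketched as follows. This is an illustration only: the `filename_key` and `filenames_match` helpers are hypothetical, and the case-sensitive comparison and the rule that only the last suffix counts as the extension are assumptions.

```python
from pathlib import PurePosixPath


def filename_key(filename, with_extension):
    """Return the part of the filename used for comparison."""
    name = PurePosixPath(filename).name
    # .stem drops only the final suffix, e.g. "photo.tar.gz" -> "photo.tar".
    return name if with_extension else PurePosixPath(name).stem


def filenames_match(uploaded, existing, with_extension):
    """Assumption: comparison is exact and case-sensitive."""
    return (filename_key(uploaded, with_extension)
            == filename_key(existing, with_extension))


# Same base name, different extension.
print(filenames_match("sunset.jpg", "sunset.png", with_extension=False))
print(filenames_match("sunset.jpg", "sunset.png", with_extension=True))
# Same base name and same extension.
print(filenames_match("sunset.jpg", "sunset.jpg", with_extension=True))
```

With "Without Extension", `sunset.jpg` and `sunset.png` are treated as the same filename; with "With Extension", only the second `sunset.jpg` matches.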