Recently in one of my projects I had to remove duplicates from an array. Instead of looking up a npm package that could do it for me I chose to implement it myself. Here’s what I learned about this seemingly straightforward task.
The easiest way
The Set structure introduced in ES6 has made it unbelievably easy to remove duplicates from an array. It can be done with this one-liner:
function removeDuplicates<T>(collection: T[]) {
  return [...new Set(collection)];
}
The Set constructor takes an iterable (such as an array) and stores only its unique elements. Set can handle both primitives and references; however, it detects duplicates using the === comparison, so two objects are only considered duplicates when they reference the same place in memory.
Once Set has removed the duplicates, we can transform it back into an array using the spread operator. A Set instance implements the [Symbol.iterator] method, which makes the object iterable and therefore possible to spread into an array.
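As a quick illustration (not from the original example), here is how Set treats primitives versus object references:

```typescript
// Primitives are deduplicated by value:
const nums = [...new Set([1, 2, 2, 3, 3, 3])];
console.log(nums); // [ 1, 2, 3 ]

// Objects are compared by reference: the repeated reference `a`
// collapses into one entry, but the structurally identical
// literal { id: 1 } is kept as a separate element.
const a = { id: 1 };
const objs = [...new Set([a, a, { id: 1 }])];
console.log(objs.length); // 2
```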
A significant downside
The problem is that in a typical application we almost never check whether we deal with the same object based on its reference. Objects are mostly used to represent some entity and its properties. Those entities are usually stored in the database and have some kind of identifier that can be used to differentiate between them.
Let’s take a look at this scenario:
We have an array of objects that represent some entities, each having a unique property called id. After an API call, three new objects need to be saved inside the array. We don't want to store the same entity twice, so we call our previously defined function to remove the duplicates:
const state = [{ id: 1 }, { id: 2 }, { id: 3 }];
const incomingObjects = [{ id: 1 }, { id: 4 }, { id: 5 }];
removeDuplicates([...state, ...incomingObjects]);
// => [ { id: 1 }, { id: 2 }, { id: 3 }, { id: 1 }, { id: 4 }, { id: 5 } ]
The function works as it is supposed to - it removes objects with the same reference - but in our case two of the objects represented the same entity without sharing a reference. We would like to be able to choose what the differentiation is based on. With this specific behavior in mind, let's create a new function.
Removing duplicates on our own
As we have seen before, objects usually have some kind of an identifier field that can be used to distinguish the object. If an object doesn’t have such information stored in one of its properties then there really is no other way than to check the reference.
The basic idea for our function is as follows: a function provided as the second argument will return an identifier by which we will check for duplicates.
So let's create a new removeDuplicates function:
type GetIdentifier<T> = (value: T) => any;

const defaultGetIdentifier: GetIdentifier<any> = (val) => val;

function removeDuplicates<T>(
  collection: T[],
  getIdentifier: GetIdentifier<T> = defaultGetIdentifier
) {
  const identifierState: { [identifier: string]: boolean } = {};

  return collection.filter((value) => {
    const identifier = String(getIdentifier(value));

    if (identifierState[identifier]) {
      return false;
    }

    identifierState[identifier] = true;
    return true;
  });
}
Our function takes two arguments: collection, an array of elements, and, optionally, getIdentifier - a function that takes an element of the collection and returns its identifier.
Inside the identifierState object we keep the identifiers; for each identifier we store a boolean that, once set to true, prevents all future values with the same identifier from being saved in the resulting array.
The identifier is coerced to a string because we have to store it as a key inside the identifierState object.
Finally, we use the Array.prototype.filter() method to make sure we return a new array. In the callback we first get the identifier of the current value and check whether it already exists in the state. If it does, we return false, thereby excluding the value from the resulting array. If it doesn't, we know that the value, according to our identifier of choice, isn't a duplicate, so we set its identifier to true in the state and return true for the value to be kept in the final array.
Let’s test it out
Primitive values
The second argument defaults to the defaultGetIdentifier function, which works great with primitive values because it returns the value itself as its identifier:
removeDuplicates([1, 2, 1, 1, 1, 2, 2, 1, 2, 1, 2, 1, 2]); // => [ 1, 2 ]
Objects
Coming back to the previous example:
const state = [{ id: 1 }, { id: 2 }, { id: 3 }];
const incomingObjects = [{ id: 1 }, { id: 4 }, { id: 5 }];
removeDuplicates([...state, ...incomingObjects], (val) => val.id);
// => [ { id: 1 }, { id: 2 }, { id: 3 }, { id: 4 }, { id: 5 } ]
By mapping each object to its id property, the objects can be correctly identified, and thus the objects with id equal to 1 are not repeated.
The trouble is that now we can't compare objects by their reference. If we don't pass a specific function as the second argument, the default getIdentifier simply returns the object itself. The following line then sets identifier to the string [object Object]:
const identifier = String(getIdentifier(value));
By this criterion all objects are the same, so the resulting array will only contain the first one. Dealing with this is a topic for a different time, however.
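To make the pitfall concrete, here is a small sketch (the function from above is repeated so the snippet is self-contained):

```typescript
type GetIdentifier<T> = (value: T) => any;

const defaultGetIdentifier: GetIdentifier<any> = (val) => val;

function removeDuplicates<T>(
  collection: T[],
  getIdentifier: GetIdentifier<T> = defaultGetIdentifier
) {
  const identifierState: { [identifier: string]: boolean } = {};
  return collection.filter((value) => {
    const identifier = String(getIdentifier(value));
    if (identifierState[identifier]) {
      return false;
    }
    identifierState[identifier] = true;
    return true;
  });
}

// With the default identifier, every object coerces to the
// string "[object Object]", so only the first object survives:
const result = removeDuplicates([{ id: 1 }, { id: 2 }, { id: 3 }]);
console.log(result); // [ { id: 1 } ]
```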
Comparing the two solutions
Due to the differences in how our functions work, we can only compare their efficiency on arrays that contain primitive values.
In an array of a million randomly generated numbers between 1 and 1000, the duplicates were removed in the following times:
solution with identifiers: 52.798ms
solution with set: 23.345ms
Keep in mind that you will probably never have to deal with arrays of this magnitude, and for arrays with, say, 100 elements the difference is hardly noticeable.
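The exact benchmark code isn't shown above; a minimal sketch of how such a measurement could be run (timings will of course vary by machine) might look like this:

```typescript
// Build a large array of random integers between 1 and 1000.
const data = Array.from({ length: 1_000_000 }, () =>
  Math.floor(Math.random() * 1000) + 1
);

// Solution with Set:
console.time("set");
const bySet = [...new Set(data)];
console.timeEnd("set");

// Solution with identifiers:
console.time("identifiers");
const seen: { [key: string]: boolean } = {};
const byIdentifier = data.filter((value) => {
  const key = String(value);
  if (seen[key]) return false;
  seen[key] = true;
  return true;
});
console.timeEnd("identifiers");

// Both approaches keep the same unique values in first-occurrence order.
console.log(bySet.length === byIdentifier.length); // true
```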
Conclusion
The Set solution is both simpler and faster, but it falls short when it comes to differentiating objects based on anything other than reference. The second solution is far more flexible and can be applied to more complicated cases; it handles essentially everything aside from comparing references. It is always important to know the cases in which a given function is going to be used and to prepare accordingly. Filtering objects by reference is very rare nowadays, so I would just stick with the second solution.
Also available on Medium.