Using Sets to Handle Duplication in Large-Scale ETLs

Introduction

We’ve all been in the situation where you’re processing a ton of information for your ETLs, but the data is full of duplicates. There are so many duplicates that you’re running out of memory, or your database updates are taking forever. How do you deal with them? Do you throw all the entries into a list or an array and check each new value against everything you’ve already seen? That gets really expensive really fast as you iterate. Do you check against your database to see whether each value has already been inserted? As you can probably guess, this takes for-ev-er. So what’s an alternative? Using Sets.

Implementation

For this blog entry, we’re going to take a pretty simple example: finding unique words in customer success emails. We later want to perform some sentiment analysis on these words to see how our customers’ perception of us is changing over time. Is this example contrived? Sure, but these principles are easily extendable to any similar issues you might be facing.

What is a set?

A set is a simple data structure that stores each distinct value only once. For example, if we have an array of integers – [1,2,3,3,3,3,4,5,5,6] – and add its elements to a set, the result will be a data structure containing the values [1,2,3,4,5,6]. If we have an array of strings – [‘what’, ‘the’, ‘what’, ‘said’, ‘liz’, ‘lemon’] – the result will be a data structure containing [‘what’, ‘the’, ‘said’, ‘liz’, ‘lemon’].
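
Here’s a quick sketch of both examples in Java; it uses LinkedHashSet rather than a plain HashSet only so that the printed order matches the input order:

import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

public class SetBasics {
	public static void main(String[] args) {
		// Duplicate integers are dropped on insertion
		Set<Integer> numbers = new LinkedHashSet<Integer>(
				Arrays.asList(1, 2, 3, 3, 3, 3, 4, 5, 5, 6));
		System.out.println(numbers); // [1, 2, 3, 4, 5, 6]

		// Same behavior for strings
		Set<String> words = new LinkedHashSet<String>(
				Arrays.asList("what", "the", "what", "said", "liz", "lemon"));
		System.out.println(words); // [what, the, said, liz, lemon]
	}
}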

Sets are also notable in that hash-based implementations are, on average, constant time – O(1) – for inserts, deletions, and lookups. Lists are messier. Linked lists (LinkedList in Java) are technically O(N) for inserts, but appending to the end is constant time since the list keeps a tail reference; they are O(N) for lookups and deletions. Array-backed lists (ArrayList in Java) are amortized O(1) for appends (an individual insert is O(N) only when the underlying array needs to be resized), O(1) for lookup by index, O(N) for lookup by object, and O(N) for deletions.

Given these running times, checking for duplicates against a List for each value we’re processing is an O(N²) algorithm, while checking for duplicates against a Set for each value is O(N). In other words, checking 1 million values for duplicates as they’re added to a List requires on the order of 1 trillion operations; with a Set, checking 1 million values as they’re added requires only on the order of 1 million operations.
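
To make the difference concrete, here is a sketch of both approaches on a toy input (the complexity claims above are about the loop shapes, not this tiny array):

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DedupComparison {
	public static void main(String[] args) {
		String[] input = {"what", "the", "what", "said", "liz", "lemon"};

		// List approach: contains() walks the whole list on every check,
		// so deduplicating N values costs O(N^2) overall
		List<String> uniqueList = new ArrayList<String>();
		for (String word : input) {
			if (!uniqueList.contains(word)) {
				uniqueList.add(word);
			}
		}

		// Set approach: add() rejects duplicates in O(1) per value,
		// so deduplicating N values costs O(N) overall
		Set<String> uniqueSet = new HashSet<String>();
		for (String word : input) {
			uniqueSet.add(word);
		}
	}
}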

Coding it out in Java

This is fine for simple data types, but by default a Java object’s hashCode() is an identity hash – derived from the object instance itself, not its contents – and its default equals(Object) is reference equality. That means two objects holding exactly the same data will still look distinct to a HashSet. So how do we get around this?

Every Object in Java has two methods, hashCode() and equals(Object o), that must be overridden before a Set will know how to handle the object correctly. When adding, getting, or removing an entry, a HashSet first looks for existing entries with the same hashCode() value, and then uses equals() to check whether any of them is actually the same value. The two methods must be consistent: objects that are equal must return the same hash code. Collisions – different values sharing a hash code – are allowed, and equals() is what resolves them.

The class for the customer success word will look like the following:

public class CustomerSuccessWord {
	private final int successId;
	private final String word;
	private final int hashCode;

	public CustomerSuccessWord(int successId, String word) {
		this.successId = successId;
		this.word = word;
		// Precompute the hash once; the tab separator keeps
		// pairs like (12, "ab") and (1, "2ab") from colliding
		this.hashCode = (successId + "\t" + word).hashCode();
	}

	@Override
	public int hashCode() {
		return this.hashCode;
	}

	@Override
	public boolean equals(Object o) {
		// Two entries are equal when they come from the same
		// customer success email and hold the same token
		boolean retval = false;
		if (o instanceof CustomerSuccessWord) {
			CustomerSuccessWord csw = (CustomerSuccessWord) o;
			retval = (this.successId == csw.successId)
					&& (this.word.equals(csw.word));
		}
		return retval;
	}
}
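
A quick way to sanity-check the overrides (the id and word here are made up): two logically identical instances should collapse to a single entry.

Set<CustomerSuccessWord> check = new HashSet<CustomerSuccessWord>();
check.add(new CustomerSuccessWord(12, "great"));
check.add(new CustomerSuccessWord(12, "great")); // same successId and word
System.out.println(check.size()); // prints 1, not 2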

Assuming we’re pulling this information from a database, the code that puts this information into the set will look something like:


Set<CustomerSuccessWord> uniqueWords = new HashSet<CustomerSuccessWord>();
Statement statement = connection.createStatement();
ResultSet results = statement.executeQuery("select id, text from customer_success");
while (results.next()) {
	int customerSuccessId = results.getInt("id");
	String text = results.getString("text");
	// Split the email body on whitespace; each token is a candidate word
	StringTokenizer tok = new StringTokenizer(text);
	while (tok.hasMoreTokens()) {
		String word = tok.nextToken();
		// add() is a no-op if an equal word from the same email is already present
		uniqueWords.add(new CustomerSuccessWord(customerSuccessId, word));
	}
}
results.close();
statement.close();
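
One refinement worth considering: on Java 7 and later, try-with-resources closes the Statement and ResultSet automatically, even if the query throws partway through. The same loop would look something like:

Set<CustomerSuccessWord> uniqueWords = new HashSet<CustomerSuccessWord>();
try (Statement statement = connection.createStatement();
		ResultSet results = statement.executeQuery("select id, text from customer_success")) {
	while (results.next()) {
		int customerSuccessId = results.getInt("id");
		StringTokenizer tok = new StringTokenizer(results.getString("text"));
		while (tok.hasMoreTokens()) {
			uniqueWords.add(new CustomerSuccessWord(customerSuccessId, tok.nextToken()));
		}
	}
}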

Conclusion

Using these techniques, you can ensure that duplicates are handled quickly and efficiently, minimizing both your memory footprint and, potentially, the number of database inserts you have to make.