Trie Data Structure and Finding Patterns in a Collection of Words

- April 17, 2021

I faced a very hard problem recently about finding substrings based on a collection of patterns.

Example

For words = ["Apple", "Melon", "Orange", "Watermelon"] and parts = ["a", "mel", "lon", "el", "An"], the output should be findSubstrings(words, parts) = ["Apple", "Me[lon]", "Or[a]nge", "Water[mel]on"].

While "Watermelon" contains three substrings from the parts array, "a", "mel", and "lon", "mel" is the longest substring that appears first in the string.

Note: Number of words and number of parts can be huge! A brute force method or careless usage of Java String methods would make it impossible to handle a large number of words and parts (patterns).

Solution Approach

Whenever we want to search patterns inside a string, we should think about the Trie data structure:

https://en.wikipedia.org/wiki/Trie

In computer science, a trie, also called prefix tree, is a type of search tree, a tree data structure used for locating specific keys from within a set. These keys are most often strings, with links between nodes defined not by the entire key, but by individual characters.

public class TrieST<Value> {

private static final int R = 256;

private Node root;

private int n;

public static class Node {

public Object val;

public Node[] next = new Node[R];

}

public void put(String key, Value val) {

if (key == null) throw new IllegalArgumentException("first argument to put() is null");

else setRoot(put(getRoot(), key, val, 0));

}

private Node put(Node x, String key, Value val, int d) {

if (x == null) x = new Node();

if (d == key.length()) {

if (x.val == null) n++;

x.val = val;

return x;

}

char c = key.charAt(d);

x.next[c] = put(x.next[c], key, val, d+1);

return x;

}

public Node getRoot() {

return root;

}

public void setRoot(Node root) {

this.root = root;

}

Figuring out how to use this data structure to find patterns in a collection of words is a good exercise. The idea is handling them word by word of course. And we need a for loop based on each character in the word. This loop should also traverse the Trie structure and while doing that we need to keep track of the matches we found. Note that the values inside the trie represents the parts.

Search This Blog

COMPUTER ENGINEERING DIARIES

The WeakReference class, monitoring memory leak and garbage collection in a Java application