I'll explain our intent with an example. Let's say we have the following words:
horse, zebra, cat, bear, table
Which one is the outcast? The table. But how can a computer system find this? We are talking about the whole English language, and any word can appear in such a list. The number of words in the list can be huge, so the solution should use appropriate data structures and algorithms.
We'll use WordNet, which is a graph of English words. It contains the semantic relationships between words, and our concern is the "is a" relationship. For our example, we know that horse and cat are both animals. "Table" is not an animal, so it is the outcast in this situation. But we can even tell that horse and zebra are very close.
Below is a unit test that shows a high-level view of our application in Java:
public class OutcastTest {

    private WordNet wordNet;
    private Outcast outcast;

    @Before
    public void setup() {
        wordNet = new WordNet("C:\\wordnet\\synsets.txt",
                "C:\\wordnet\\hypernyms.txt");
        outcast = new Outcast(wordNet);
    }

    @Test
    public void shouldFindOutcast() {
        String[] nouns = new String[] { "horse", "zebra", "cat", "bear", "table" };
        assertEquals("table", outcast.outcast(nouns));
    }
}
So we have an Outcast class that uses the WordNet graph to find distances between words. What is distance? Words have common ancestors, and we can measure how far one word is from another along their ancestral paths. This is the essence of our program: the distance between horse and zebra is less than the distance between horse and table.
Below is the Outcast class; finding the outcast is the easy part of the program.
public class Outcast {

    private WordNet wordNet;

    public Outcast(WordNet wordnet) {
        if (wordnet == null) {
            throw new IllegalArgumentException();
        }
        this.wordNet = wordnet;
    }

    public String outcast(String[] nouns) {
        if (nouns == null) {
            throw new IllegalArgumentException();
        }
        String result = nouns[0];
        int maxTotal = Integer.MIN_VALUE;
        for (int i = 0; i < nouns.length; ++i) {
            int totalDistance = 0;
            for (int j = 0; j < nouns.length; ++j) {
                if (i == j) {
                    continue;
                }
                totalDistance += wordNet.distance(nouns[i], nouns[j]);
            }
            // the outcast is the noun with the maximum total distance to the others
            if (totalDistance > maxTotal) {
                maxTotal = totalDistance;
                result = nouns[i];
            }
        }
        return result;
    }
}
Now let's have a look at the WordNet class. Its first duty is reading the graph from the text files and creating a directed graph. I'll not explain the details of the file formats, but the hypernyms.txt file includes the relationships between words. It is important to note that a word may appear in many places in the graph depending on the context. For example, horse has the animal meaning that we all know, but it also means cavalier. This means we need to consider all the paths that connect two words and find the shortest distance.
Important note: this application is not designed with multithreading in mind. For educational simplicity, it is assumed to run single-threaded.
public class WordNet {

    private Digraph graph;
    private SAP sap;
    // filled while reading the input files; declared here so the later methods compile
    private List<String> synsetList = new ArrayList<>();
    private Set<String> nouns = new HashSet<>();
    private Map<String, List<Integer>> idMap = new HashMap<>();

    public WordNet(String synsets, String hypernyms) {
        if (synsets == null || hypernyms == null) {
            throw new IllegalArgumentException();
        }
        readSynsets(synsets);
        graph = new Digraph(synsetList.size());
        readHypernyms(hypernyms);
        CycleDetector cycleDetector = new CycleDetector(graph);
        if (cycleDetector.hasCycle()) {
            throw new IllegalArgumentException();
        }
        if (cycleDetector.multipleRoot()) {
            throw new IllegalArgumentException();
        }
        sap = new SAP(graph);
    }

    //...
}
So the constructor reads the words and their relationships, creates a directed graph, checks for cycles and multiple roots, and finally creates a class called SAP (Shortest Ancestral Path). I'll not include the details of reading from the text files.
Now let's have a look at finding the distance:
public int distance(String nounA, String nounB) {
    if (!nouns.contains(nounA)) {
        throw new IllegalArgumentException(nounA + " is not a WordNet noun!");
    }
    if (!nouns.contains(nounB)) {
        throw new IllegalArgumentException(nounB + " is not a WordNet noun!");
    }
    List<Integer> idsA = idMap.get(nounA);
    List<Integer> idsB = idMap.get(nounB);
    return sap.length(idsA, idsB);
}
While reading the words from the input files, some data structures are filled to ensure fast retrieval. For example, idMap gives the ids of a noun in the WordNet graph. The main calculation is done by the SAP class, of course.
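The details of readSynsets are omitted above, so here is a minimal sketch of how idMap and nouns could be filled. The line format "id,space-separated nouns,gloss" and the class and method names are my assumptions for illustration; they are not part of the original program.

```java
import java.util.*;

public class SynsetIndexSketch {
    // noun -> list of synset ids in which it appears (a noun can be polysemous)
    static Map<String, List<Integer>> idMap = new HashMap<>();
    static Set<String> nouns = new HashSet<>();

    // parses one line of the assumed form "id,noun1 noun2 ...,gloss"
    static void addSynsetLine(String line) {
        String[] parts = line.split(",", 3); // limit 3: the gloss may contain commas
        int id = Integer.parseInt(parts[0]);
        for (String noun : parts[1].split(" ")) {
            nouns.add(noun);
            idMap.computeIfAbsent(noun, k -> new ArrayList<>()).add(id);
        }
    }

    public static void main(String[] args) {
        // hypothetical ids and glosses, just to show the polysemy of "horse"
        addSynsetLine("34,horse Equus_caballus,solid-hoofed herbivorous quadruped");
        addSynsetLine("35,horse knight,a chessman shaped like a horse's head");
        System.out.println(idMap.get("horse")); // prints [34, 35]
    }
}
```

Because "horse" appears in two synsets, a distance query must later consider both ids, which is why distance() passes lists of ids to the SAP class.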
The algorithm utilized by SAP (Shortest Ancestral Path) is breadth-first search on a directed graph.
Now let's have a look at the SAP class:
public class SAP {

    private Digraph graph;
    private Map<Integer, Ancestor> ancestors = new HashMap<>();

    public SAP(Digraph G) {
        if (G == null) {
            throw new IllegalArgumentException();
        }
        // defensive copy!
        graph = new Digraph(G);
    }

    //...

    // a common ancestor of v and w that participates in a shortest ancestral path; -1 if no such path
    private Ancestor calculateAncestor(int v, int w) {
        // valid vertex indices are 0..V-1, so reject v == graph.V() as well
        if (v < 0 || v >= graph.V() || w < 0 || w >= graph.V()) {
            throw new IllegalArgumentException();
        }
        if (v == w) {
            return new Ancestor(w, v, 0);
        }
        Ancestor cachedAncestor = ancestors.get(v);
        // return if cached
        if (cachedAncestor != null && cachedAncestor.w == w) {
            return cachedAncestor;
        }
        int[] color = new int[graph.V()];
        IntWrapper ancestor = new IntWrapper(-1);
        ColoringBreadthFirstSearch bfs0 = new ColoringBreadthFirstSearch(
                graph, v, 0, color, new int[graph.V()], ancestor);
        ColoringBreadthFirstSearch bfs1 = new ColoringBreadthFirstSearch(
                graph, w, 1, color, bfs0.distTo, ancestor);
        int lengthValue = -1;
        if (ancestor.value != -1) {
            lengthValue = bfs0.distTo[ancestor.value] + bfs1.distTo[ancestor.value];
        }
        Ancestor calculatedAncestor = new Ancestor(w, ancestor.value, lengthValue);
        ancestors.put(v, calculatedAncestor);
        return calculatedAncestor;
    }

    //...
}
Here we see that the calculateAncestor method is a bit complicated. There is room for simplification, and maybe I'll work on that later, but the main idea is performing breadth-first search twice to find the common ancestor. There are different algorithms for finding the common ancestor.
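To make the two-BFS idea concrete, here is a simplified, uncached sketch of the same approach on a toy digraph (plain adjacency lists instead of the Digraph class; the vertex numbering and word labels are made up): run BFS from v and from w, then scan all vertices for one that both can reach with minimum total distance.

```java
import java.util.*;

public class SapSketch {
    // BFS distances from s over a digraph given as adjacency lists;
    // unreachable vertices keep Integer.MAX_VALUE
    static int[] bfsDistances(List<List<Integer>> adj, int s) {
        int[] dist = new int[adj.size()];
        Arrays.fill(dist, Integer.MAX_VALUE);
        Deque<Integer> q = new ArrayDeque<>();
        dist[s] = 0;
        q.add(s);
        while (!q.isEmpty()) {
            int v = q.remove();
            for (int w : adj.get(v)) {
                if (dist[w] == Integer.MAX_VALUE) {
                    dist[w] = dist[v] + 1;
                    q.add(w);
                }
            }
        }
        return dist;
    }

    // length of the shortest ancestral path between v and w, or -1 if none exists
    static int sapLength(List<List<Integer>> adj, int v, int w) {
        int[] dv = bfsDistances(adj, v);
        int[] dw = bfsDistances(adj, w);
        int best = -1;
        for (int a = 0; a < adj.size(); a++) {
            if (dv[a] != Integer.MAX_VALUE && dw[a] != Integer.MAX_VALUE) {
                int len = dv[a] + dw[a];
                if (best == -1 || len < best) best = len;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // toy "is a" graph: 1 (horse) -> 0 (animal), 2 (zebra) -> 0, 3 (table) -> 4 (furniture)
        List<List<Integer>> adj = new ArrayList<>();
        for (int i = 0; i < 5; i++) adj.add(new ArrayList<>());
        adj.get(1).add(0);
        adj.get(2).add(0);
        adj.get(3).add(4);
        System.out.println(sapLength(adj, 1, 2)); // horse and zebra meet at animal: prints 2
        System.out.println(sapLength(adj, 1, 3)); // no common ancestor in this toy graph: prints -1
    }
}
```

The real SAP class adds coloring, caching, and support for sets of synset ids, but the core computation is this pair of breadth-first searches.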
We'll go on by explaining breadth-first search.
public class BreadthFirstSearch {

    private boolean[] marked; // is there an s->v path?
    private int[] edgeTo;     // edgeTo[v] = last edge on shortest s->v path
    private int[] distTo;     // distTo[v] = length of shortest s->v path

    public BreadthFirstSearch(Digraph G, int s) {
        marked = new boolean[G.V()];
        distTo = new int[G.V()];
        edgeTo = new int[G.V()];
        for (int v = 0; v < G.V(); v++)
            distTo[v] = Integer.MAX_VALUE;
        validate(s);
        bfs(G, s);
    }

    private void bfs(Digraph G, int s) {
        Queue<Integer> q = new Queue<Integer>();
        marked[s] = true;
        distTo[s] = 0;
        q.enqueue(s);
        while (!q.isEmpty()) {
            int v = q.dequeue();
            for (int w : G.adj(v)) {
                if (!marked[w]) {
                    edgeTo[w] = v;
                    distTo[w] = distTo[v] + 1;
                    marked[w] = true;
                    q.enqueue(w);
                }
            }
        }
    }

    public boolean hasPathTo(int v) {
        validate(v);
        return marked[v];
    }

    public int distTo(int v) {
        validate(v);
        return distTo[v];
    }

    public Iterable<Integer> pathTo(int v) {
        validate(v);
        if (!hasPathTo(v)) return null;
        Stack<Integer> path = new Stack<Integer>();
        int x;
        for (x = v; distTo[x] != 0; x = edgeTo[x])
            path.push(x);
        path.push(x);
        return path;
    }
}
This algorithm starts from the given vertex (the source) and visits every reachable vertex along a shortest path. While traversing, it saves the collected information in arrays.
Below are the steps:
- Add the source vertex to the queue.
- Dequeue a vertex and add all its neighbors that are not marked as "visited" to the queue.
- Do this until the queue is empty.
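The steps above can be sketched as a self-contained example, using plain adjacency lists and an ArrayDeque instead of the algs4 Digraph and Queue classes (the example graph is made up):

```java
import java.util.*;

public class BfsSteps {
    // returns the shortest-path distance from source to every vertex
    static int[] distances(List<List<Integer>> adj, int source) {
        int[] distTo = new int[adj.size()];
        boolean[] marked = new boolean[adj.size()];
        Deque<Integer> queue = new ArrayDeque<>();

        // step 1: add the source vertex to the queue
        marked[source] = true;
        queue.add(source);

        // steps 2-3: dequeue, enqueue unvisited neighbors, repeat until the queue is empty
        while (!queue.isEmpty()) {
            int v = queue.remove();
            for (int w : adj.get(v)) {
                if (!marked[w]) {
                    marked[w] = true;
                    distTo[w] = distTo[v] + 1;
                    queue.add(w);
                }
            }
        }
        return distTo;
    }

    public static void main(String[] args) {
        // digraph 0->1, 0->2, 1->3, 2->3 as adjacency lists
        List<List<Integer>> adj = List.of(
                List.of(1, 2), List.of(3), List.of(3), List.of());
        System.out.println(Arrays.toString(distances(adj, 0))); // prints [0, 1, 1, 2]
    }
}
```

Vertices 1 and 2 are both one edge from the source, so they are dequeued before vertex 3; this level-by-level order is exactly why BFS distances are shortest.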
We first visit all the neighbors, and only after that do we visit the neighbors of the neighbors. This is the opposite of depth-first search, where we first go to the deepest vertex. This is done by using a stack (or recursive calls): after a neighbor is pushed onto the stack, it is popped and all its neighbors are pushed, and so on.
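For contrast, here is a minimal iterative depth-first search over the same kind of adjacency-list digraph, with an explicit stack in place of the queue (a sketch for comparison, not part of the original program):

```java
import java.util.*;

public class DfsSketch {
    // returns the order in which DFS visits the vertices
    static List<Integer> dfsOrder(List<List<Integer>> adj, int source) {
        boolean[] marked = new boolean[adj.size()];
        Deque<Integer> stack = new ArrayDeque<>();
        List<Integer> visitOrder = new ArrayList<>();

        stack.push(source);
        while (!stack.isEmpty()) {
            int v = stack.pop();
            if (marked[v]) continue; // a vertex may be pushed more than once
            marked[v] = true;
            visitOrder.add(v);
            // pushing neighbors means the most recently seen vertex is explored first
            for (int w : adj.get(v)) stack.push(w);
        }
        return visitOrder;
    }

    public static void main(String[] args) {
        // digraph 0->1, 0->2, 1->3, 2->3
        List<List<Integer>> adj = List.of(
                List.of(1, 2), List.of(3), List.of(3), List.of());
        System.out.println(dfsOrder(adj, 0)); // prints [0, 2, 3, 1]
    }
}
```

Note how DFS reaches vertex 3 before vertex 1: swapping the queue for a stack changes the traversal from level-by-level to deepest-first.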
SUMMARY:
The graph data structure is used in many domains, and one of them is computational linguistics. When we create a semantic graph of words, we can use the breadth-first search algorithm to compute important semantic information.