This article explains how segment delimitation rules work in Déjà Vu. For information about how to modify the delimitation rules, read this article.
Déjà Vu splits text into segments automatically. It reads through the text until it finds a place that matches one of its segmentation rules, and marks that place as a candidate for a split. Before performing the split, however, Déjà Vu checks whether the same place also matches one of its exception rules. If it does, Déjà Vu does not split there and simply keeps reading.
To define delimitation rules, you can use any literal character, plus a few symbols that Déjà Vu recognizes as standing for special characters or groups of characters:
|Symbol|Meaning|
|---|---|
|^#|a digit (1, 2, 3...)|
|^$|a letter (upper-case, lower-case, or any case)|
|^a|a lower-case letter|
|^A|an upper-case letter|
|^^|the caret character (^) itself|
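To make the symbol table concrete, here is a minimal Python sketch of how such patterns could be interpreted. It is not Déjà Vu's actual implementation: the names `SYMBOLS`, `parse_pattern`, and `matches` are hypothetical, each symbol is assumed to match exactly one character, and `^w` (white space) is taken from the examples later in this article.

```python
# Hypothetical sketch, not Déjà Vu's real code.
# Assumption: every symbol or literal in a pattern matches exactly one character.
SYMBOLS = {
    "#": str.isdigit,  # ^# a digit
    "$": str.isalpha,  # ^$ a letter of any case
    "a": str.islower,  # ^a a lower-case letter
    "A": str.isupper,  # ^A an upper-case letter
    "w": str.isspace,  # ^w a white space
}


def parse_pattern(pattern):
    """Turn a rule string into a list of one-character tests."""
    tests = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "^" and i + 1 < len(pattern):
            code = pattern[i + 1]
            if code == "^":
                tests.append(lambda c: c == "^")  # ^^ is a literal caret
            elif code in SYMBOLS:
                tests.append(SYMBOLS[code])
            else:
                raise ValueError(f"unknown symbol ^{code}")
            i += 2
        else:
            # Any other character stands for itself.
            tests.append(lambda c, lit=ch: c == lit)
            i += 1
    return tests


def matches(pattern, text):
    """True if `text` matches `pattern` character by character."""
    tests = parse_pattern(pattern)
    return len(tests) == len(text) and all(t(c) for t, c in zip(tests, text))


print(matches("!^w", "! "))  # True: exclamation mark, then white space
print(matches("^A", "a"))    # False: ^A needs an upper-case letter
print(matches("A", "B"))     # False: a plain A only matches the letter A
```

Note the difference between `A` and `^A` in the last two calls: the first is a literal letter, the second stands for any upper-case letter.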
Examples of the symbols in use
Here are the default delimiters, and their exceptions, that Déjà Vu uses for American English:
Let's look at the first rule. The character "!" stands for itself, an exclamation mark. The symbol ^w represents a white space. This means that whenever Déjà Vu finds an exclamation mark followed by a white space, it will split the sentence after the exclamation mark and the space. Therefore, the text:
Will be split into:
Notice that there are two columns for each rule, "Before split" and "After split". In the "Before split" column you put what Déjà Vu should look for right before the place where the split would happen, and in the "After split" column you put what Déjà Vu should look for right after that place. To illustrate how this works, suppose we had this rule instead:
In that case, Déjà Vu would split text at a place where there is an exclamation mark followed by a space before the split, and a capital letter A after the split (remember that A and ^A are different things!). With this rule:
Will not be split. However:
Will be split, right before the capital A.
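The interplay of the two columns can be sketched in Python with a lookbehind for "Before split" and a lookahead for "After split". This is only an illustration under assumptions: the function name `split_positions` is hypothetical, the rule is hand-translated to a regular expression (`^w` becomes `\s`), and the two sample sentences are invented for this sketch.

```python
import re


def split_positions(text, before, after):
    """Offsets where `before` ends exactly at the offset and `after` starts there.

    `before` and `after` are regular expressions. Python lookbehinds must be
    fixed-width, which holds here because each symbol matches one character.
    """
    probe = re.compile(f"(?<={before})(?={after})")
    return [m.start() for m in probe.finditer(text)]


# "!^w" before the split, the literal letter "A" after it.
RULE = (r"!\s", "A")

print(split_positions("Stop! Bring it here.", *RULE))  # []
print(split_positions("Stop! All of it.", *RULE))      # [6]
```

The second sentence yields offset 6, the position right before the capital A, so the text would be cut into "Stop! " and "All of it.".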
Exceptions are applied right after the rules. If Déjà Vu finds text that matches one of the rules, it checks whether that text also matches an exception. If it does, Déjà Vu will not split; if it doesn't, Déjà Vu goes ahead and splits the text.
Let's look at the first exception. It says that Déjà Vu must make an exception if there is an exclamation mark followed by a white space before the place where the split would happen, and a lower-case letter after the place where the split would happen. Without the exception in effect, text like:
Use the big! service.
Would be split after the exclamation mark and the space, but since the exception exists, it will not be split. If the word "service" began with an upper-case S, the text would have been split.
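The rule-then-exception check can be sketched as a small Python segmenter. This is a simplified model rather than Déjà Vu's own logic: `hits` and `segment` are hypothetical names, the patterns are hand-translated to regular expressions (`^w` as `\s`, `^a` as the ASCII range `[a-z]`), and the rule's "After split" column is assumed to be empty.

```python
import re

# (before split, after split) pairs, hand-translated to regex for this sketch.
RULES = [(r"!\s", "")]            # "!^w" before; assumed empty after
EXCEPTIONS = [(r"!\s", "[a-z]")]  # "!^w" before; "^a" after (ASCII only here)


def hits(pairs, text, pos):
    """True if any pair matches around `pos`: before ends at pos, after starts at pos."""
    return any(
        re.search(f"(?:{before})\\Z", text[:pos]) and re.match(after, text[pos:])
        for before, after in pairs
    )


def segment(text, rules=RULES, exceptions=EXCEPTIONS):
    """Split wherever a rule matches, unless an exception matches at the same place."""
    segments, start = [], 0
    for pos in range(1, len(text)):
        if hits(rules, text, pos) and not hits(exceptions, text, pos):
            segments.append(text[start:pos])
            start = pos
    segments.append(text[start:])
    return segments


print(segment("Use the big! service."))  # ['Use the big! service.']
print(segment("Use the big! Service."))  # ['Use the big! ', 'Service.']
```

With the lower-case "service" the exception blocks the split; with an upper-case "Service" only the rule matches, so the text is cut after the exclamation mark and the space.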
Uses of rules and exceptions
Avoid splitting a sentence at "P.O. Box"
As an example of what can be done by creating your own rules and exceptions, let's consider what happens if you have text that contains the words "P.O. Box", such as:
Acme can make deliveries to a P.O. Box as well as a physical address.
With the default rules for American English, that will be split into:
Acme can make deliveries to a P.O.
Box as well as a physical address.
This happens because Déjà Vu finds a full stop followed by a white space (after "P.O."), which means it will split after that, and the text that follows ("Box") does not match any of the exceptions. How could this be avoided? Consider the following exception:
With that exception in effect, when Déjà Vu determines that the position after "P.O. " but before "Box" is a candidate for splitting, it will ask:
Does the text before the place where I will split contain the letters "P.O." followed by a whitespace?
It does. It will also ask:
Does the text after the place where I will split contain the letters "Box"?
Indeed, it does. Therefore, Déjà Vu will not split the text here, and instead it will move on to find some other place to split further along the text.
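The two questions above map directly onto two string checks at the candidate position. The following Python sketch models only the "full stop followed by a white space" rule, treats the white space as a plain space character, and uses the hypothetical names `split_after_full_stops` and `PO_BOX`. It illustrates the idea and is not a description of Déjà Vu's internals.

```python
def split_after_full_stops(text, exceptions=()):
    """Split after every full stop + space, unless an exception matches there.

    Each exception is a (before, after) pair of literal strings.
    """
    segments, start = [], 0
    for pos in range(1, len(text)):
        if not text.endswith(". ", 0, pos):
            continue  # not a candidate split point
        # Question 1: does the text before pos end with `before`?
        # Question 2: does the text after pos start with `after`?
        if any(text.endswith(before, 0, pos) and text.startswith(after, pos)
               for before, after in exceptions):
            continue  # exception applies: keep reading
        segments.append(text[start:pos])
        start = pos
    segments.append(text[start:])
    return segments


TEXT = "Acme can make deliveries to a P.O. Box as well as a physical address."
PO_BOX = ("P.O. ", "Box")

print(split_after_full_stops(TEXT))            # two segments, cut after "P.O. "
print(split_after_full_stops(TEXT, [PO_BOX]))  # one segment, left intact
```

Because the exception names both sides of the split, it only protects "P.O." when it is followed by "Box"; a sentence that genuinely ends in "P.O." would still be split.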
Avoid splitting a German sentence at "z.B."
What if you were translating German text, and this text happened to contain the abbreviation "z.B."? To avoid having Déjà Vu split the text after each instance of this abbreviation, you could use the following exception:
With this exception created, every time Déjà Vu finds the abbreviation "z.B." and considers splitting the text right after it, it will recognize that the abbreviation matches this exception and will leave the text unsplit.