Tutorial

The tutorial will show you how to setup and use basic functionality of the Convenient Regular Expressions (CRE for short).

We will be using node.js (with npm), JavaScript and ES modules. See the "Installing" section in the documentation for other options.

Alternatively, if you don't want to install anything yet, you can use our on-line demo page to run the code. Use try it buttons to open the sample.

Before you begin you should know basics of the standard JavaScript's regular expressions (RegExp object).

Install

Start by creating a new project.

mkdir cre-tutorial
cd cre-tutorial
npm init

Now, install the Convenient Regular Expressions con-reg-exp package.

npm install con-reg-exp

Import

Create a new script file in the project directory and open it in your favorite editor. Let's call it tutorial-1.mjs. The .mjs file extension is recommended since you will be using ES modules.

Use import statement to import the con-reg-exp module:

import cre from "con-reg-exp";
try it

The only identifier that you need to import is cre. The module does not export anything else, so the above line will never change.

Note

Some packages manager may have issues with default importing cre. In such cases you can try importing by name:

import { cre } from "con-reg-exp";
try it

Your first Convenient Regular Expression

The cre identifier is a tag function that you will use in tagged template containing actual expression. The syntax is following:

const myFirstExp = cre`... expression goes here ...`;

As the result, the myFirstExp constant will contain a standard RegExp object created from the expression inside backticks `...`.

The simplest expression in CRE is a string literal, which looks the same as in JavaScript. We can use it to create an expression matching specific word.

const myFirstExp = cre`"World"`;

Let's use it to do string replacement. The final tutorial-1.mjs file will look like this:

import cre from "con-reg-exp";

const myFirstExp = cre`"World"`;

const input = "Hello World!!!";
const result = input.replace(myFirstExp, "Convenient Regular Expressions");

console.log(result);
try it

Now, run it:

node tutorial-1.mjs

The expected output:

Hello Convenient Regular Expressions!!!

The myFirstExp constant contains the standard regular expression. To verify what was created from our expression we can print it. Add the following line at the end of your script:

console.log(myFirstExp);
try it

Run the script again and the expected output is now:

Hello Convenient Regular Expressions!!!
/World/msu

There are some aliases that you can use instead of writing it explicitly in quotes, for example cre`"\n"` can be replaced by cre`new-line` or cre`nl`.

Flags

You may notice in the example above that the CRE added some flags to the RegExp object. The m and s flags are controlled automatically by the CRE based on your expression. You don't need to worry about them. You can control the remaining flags by specifying them after the cre tag.

The flags syntax is following:

const pattern = cre.flag1.flag2.flag3`... expression ...`;

You can read more about flags in the documentation. For now, we will use ignoreCase flag to allow both upper and lower case letters in our replacement:

const myFirstExp = cre.ignoreCase`"world"`;
try it

Run the script and the expected output is now:

Hello Convenient Regular Expressions!!!
/world/imsu

Character classes

Another simple expression in the CRE is character class that works exactly the same and has the same syntax as standard RexExp character class.

Use it now to remove all the non-english letters from a string. Add the following line to the end of your script:

console.log(result.replace(cre.global`[^a-zA-Z]`, ''));
try it

Run it and you should see:

Hello Convenient Regular Expressions!!!
/world/imsu
HelloConvenientRegularExpressions

You may noticed that we added global flag. It works exactly the same as g RegExp flag.

Character class keywords

You can use keywords that defines a character class. You can find more details in the documentation, but for now we just use whitespace.

We will replace all the whitespace characters with underscore. Add it to the end of the script.

console.log(result.replace(cre.global`whitespace`, '_'));
try it

The last line of output is now:

...
Hello_Convenient_Regular_Expressions!!!

Now, let's do something opposite. Replace all non-whitespace characters with a question mark. Add the following to the script:

console.log(result.replace(cre.global`not whitespace`, '?'));
try it

Now, output ends with:

...
Hello_Convenient_Regular_Expressions!!!
????? ?????????? ??????? ??????????????

We used here not operator. This operator can be added before different kinds of expressions to negate its meaning. In this case, we got complement of a character class. It can be applied to any character class, for example: cre`not [a-z]` is equivalent of cre`[^a-z]`.

Combine the expressions

We want to combine our knowledge in one expression. Create an expression that replaces both gray and grey words with some other word.

Start again with a new script called tutorial-2.mjs:

import cre from "con-reg-exp";

const mySecondExp = cre.global`"gr", [ae], "y"`;

const input = "The Englishman's hair is grey. The American's hair is gray.";
const result = input.replace(mySecondExp, "white");

console.log(result);
console.log(mySecondExp);
try it

Now, run it:

node tutorial-2.mjs

And, the expected output is:

The Englishman's hair is white. The American's hair is white.
/gr[ae]y/gmsu

Now, look at our expression:

const mySecondExp = cre.global`"gr", [ae], "y"`;

We have "gr" string literal, [ae] character class, and "y" string literal. As you can see, the expressions are separated with the comma ,. You can also use semicolon ; which is recommended for multiline expressions, so give it a try and rewrite the expression in multiline form:

const mySecondExp = cre.global`
    "gr";
    [ae];
    "y";
`;
try it

Run it again and you should see exactly the same output.

There is one additional separator at the end of the expression. It is redundant, but it should be there for consistency. The redundant separators are ignored.

As in JavaScript, semicolons at the end of line are optional, so the same expression without semicolons will be:

const mySecondExp = cre.global`
    "gr"
    [ae]
    "y"
`;
try it

This example looks better in single-line form, but when expression grows it is much better to use multiple lines.

Operator or

The or operator is equivalent to | in standard RegExp.

We can rewrite previous expression to use or instead of character class.

const mySecondExp = cre.global`"gray" or "grey"`;
try it

Run it and you will get:

The Englishman's hair is white. The American's hair is white.
/gray|grey/gmsu

We apply or operator on entire words in the expression above. Now, try to apply or only to letters are different.

const mySecondExp = cre.global`"gr", "a" or "e", "y"`;
try it

The output is now:

The Englishman's hair is white. The American's hair is white.
/gr(?:a|e)y/gmsu

Looking at the expression you may notice that or operator has higher precedence than comma. The or operator was first applied to "a" and "e", then it is separated by commas. That is why output RegExp contains group (?:a|e).

Grouping

Now, we will add silver word to our expression. Additionally, we need to extend our sample string.

import cre from "con-reg-exp";

const mySecondExp = cre.global`("gr", "a" or "e", "y") or "silver"`;

const input = `The Englishman's hair is grey
    The American's hair is gray.
    Some other guy has silver hair.`;
const result = input.replace(mySecondExp, "white");

console.log(result);
console.log(mySecondExp);
try it

Run the script and you will see:

The Englishman's hair is white
    The American's hair is white.
    Some other guy has white hair.
/gr(?:a|e)y|silver/gmsu

Have a look at our expression now:

const mySecondExp = cre.global`("gr", "a" or "e", "y") or "silver"`;

We surrounded the expression responsible for gray and grey with the parentheses ( ... ) before adding or "silver". If we would miss the parentheses, the or operator will apply to the "y" literal only.

You can also use the braces { ... }, but those are recommended for groups that span multiple lines. We can rewrite the expression for multiple lines:

const mySecondExp = cre.global`
    {
        "gr", "a" or "e", "y";
    } or {
        "silver";
    }
`;
try it

The braces around "silver" are not required, but using them will make the expression more readable.

Comments

The comments in code are important. They helps to understand the code. This is especially important when the code is not self-explanatory. This great feature is not available in the standard RegExp, but Convenient Regular Expression allows them.

The comments works exactly the same as in JavaScript. Use // or /* */.

Add some comments to our last expression. The expression is so simple that the comments are not needed, but let's do it anyway.

const mySecondExp = cre.global`
    /*
     * The expression matches "silver" and "gray" in both forms.
     */
    {
        "gr", "a" or "e", "y"; // The "gray" or "grey" word.
    } or {
        "silver"; // The "silver" word.
    }
`;
try it

As expected, you should get the same output if you run the script with the changes above.

Quantifiers

Quantifiers tell how many times the following expression should be matched.

Start with the optional quantifier (? in the standard RegExp). We will now split lines. The input can have either Windows or UNIX line endings, which is \r\n or \n. The \n character is always present, but the \r character is optional.

Create a new tutorial-3.mjs file for this:

import cre from "con-reg-exp";

const splitPattern = cre`optional \r, \n`;

const input = `
    Two households, both alike in dignity,
    In fair Verona, where we lay our scene,
    From ancient grudge break to new mutiny,
    Where civil blood makes civil hands unclean.
`;
const result = input.split(splitPattern);

console.log(result);
console.log(splitPattern);

try it

The output is:

[
  '',
  '    Two households, both alike in dignity,',
  '    In fair Verona, where we lay our scene,',
  '    From ancient grudge break to new mutiny,',
  '    Where civil blood makes civil hands unclean.',
  ''
]
/\r?\n/msu

The repeat quantifier (* in the standard RegExp) matches any number of times.

We will divide our text using both new lines and commas, but we want to get rid of all the spaces around the separator.

const splitPattern = cre`
    repeat space;
    (optional \r, \n) or ",";
    repeat space;
`;
try it

The output is now:

  '',
  'Two households',
  'both alike in dignity',
  '',
  'In fair Verona',
  'where we lay our scene',
  '',
  'From ancient grudge break to new mutiny',
  '',
  'Where civil blood makes civil hands unclean.',
  ''
]
/ *(?:\r?\n|,) */msu

We can see empty strings when comma and new line are next to each other. We will handle by repeating the expression at least once.

const splitPattern = cre`
    at-least-1 {
        repeat space;
        (optional \r, \n) or ",";
        repeat space;
    }
`;
try it

The previous expression was surrounded by braces and after that at-least-1 quantifier was applied to the entire group. You can put any number at the at-least- quantifier.

Have a look if this modification does what we wanted:

[
  '',
  'Two households',
  'both alike in dignity',
  'In fair Verona',
  'where we lay our scene',
  'From ancient grudge break to new mutiny',
  'Where civil blood makes civil hands unclean.',
  ''
]
/(?: *(?:\r?\n|,) *)+/msu

There are a few other quantifiers, but since you already know general concept, we will not go into details. Those are the quantifiers:

  • at-most-5 matches from zero to five times.

    For example, cre`begin-of-line, "#", at-most-5 "#", whitespace will match beginning of markdown headings.

  • 1-to-6-times matches from one to six times.

    For example, cre`begin-of-line, 1-to-6-times "#", whitespace will match beginning of markdown headings, so it is exactly the same as above.

  • 3-times matches exactly three times.

    For example, cre`begin-of-line, 3-times "#", whitespace will match beginning of markdown heading level 3.

I used begin-of-line assertion in the examples above. You probably guessing what this is, but more about assertions later.

Lazy Quantifiers

Create yet another script called tutorial-4.mjs. We will try to extract any quotations from input text.

import cre from "con-reg-exp";

const quotesExtract = cre.global`["], repeat any, ["]`;

const input = `
    "Gregory, o' my word, we'll not carry coals." said Sampson.
    "No, for then we should be colliers." replied Gregory.
`;
const result = input.match(quotesExtract);

console.log([...result]);
console.log(quotesExtract);
try it

We used character class here to match quotation mark ["], because it is simpler than "\"". It will translate into the same regular expression anyway.

The second interesting part is any character class. It always matches any character. It is an improvement over standard RegExp where . character depends on flags and you had to keep in mind the flags when reading the expression.

Run it and see the output:

[
  `"Gregory, o' my word, we'll not carry coals." said Sampson.\n` +
    '    "No, for then we should be colliers."'
]
/".*"/gmsu

That's not what we wanted. We are expecting two strings in the array. This happens because repeat quantifier is greedy and it will consume as much as possible. We need to use "lazy" quantifier instead.

To make quantifier "lazy" add lazy- prefix to it. Our expression is now:

const quotesExtract = cre.global`["], lazy-repeat any, ["]`;
try it

The output is correct now:

[
  `"Gregory, o' my word, we'll not carry coals."`,
  '"No, for then we should be colliers."'
]
/".*?"/gmsu

The lazy- prefix can be applied to any quantifier.

Capturing

We can see quote characters in the output above. Let's improve it by extracting only inner part of the quote.

We can do it with named capturing group. The syntax is similar to JavaScript's label.

const quotesExtract = cre.global`
    ["];
    quotation: lazy-repeat any;
    ["];
`;

The quotation: applies to the following expression. In this case all the characters inside the quote.

To see the groups, we need to modify the script a little bit more:

import cre from "con-reg-exp";

const quotesExtract = cre.global`
    ["];
    quotation: lazy-repeat any;
    ["];
`;

const input = `
    "Gregory, o' my word, we'll not carry coals." said Sampson.
    "No, for then we should be colliers." replied Gregory.
`;
const result = input.matchAll(quotesExtract);

console.log([...result].map(m => m.groups.quotation));
console.log(quotesExtract);
try it

And see the result:

[
  "Gregory, o' my word, we'll not carry coals.",
  'No, for then we should be colliers.'
]
/"(?<quotation>.*?)"/gimsu

Now, extract also first word from the quote.

const quotesExtract = cre.global`
    ["];
    quotation: {
        firstWord: repeat word-char;
        lazy-repeat any;
    }
    ["];
`;

The expression in quotation: capturing group become more complex, so it is now surrounded with braces. The word-char is a character class that is equivalent to \w in standard RegExp.

We need to adjust the output to see all the groups:

console.log([...result].map(m => m.groups));
try it

And finally we will get:

[
  [Object: null prototype] {
    quotation: "Gregory, o' my word, we'll not carry coals.",
    firstWord: 'Gregory'
  },
  [Object: null prototype] {
    quotation: 'No, for then we should be colliers.',
    firstWord: 'No'
  }
]
/"(?<quotation>(?<firstWord>\w*).*?)"/gmsu

Positional capturing group uses integer as a name. As usual in regular expressions, it starts with 1. Zero is reserved for entire match. The CRE will check if capturing groups are correctly numbered.

We can do exercise using positional capturing groups. Let's replace quotation marks "..." with «...». Prepare a new file tutorial-5.mjs for this:

import cre from "con-reg-exp";

const quotesReplace = cre.global`
    ["];
    1: lazy-repeat any;
    ["];
`;

const input = `
    "Gregory, o' my word, we'll not carry coals." said Sampson.
    "No, for then we should be colliers." replied Gregory.
`;
const result = input.replace(quotesReplace, '«$1»');

console.log(result);
console.log(quotesReplace);
try it

We referenced capturing first group in the replacement string by $1. The output is as expected:

    «Gregory, o' my word, we'll not carry coals.» said Sampson.
    «No, for then we should be colliers.» replied Gregory.
/"(.*?)"/gsu

When to use named and when positional capturing groups?

  • If you can access groups by its name, prefer named capturing groups since it will make your code more readable and immune to mistakes. For example match, exec functions.
  • If you cannot access groups by its name and you are using positions, prefer positional capturing groups since CRE will check if you ordered them correctly.

// TODO: Explain how to pass groups to replacement function.

// TODO: Backreference

Assertion

The assertions does not consume any characters from the input, but instead asserts that specific conditions are met.

Boundary

There are five boundary assertions: begin-of-text, end-of-text, begin-of-line, end-of-line, word-boundary. Unlike standard regular expressions, their meaning is not changing depending on the flags.

// TODO: Simpler example at the beginning.

Line boundary assertions follows the JavaScript rules for ^ and $ with m flag enabled. This means that both \r and \n are interpreted as separate line endings. You must keep it in mind for Windows line endings \r\n.

// TODO: Reconsider adding something like universal-begin-of-line for better Windows line endings.

Now, we will make a script that prints a HTML containing the input string with annotations w, bol, eol, bot, eot for word, line, and text boundaries. Create script called tutorial-6.mjs.

import cre from "con-reg-exp";

const input = `
Two households, both alike in dignity,
In fair Verona, where we lay our scene,
From ancient grudge break to new mutiny,
Where civil blood makes civil hands unclean.
`;

const boundaries = cre.global`
    bot: begin-of-text
    or eot: end-of-text
    or bol: begin-of-line
    or eol: end-of-line
    or word: word-boundary;
`;

const result = input.replace(boundaries, (_, bot, eot, bol, eol, word) => {
    if (bot !== undefined) return '<sub>bot</sub>';
    if (eot !== undefined) return '<sub>eot</sub>';
    if (bol !== undefined) return '<sub>bol</sub>';
    if (eol !== undefined) return '<sub>eol</sub>';
    if (word !== undefined) return '<sub>w</sub>';
});

console.log(`<pre>${result}</pre>`);
console.log(`<pre>${boundaries.toString().replace(/</g, "&lt;")}</pre>`);

The expression is actually pretty simple. It matches all possible boundaries and assigns them a capturing group name.

The replacement function returns HTML <sub> element containing type of boundary. The expression contains only assertions, so it will not consume any characters.

Run it redirecting output to an HTML file:

node tutorial-6.mjs > tutorial-5.html

When you open the HTML file you should see something like that:

bot
bolTwow whouseholdsw, wbothw walikew winw wdignityw,eol
bolInw wfairw wVeronaw, wwherew wwew wlayw wourw wscenew,eol
bolFromw wancientw wgrudgew wbreakw wtow wneww wmutinyw,eol
bolWherew wcivilw wbloodw wmakesw wcivilw whandsw wuncleanw.eol
eot

If you look closely, you may notice that something is wrong with the annotations that we generated. We have at most one annotation in one place, but we should have more. For example, in the same place where each bol is, we should also see w annotation.

This is because the or operator selects only one matching part.

Since the assertion does not consume characters, we can use it multiple times. We can separate them with semicolons:

bot: begin-of-text;
eot: end-of-text;
bol: begin-of-line;
eol: end-of-line;
word: word-boundary;

The problem is that each of assertions must be fulfilled. We can help with the not operator mentioned before. It also works for assertions. With it, we can ensure that each line contains expression that is always fulfilled.

(bot: begin-of-text) or (not begin-of-text);
(eot: end-of-text) or (not end-of-text);
(bol: begin-of-line) or (not begin-of-line);
(eol: end-of-line) or (not end-of-line);
(word: word-boundary) or (not word-boundary);

Each line contains two mutually exclusive expressions. One of them contains capturing group.

The parenthesis are not required here since the or operator has lower precedence than capturing, but with it, the expression is easier to understand.

We are almost there. There is still one issue to solve. Since expressions in each line is always matching, the entire expression is also always matching, so the replacement function will be called every character. We can add one more line to ensure that entire expression will match exactly in boundary assertions.

begin-of-line or end-of-line or word-boundary;

The above line will match any of the boundary assertions. There is no begin‑of‑text and end‑of‑text because begin‑of‑line is also matching the begin‑of‑text and end‑of‑line is also matching the end‑of‑text.

The final script is:

import cre from "con-reg-exp";

const input = `
Two households, both alike in dignity,
In fair Verona, where we lay our scene,
From ancient grudge break to new mutiny,
Where civil blood makes civil hands unclean.
`;

const boundaries = cre.global`
    (bot: begin-of-text) or (not begin-of-text);
    (eot: end-of-text) or (not end-of-text);
    (bol: begin-of-line) or (not begin-of-line);
    (eol: end-of-line) or (not end-of-line);
    (word: word-boundary) or (not word-boundary);
    begin-of-line or end-of-line or word-boundary;
`;

const result = input.replace(boundaries, (_, bot, eot, bol, eol, word) => {
    let result = [];
    if (bot !== undefined) result.push("bot");
    if (bol !== undefined) result.push("bol");
    if (word !== undefined) result.push("w");
    if (eol !== undefined) result.push("eol");
    if (eot !== undefined) result.push("eot");
    return `<sub>${result.join("+")}</sub>`;
});

console.log(`<pre>${result}</pre>`);
console.log(`<pre>${boundaries.toString().replace(/</g, "&lt;")}</pre>`);

Run it again:

node tutorial-6.mjs > tutorial-5.html

The resulting HTML is now:

bot+bol+eol
bol+wTwow whouseholdsw, wbothw walikew winw wdignityw,eol
bol+wInw wfairw wVeronaw, wwherew wwew wlayw wourw wscenew,eol
bol+wFromw wancientw wgrudgew wbreakw wtow wneww wmutinyw,eol
bol+wWherew wcivilw wbloodw wmakesw wcivilw whandsw wuncleanw.eol
bol+eol+eot

The text contains all the annotation describing boundary assertions.

Lookahead and lookbehind

// TODO: Lookahead and lookbehind

Interpolation

// TODO: Interpolation

Unicode

// TODO: Unicode