6 Handy Regular Expressions Every Front-End Developer Should Know

Leverage the power of regular expressions to perform various text processing tasks.

Faraz Kelhini
Published in Bits and Pieces
8 min read · Jan 15, 2020

Photo by geralt

Almost all popular programming languages support regular expressions, and there’s a good reason for that: regular expressions provide developers with remarkably powerful tools that enable them to quickly perform tasks that would otherwise require dozens of lines of code.

In this article, we will look at six text processing and manipulation tasks that front-end developers often have to deal with and see how regular expressions simplify the process.

Finding Sentences That Contain a Specific Word

Suppose you want to match all sentences in a text that have a specific word within them. Perhaps you need to show the sentences in search results, or maybe you want to remove them from the text. The regex /[^.!?]*\bword\b[^.!?]*.?/gi allows you to do just that. Here’s how it works in action:

const str = "The apple tree originated in Central Asia. It is cultivated worldwide. Apple matures in late summer or autumn.";
// en.wikipedia.org/wiki/Apple
// find sentences that contain the word "apple"
str.match(/[^.!?]*\bapple\b[^.!?]*.?/gi);
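// => ["The apple tree originated in Central Asia.", " Apple matures in late summer or autumn."]
// Note: the second match begins with the space that follows the previous sentence; call trim() on each result to remove it.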

Let’s examine this regex step by step:

  • [^.!?] matches any character that’s not a ., !, or ?
  • * matches zero or more occurrences of the preceding item
  • \b matches at a position known as a “word boundary” (a position between a word character, meaning an ASCII letter, digit, or underscore, and a non-word character or the start or end of the string).
  • apple matches the characters literally (because it’s case sensitive, we add the i flag to the end of the regex)
  • \b matches a word boundary
  • [^.!?] matches any character that’s not a ., !, or ?
  • * matches zero or more occurrences of the preceding item
  • . matches any character that’s not a line break character (here, it consumes the ., !, or ? that ends the sentence)
  • ? matches zero or one occurrence of the preceding item
  • g tells the regex engine to match all occurrences rather than stopping after the first match
  • i makes the search case-insensitive
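
To reuse this pattern for any search term, the regex can be built at runtime. The following is a minimal sketch rather than a definitive implementation: escapeRegExp and findSentences are hypothetical helper names that don't appear in the example above, and the sketch assumes sentences end with ., !, or ?. It also trims each result, because a match can begin with the whitespace that follows the previous sentence.

```javascript
// Hypothetical helper: escape characters that have a special meaning in a regex,
// so that the search term is always matched literally.
function escapeRegExp(term) {
  return term.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: return every sentence in `text` that contains `word`.
function findSentences(text, word) {
  const pattern = new RegExp(
    '[^.!?]*\\b' + escapeRegExp(word) + '\\b[^.!?]*.?',
    'gi'
  );
  // match() returns null when nothing matches, so fall back to an empty array.
  return (text.match(pattern) || []).map((sentence) => sentence.trim());
}

const sample =
  'The apple tree originated in Central Asia. It is cultivated worldwide. ' +
  'Apple matures in late summer or autumn.';

const sentences = findSentences(sample, 'apple');
// => ["The apple tree originated in Central Asia.", "Apple matures in late summer or autumn."]
```

Note that backslashes are doubled inside the string passed to new RegExp(), because the string literal consumes one level of escaping before the regex engine sees the pattern.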

Stripping Invalid Characters from Filenames

When providing a file to be downloaded, it shouldn’t have certain characters in its name. In Windows OS, for example, the following characters are invalid in filenames and should be stripped:

  • < (less than)
  • > (greater than)
  • : (colon)
  • " (double quote)
  • / (forward slash)
  • \ (backslash)
  • | (vertical bar or pipe)
  • ? (question mark)
  • * (asterisk)

With regex, stripping invalid characters is very simple. Let’s look at an example:

const str = "https://en.wikipedia.org/";
str.replace(/[<>|:"*?\\/]+/g, '');
// => "httpsen.wikipedia.org"

[], which is called a character class, matches any one of the characters between the square brackets. So, by placing all invalid characters within it and adding a global (g) flag at the end of the regular expression, we can effectively strip those characters from the string.

Note that in the character class, the backslash has a special meaning and must be escaped with another backslash: \\. The + operator repeats the character class so that a sequence of invalid characters is replaced at the same time, which is good for performance. When the replacement is an empty string, you can omit it with no effect on the result.

Keep in mind that the second argument of the replace() method must be an empty string, unless you want the invalid character to be replaced with another character.
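
The + quantifier does change the result once the replacement is not an empty string. A small sketch to illustrate; the file name and the underscore replacement are made up for this example:

```javascript
const invalidRun = /[<>|:"*?\\/]+/g;  // with +: a whole run of invalid characters is one match
const invalidChar = /[<>|:"*?\\/]/g;  // without +: each invalid character is its own match

const fileName = 'report<>2020.txt';  // hypothetical input

const withPlus = fileName.replace(invalidRun, '_');
// => "report_2020.txt"

const withoutPlus = fileName.replace(invalidChar, '_');
// => "report__2020.txt"

// With an empty replacement, both versions produce the same string.
const sameWhenEmpty =
  fileName.replace(invalidRun, '') === fileName.replace(invalidChar, '');
// => true
```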

There are also several reserved names that are used internally by Windows for various tasks and are not allowed as filenames. The reserved names are as follows:

CON, PRN, AUX, NUL, COM1, COM2, COM3, COM4, COM5, COM6, COM7, COM8, COM9, LPT1, LPT2, LPT3, LPT4, LPT5, LPT6, LPT7, LPT8, and LPT9

Microsoft’s Windows Dev Center has a comprehensive article on valid filenames if you’d like to learn more.

To strip reserved names, run the following code:

str.replace(/^(CON|PRN|AUX|NUL|COM1|COM2|COM3|COM4|COM5|COM6|COM7|COM8|COM9|LPT1|LPT2|LPT3|LPT4|LPT5|LPT6|LPT7|LPT8|LPT9)$/i, 'file');

Basically, this code tells the regex engine to replace the characters in str if they form one of the words separated by the pipe character (|). In this case, we can’t use an empty string as the second argument because the file would have no name.

Note that if the string contains any additional character, it won’t be replaced. For example, “con” will be replaced, but not “concord”, which is a valid filename. This is achieved by using ^ and $ in the regex. ^ matches the beginning of the string. It ensures no other characters come before the string we want to match. And $ matches the end of the string.

It’s also possible to write this regex in a more compact manner by using a character class:

str.replace(/^(CON|PRN|AUX|NUL|COM[1-9]|LPT[1-9])$/i, 'file');

[1-9] matches a digit from 1 to 9.
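
The two steps from this section can be combined into one function. This is only a sketch under stated assumptions: sanitizeFilename and its fallback parameter are hypothetical names, and the function covers just the two checks described here (invalid characters and reserved names), not every Windows naming rule.

```javascript
// Hypothetical helper combining both replacements from this section.
function sanitizeFilename(name, fallback = 'file') {
  // Step 1: strip characters that are invalid in Windows filenames.
  const stripped = name.replace(/[<>|:"*?\\/]+/g, '');

  // If nothing is left, use the fallback so the file still has a name.
  if (stripped === '') {
    return fallback;
  }

  // Step 2: replace the name if it is exactly one of the reserved names.
  return stripped.replace(/^(CON|PRN|AUX|NUL|COM[1-9]|LPT[1-9])$/i, fallback);
}

const cleaned = sanitizeFilename('https://en.wikipedia.org/');
// => "httpsen.wikipedia.org"
const reserved = sanitizeFilename('con');
// => "file"
const untouched = sanitizeFilename('concord');
// => "concord"
```

Stripping first matters: an input such as "c:o?n" only becomes the reserved name "con" after the invalid characters are removed.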

Replacing Multiple Whitespaces with a Single Whitespace

When a webpage is rendered, repeated whitespace characters are displayed as a single whitespace. However, sometimes it’s necessary to clean up user input or other data and replace repeated whitespaces with a single whitespace. Here’s how we can do that with regex:

const str = "  My    opinions may  have changed,    but not the fact that I'm right.";    // Ashleigh Brilliant
str.replace(/\s\s+/g, ' ');
// => " My opinions may have changed, but not the fact that I'm right."

This regex consists of only two metacharacters, a quantifier, and a flag:

  • \s matches a single whitespace character, including ASCII space, tab, line feed, carriage return, vertical tab, and form feed
  • \s matches a single whitespace character again
  • + matches the previous item one or more times
  • g tells the regex engine to match all occurrences rather than stopping after the first match

As a result, every run of two or more whitespace characters is replaced with a single space. Notice that the result in the example above still has a whitespace character at the beginning that should be removed. To do so, simply add the trim() method to the end of the statement:

str.replace(/\s\s+/g, ' ').trim();
// => "My opinions may have changed, but not the fact that I'm right."

Keep in mind that this code replaces any kind of whitespace character, including ASCII space, tab, line feed, carriage return, vertical tab, and form feed with a space (U+0020) character. So, if a carriage return immediately follows a tab, they will be replaced by a space. If that’s not your intention, and you want to only replace the same type of whitespaces, use this code instead:

str.replace(/(\s)\1+/g, '$1').trim();

\1 is a backreference and matches the same character that was matched in the first pair of parentheses (\s). To replace them, we use $1 in the second argument of replace(), which inserts the character that was matched in the parentheses.
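
To see the difference between the two approaches side by side, here is a minimal sketch. collapseWhitespace and its sameTypeOnly parameter are hypothetical names chosen for this illustration, and the input string is made up:

```javascript
// Hypothetical helper: collapse repeated whitespace and trim the result.
// With sameTypeOnly = true, only runs of the *same* whitespace character
// are collapsed, and the character itself is kept (via $1).
function collapseWhitespace(text, sameTypeOnly = false) {
  const collapsed = sameTypeOnly
    ? text.replace(/(\s)\1+/g, '$1')
    : text.replace(/\s\s+/g, ' ');
  return collapsed.trim();
}

// Two tabs, then two spaces, then a space followed by a tab.
const messy = 'a\t\tb  c \td';

const anyRun = collapseWhitespace(messy);
// => "a b c d"

const sameOnly = collapseWhitespace(messy, true);
// => "a\tb c \td"  (the mixed " \t" run is left alone)
```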

Limiting User Input to Alphanumeric Characters

A common task during web development is to limit the user input to only alphanumeric characters (A–Z, a–z, and 0–9). With regular expressions, achieving this task is very simple: use a character class to define the allowed range of characters, then append a quantifier to it to specify the number of times characters can be repeated:

const input1 = "John543";
const input2 = ":-)";
/^[A-Z0-9]+$/i.test(input1); // → true
/^[A-Z0-9]+$/i.test(input2); // → false

Note: this regex is only appropriate for the English language and won’t match accented letters or letters from other languages.

Here’s how it works:

  • ^ matches the beginning of the string. It ensures no other characters come before the string we want to match.
  • [A-Z0-9] matches a character between A and Z, or between 0 and 9. Because this is case sensitive, we add the i flag to the end of the regex. Alternatively, we can use [A-Za-z0-9] without the flag.
  • + matches the preceding item one or more times. So the input must have at least one alphanumeric character; otherwise, the match fails. If you want to make the field optional, you can use the * quantifier, which matches the preceding item zero or more times.
  • $ matches the end of the string.
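
The pieces above can be wrapped in a small validation helper. This is a sketch, not a definitive implementation: isAlphanumeric, its allowEmpty parameter, and isUnicodeAlphanumeric are hypothetical names. The second function is an assumption-based variant for the non-English case mentioned in the note; it relies on the Unicode property escapes \p{L} (letters) and \p{N} (numbers), which require the u flag and an engine with ES2018 support.

```javascript
// Hypothetical helper: ASCII letters and digits only.
// allowEmpty switches the quantifier from + to * so an empty field passes.
function isAlphanumeric(input, allowEmpty = false) {
  const pattern = allowEmpty ? /^[A-Z0-9]*$/i : /^[A-Z0-9]+$/i;
  return pattern.test(input);
}

// Hypothetical variant: letters and numbers from any script.
function isUnicodeAlphanumeric(input) {
  return /^[\p{L}\p{N}]+$/u.test(input);
}

const results = [
  isAlphanumeric('John543'),               // true
  isAlphanumeric(':-)'),                   // false
  isAlphanumeric(''),                      // false (+ needs at least one character)
  isAlphanumeric('', true),                // true  (* allows an empty string)
  isAlphanumeric('Jos\u00e9543'),          // false ("é" is outside A-Z)
  isUnicodeAlphanumeric('Jos\u00e9543'),   // true
];
```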

Turning URLs into Links

Suppose we have one or more URLs in a text that aren’t HTML anchor elements and thus not clickable. We want to convert the URLs into links automatically. To do that, we first need to find the URLs and then wrap each in <a>…</a> tags with <a>’s href attribute pointing to the URL:

const str = "Visit https://en.wikipedia.org/ for more info.";
str.replace(/\b(https?|ftp|file):\/\/\S+[\/\w]/g, '<a href="$&">$&</a>');
// => "Visit <a href="https://en.wikipedia.org/">https://en.wikipedia.org/</a> for more info."

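Note: be careful when using this regex. It deliberately leaves a trailing punctuation mark out of the match, so a URL that legitimately ends with one (for example, a closing parenthesis) will be cut short. It also may not match more complex URLs.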

Let’s look at how this code works:

  • \b matches at a position known as a “word boundary”.
  • (https?|ftp|file) matches the characters “https”, or “http”, or “ftp”, or “file”
  • : matches the colon character literally
  • \/\/ matches two forward slashes literally (each / is escaped with a backslash so that it doesn’t end the regex literal)
  • \S matches any single character that’s not a whitespace
  • + matches the preceding item one or more times
  • [\/\w] matches either a forward slash or a word character. Without this, the regex would match any punctuation marks at the end of the URL.
  • g tells the regex engine to match all occurrences rather than stopping after the first match
  • $& in the second argument of replace() inserts the matched substring into the replacement string
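
The same replacement can be packaged as a reusable function. The sketch below makes a few assumptions: linkify is a hypothetical name, the example.com URLs are placeholders, and the input text is assumed to be safe to insert into a page as HTML (nothing is escaped here). A replacer function is used in place of $& only to make the wrapping step explicit; the result is the same.

```javascript
// Hypothetical helper: wrap every URL found in `text` in an anchor element.
function linkify(text) {
  const urlPattern = /\b(https?|ftp|file):\/\/\S+[\/\w]/g;
  // The first argument passed to the replacer is the whole match, like $&.
  return text.replace(urlPattern, (url) => `<a href="${url}">${url}</a>`);
}

const linked = linkify(
  'See https://example.com/path. Also ftp://example.com/file.txt'
);
// => 'See <a href="https://example.com/path">https://example.com/path</a>. ' +
//    'Also <a href="ftp://example.com/file.txt">ftp://example.com/file.txt</a>'
```

In the first URL, the period that ends the sentence stays outside the link because the match must end with a forward slash or a word character.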

Deleting Duplicate Words

It’s not uncommon for articles and tutorials to contain unwanted duplicate words. Even professional writers have to proofread their writing for such mistakes. A simple search for “the the” on Google News turns up hundreds of prominent news organizations with a duplicate “the” in their articles. Luckily, regular expressions enable us to fix that with a single line of code:

const str = "This this sentence has has double words.";
str.replace(/\b(\w+)\s+\1\b/gi, '$1');
// => "This sentence has double words."

Here’s how it works:

  • \b matches at a position known as a “word boundary” (a position between a word character, meaning an ASCII letter, digit, or underscore, and a non-word character or the start or end of the string).
  • \w matches a word character (ASCII letter, digit, or underscore)
  • + matches the preceding item one or more times
  • \s matches a whitespace character
  • + matches the preceding item one or more times so that duplicate words with more than one whitespace character between them can be detected
  • \1 is a backreference and matches the same text that was matched in the first pair of parentheses
  • \b matches a word boundary
  • g tells the regex engine to match all occurrences rather than stopping after the first match
  • i makes the search case-insensitive (we want to disregard capitalization differences)
  • $1 in the second argument of replace() inserts the text that was matched in the first pair of parentheses
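
One limitation worth knowing: because each match covers exactly two words, a word repeated three times in a row is only reduced to two. The sketch below shows a possible variant, not the regex from this article, that wraps the repeated part in a non-capturing group followed by + so that a whole run collapses in a single pass. removeRepeatedWords is a hypothetical name.

```javascript
// Hypothetical variant: (?:\s+\1\b)+ matches one or more repetitions
// of the captured word, so runs of any length are collapsed.
function removeRepeatedWords(text) {
  return text.replace(/\b(\w+)(?:\s+\1\b)+/gi, '$1');
}

// The pair-only regex leaves one duplicate behind in a run of three.
const pairOnly = 'the the the cat'.replace(/\b(\w+)\s+\1\b/gi, '$1');
// => "the the cat"

const wholeRun = removeRepeatedWords('the the the cat');
// => "the cat"

const sentence = removeRepeatedWords('This this sentence has has double words.');
// => "This sentence has double words."
```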

Conclusion

Regular expressions have become an important part of any programmer’s toolbox. In this article, we looked at how front-end developers could take advantage of regular expressions to perform various tasks. But we just scratched the surface of regular expressions’ potential.

Taking the time to become proficient at regular expressions will definitely be a worthwhile investment as it will help you overcome various obstacles you encounter in coding.

I hope you found this article useful! If you have any questions or comments, feel free to leave them below. You can follow me on Medium. I’m also on Twitter.

Faraz is a professional JavaScript developer who is passionate about promoting patterns and ideas that make web development more productive. Website: eloux.com