Posted: 28th January 2014

Author: Steven

Tagged: Tutorials

Harness the power of regex

I have always found regex quite difficult to use, so I have written down a couple of the patterns I use most often. We will dissect them together and work out exactly what each one is doing.

/[A-Z]/gi - The forward slashes at the start and end simply mark where the regex begins and ends; they are not part of the pattern itself. [A-Z] is a character class (not a capture group) that matches any single letter from A through to Z. The letters after the closing / are flags: g (global) makes the regex find every match rather than stopping at the first one, and i makes it ignore case. Because this regex returns every letter as a separate match, it has a lot of work to do on long content.
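To see what the two flags actually change, here is a minimal sketch. I am assuming JavaScript here, since the /.../gi syntax is how it writes regex; the sample text and variable names are made up for illustration.

```javascript
// Made-up sample text, only for illustration.
const text = "Q-Web 2014!";

// g and i together: every letter, upper or lower case, is returned.
const allLetters = text.match(/[A-Z]/gi);
console.log(allLetters); // ["Q", "W", "e", "b"]

// g without i: only upper-case letters match.
const upperOnly = text.match(/[A-Z]/g);
console.log(upperOnly); // ["Q", "W"]

// i without g: match() stops after the first match.
const firstOnly = text.match(/[A-Z]/i);
console.log(firstOnly[0]); // "Q"
```

Dropping a flag and re-running is a quick way to check what it was doing.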

/^[A-Z0-9._%-]+@[A-Z0-9][A-Z0-9.-]{0,61}[A-Z0-9]\.[A-Z]{2,6}$/i - Ok, so this is an email validator regex (do not just use this on its own; it isn't sufficient to validate emails). I know it looks complicated at first but, after understanding each part, it's pretty simple. Let's go through it:

  • ^ means the match has to start at the very beginning of the string, so nothing is allowed in front of whatever regex follows.

  • [A-Z0-9._%-]+ matches any of the characters listed inside the square brackets: letters, digits, ., _, % and -. The + on the end means one or more of them, and it is greedy, so it takes as many as it can (without the + it would match just a single character). So you could have, for example, stev3n%1_2.@qweb.co.uk but not stev£n%1_2.@qweb.co.uk, because the £ sign is not in the square brackets.

  • @ means it will now look for this exact character at that point. Therefore stev3n.qweb.co.uk cannot match.

  • [A-Z0-9][A-Z0-9.-]{0,61}[A-Z0-9] is by far the most complicated part. The [A-Z0-9] at each end requires a letter or digit on either side of the middle part. The middle part, [A-Z0-9.-]{0,61}, allows letters, digits, . and -, and can be anywhere from 0 to 61 characters long, so this whole section can be at most 63 characters. So you could have steven@q-web.com or steven@qwe.b.com or even steven@qw-3.b.com. However you could not have steven@qw-3.bqw-3.bqw-3.bqw-3.bqw-3.bqw-3.bqw-3.bqw-3.bqw-3.bqw-3.bqw-.com because this section ends in a -, and the final [A-Z0-9] needs a letter or digit there. If you took out the last - before .com then it would allow it again.

  • \. means it will now look again for a specific character, which is a . and the \ before it means it is escaped (because an unescaped . in regex means any single character).

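  • [A-Z]{2,6}$/i - finally, this part matches between two and six letters (A to Z), and the $ means the string must end straight after them. The i after the closing / makes the whole regex ignore case. If it was steven@qweb.co.u this would not match because there aren't two characters after the final matched . character. Note that the \. above always ends up being the last . in the address, because [A-Z]{2,6} cannot contain a . itself; any earlier . characters are matched by the [A-Z0-9.-]{0,61} part.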
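The example addresses from the walkthrough can be checked in one go. This is a minimal sketch, again assuming JavaScript; the names emailPattern, longSection, samples and results are my own, and the long domain is built with repeat() so it is easier to count.

```javascript
const emailPattern = /^[A-Z0-9._%-]+@[A-Z0-9][A-Z0-9.-]{0,61}[A-Z0-9]\.[A-Z]{2,6}$/i;

// The long domain from the walkthrough: "qw-3.b" ten times, then "qw-".
// That is 63 characters, ending in "-".
const longSection = "qw-3.b".repeat(10) + "qw-";

const samples = [
  "stev3n%1_2.@qweb.co.uk",               // every character is in the first brackets
  "stev£n%1_2.@qweb.co.uk",               // £ is not in the first brackets
  "stev3n.qweb.co.uk",                    // no @ at all
  "steven@q-web.com",
  "steven@qwe.b.com",
  "steven@qw-3.b.com",
  "steven@qweb.co.u",                     // only one letter after the last .
  `steven@${longSection}.com`,            // "-" directly before .com
  `steven@${longSection.slice(0, -1)}.com`, // same, with the trailing "-" removed
];

// Pair each address with true (matches) or false (does not match).
const results = samples.map((address) => [address, emailPattern.test(address)]);

for (const [address, ok] of results) {
  console.log(ok ? "match   " : "no match", address);
}
```

Run in order, the list above gives match, no match, no match, match, match, match, no match, no match, match. Remember that a match only means the address has the right shape, not that it is a real, working email address.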

Have a play around with regex on http://www.regexr.com/, an absolutely excellent website which gives usage, examples and explanations (hover over the regex).

Good luck with learning regex. Always remember to break it down and analyse it in steps to work out what it is doing!
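One way to practise breaking a regex down is to write it as a list of labelled pieces and join them back together. This is only a sketch, assuming JavaScript; the parts list and the rebuilt name are hypothetical, not anything the regex requires.

```javascript
// Each piece is a plain string, so the backslash is doubled: "\\." is the regex \.
const parts = [
  ["^", "start of the string"],
  ["[A-Z0-9._%-]+", "one or more allowed characters before the @"],
  ["@", "a literal @"],
  ["[A-Z0-9]", "a letter or digit"],
  ["[A-Z0-9.-]{0,61}", "0 to 61 letters, digits, dots or hyphens"],
  ["[A-Z0-9]", "a letter or digit"],
  ["\\.", "a literal dot"],
  ["[A-Z]{2,6}", "two to six letters"],
  ["$", "end of the string"],
];

// Join the pieces and apply the i flag, as in the original regex.
const rebuilt = new RegExp(parts.map(([piece]) => piece).join(""), "i");

for (const [piece, meaning] of parts) {
  console.log(piece.padEnd(18), meaning);
}
console.log(rebuilt.test("steven@q-web.com")); // true
```

Swapping out one piece at a time and re-testing makes it much easier to see which part is responsible for a match failing.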

Blog posts written by former QWeb employees are not necessarily an accurate indication of the current opinions of QWeb Ltd and the information provided in tutorials might be biased or subjective, or might become out of date.
