How to Debug Regular Expressions Step by Step
After years of staring at regex patterns that should work but don't, I developed a systematic debugging approach. Here's my step-by-step method for finding and fixing regex bugs.
I've written probably 20 different email regex patterns in my career. Most of them were wrong. Here's what I learned after years of getting it wrong, and the patterns that actually work.
Okay so funny story. Back in 2019 I'm sitting at my desk at Šikulovi s.r.o. working on some random CSS thing when Petr calls. You know Petr? The bakery chain guy. Anyway he's upset because our signup form rejected his email. I ask him to spell it out: john.o'[email protected]
My regex choked on literally everything. The apostrophe in O'Brien. The plus sign Gmail uses for filtering. The .co.uk extension which apparently needs special handling. I spent the next three hours rewriting everything while maintaining a text file called emails_that_broke_prod.txt which I still update to this day.
So yeah that's how I learned that email validation is deceptively annoying. The RFC 5322 spec (the official email standard) allows insane stuff. "hello world"@example.com is valid. With the space in quotes. Comments in parentheses work somehow. I don't pretend to understand all of it anymore.
Here's what I've been copying into projects for years: ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$
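Not RFC-perfect. Don't care. Catches format slips like a missing @ or gmail,com typed with a comma. It won't flag a misspelled domain like gmial.com, which happens constantly (I see it maybe 3-4 times per week on high-traffic sites), so that needs a separate check. Handles [email protected] for filtering. Works with long TLDs like .photography and multi-part suffixes like .co.uk.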
Quick breakdown: ^ starts the match, [a-zA-Z0-9._%+-]+ gets the username with allowed special chars (found out percent signs are valid from a user in 2021), @ is the at sign, [a-zA-Z0-9.-]+ matches the domain, \.[a-zA-Z]{2,} ensures there's a proper TLD with 2+ letters, $ ends it.
The $ is important btw. Had a form where someone typed "[email protected] please call me back" and it passed because there was nothing stopping text after the email. Oops.
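Here's a minimal sketch of that pattern in use, mostly to show what the trailing $ buys you. The names EMAIL_PATTERN, UNANCHORED_PATTERN, and isValidEmail are made up for illustration, and the addresses are hypothetical stand-ins, not the ones from the stories above.

```javascript
// The everyday pattern from this article, compiled once and reused.
const EMAIL_PATTERN = /^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$/;

// Same pattern without the trailing $, only here to show why the anchor matters.
const UNANCHORED_PATTERN = /^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/;

// Hypothetical helper: true when the whole input looks like an email address.
function isValidEmail(input) {
  return typeof input === "string" && EMAIL_PATTERN.test(input);
}

// Hypothetical input with extra text after the address.
const withTrailingText = "jane@example.com please call me back";

console.log(isValidEmail("jane+news@example.co.uk"));     // true
console.log(isValidEmail(withTrailingText));              // false, $ stops the trailing text
console.log(UNANCHORED_PATTERN.test(withTrailingText));   // true, nothing stops it
console.log(isValidEmail("john.o'brien@example.co.uk"));  // false, no apostrophe in the class
```

Note the last line: the character class has no apostrophe, so an O'Brien-style address is still rejected by this pattern. Adding ' to the first character class would be one way to allow it.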
One German insurance company wanted to block double dots like [email protected] (which is actually invalid per spec so fair enough). Gave them this monster:
^(?!.*\.\.)[a-zA-Z0-9](?:[a-zA-Z0-9._%+-]{0,62}[a-zA-Z0-9])?@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z]{2,})+$
Used it twice in five years. The simple one works for everyone else.
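A sketch for poking at the stricter pattern. It's written here with a leading (?!.*\.\.) lookahead, since that is the piece that rejects consecutive dots anywhere in the address; the rest of the pattern only pins down which characters may start and end each part. STRICT_EMAIL_PATTERN and the sample addresses are illustrative assumptions.

```javascript
// Stricter variant: no consecutive dots, local part and first domain label
// must start and end with a letter or digit.
const STRICT_EMAIL_PATTERN =
  /^(?!.*\.\.)[a-zA-Z0-9](?:[a-zA-Z0-9._%+-]{0,62}[a-zA-Z0-9])?@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z]{2,})+$/;

// Hypothetical sample inputs, one per rule the pattern is meant to enforce.
const samples = [
  "john.doe@example.com",    // ordinary address
  "a@example.com",           // single-character local part
  "user@mail.example.co.uk", // several domain labels
  "john..doe@example.com",   // consecutive dots
  ".john@example.com",       // leading dot
  "john.@example.com",       // trailing dot in the local part
  "user@-example.com",       // domain starts with a hyphen
];

for (const sample of samples) {
  console.log(sample, STRICT_EMAIL_PATTERN.test(sample));
}
```

The first three print true and the last four print false. One trade-off worth knowing: labels after the first may contain letters only, so a domain like a.b2.com is rejected even though it could exist.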
I've rejected [email protected] because plus signs looked wrong. Gmail is the reason everyone knows plus addressing! Rejected [email protected] because my TLD check only allowed exactly 3 letters. Rejected [email protected] for being "too short" even though it's perfectly valid.
Also had a guy with [email protected] - four subdomains! No idea why but hey it's his email and it works. Learned not to question these things.
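Oh and [email protected] which taught me to treat emails as case-insensitive. Strictly speaking only the domain is guaranteed to be case-insensitive and the part before the @ is up to the mail server, but in practice providers ignore case there too. That one I learned pretty early at least.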
No @ at all - obvious. Just @example.com with nothing before - who are you?? Just email@ with nothing after - going where exactly? Double dots like [email protected] - invalid per RFC. Starting with a dot like [email protected] - also invalid. Domain starting with hyphen like [email protected] - nope.
If your pattern lets any of these through go fix it.
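A quick way to check is to run the must-reject list against your pattern and print whatever slips through. This is a sketch: findLeaks and MUST_REJECT are hypothetical names, and the inputs are stand-ins for the cases listed above.

```javascript
// The simple everyday pattern from earlier in the article.
const SIMPLE_PATTERN = /^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$/;

// Hypothetical stand-ins for the malformed inputs listed above.
const MUST_REJECT = [
  "plainaddress",          // no @ at all
  "@example.com",          // nothing before the @
  "email@",                // nothing after the @
  "john..doe@example.com", // consecutive dots
  ".user@example.com",     // leading dot
  "user@-example.com",     // domain starts with a hyphen
];

// Returns every input the pattern accepts even though it should not.
function findLeaks(pattern, cases) {
  return cases.filter((input) => pattern.test(input));
}

const leaks = findLeaks(SIMPLE_PATTERN, MUST_REJECT);
console.log(leaks);
// [ 'john..doe@example.com', '.user@example.com', 'user@-example.com' ]
```

So the simple pattern stops the first three and lets the dot and hyphen cases through. Whether that matters depends on the project; the stricter pattern shown earlier is one option when it does.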
Yeah input type="email" exists. Chrome does it one way, Firefox slightly different, Safari is Safari. I always validate server-side too because people disable JavaScript and pentesters definitely don't run your frontend.
Had a security audit once where the guy curled garbage directly to our API. Backend validation saved us there.
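Roughly what the backend check can look like, as a framework-free sketch. validateSignupPayload, the payload shape, and the 254-character cap are assumptions for illustration; wire it into whatever your server actually uses.

```javascript
const SERVER_EMAIL_PATTERN = /^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$/;

// Assumed upper bound; 254 characters is the commonly cited maximum address length.
const MAX_EMAIL_LENGTH = 254;

// Hypothetical validator for an already-parsed request body.
// Never trusts the client: the field may be missing, non-string, or garbage.
function validateSignupPayload(payload) {
  const email = payload ? payload.email : undefined;
  if (typeof email !== "string") {
    return { ok: false, error: "email is required" };
  }
  const trimmed = email.trim();
  if (trimmed.length === 0 || trimmed.length > MAX_EMAIL_LENGTH) {
    return { ok: false, error: "email has an invalid length" };
  }
  if (!SERVER_EMAIL_PATTERN.test(trimmed)) {
    return { ok: false, error: "email format is invalid" };
  }
  return { ok: true, email: trimmed };
}

console.log(validateSignupPayload({ email: "jane@example.com" })); // { ok: true, email: 'jane@example.com' }
console.log(validateSignupPayload({ email: 12345 }).ok);           // false
console.log(validateSignupPayload(null).ok);                       // false
```

Returning an object instead of throwing keeps it easy to turn a failure into a 400 response with a readable message.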
Munich project 2022. Client's coworker had mü[email protected] with the umlaut. My ASCII-only regex rejected it. Except RFC 6531 made international emails valid years ago??
So now пользователь@example.com is real. So is user@例え.jp. In JavaScript you need the /u flag together with Unicode property escapes like \p{L} (the flag alone doesn't widen [a-zA-Z]), or Punycode conversion for the domain part. Honestly I usually just ask if they have an alternate ASCII email. Not elegant but works.
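Both options, sketched. The first swaps the ASCII ranges for \p{L} and \p{N}, which only work with the /u flag. The second leans on the built-in URL parser to turn an international domain into its ASCII (Punycode) form; it leaves the part before the @ alone. UNICODE_EMAIL_PATTERN, isValidUnicodeEmail, and toAsciiDomain are hypothetical names, and this is one possible approach rather than a complete RFC 6531 implementation.

```javascript
// Unicode-aware variant: \p{L} is any letter, \p{N} any digit. Requires the u flag.
const UNICODE_EMAIL_PATTERN = /^[\p{L}\p{N}._%+-]+@[\p{L}\p{N}.-]+\.\p{L}{2,}$/u;

function isValidUnicodeEmail(input) {
  // Normalize first so a letter plus a combining accent becomes a single letter.
  return UNICODE_EMAIL_PATTERN.test(input.normalize("NFC"));
}

// Converts only the domain to ASCII using the WHATWG URL parser.
// Throws if the domain is not a valid host name.
function toAsciiDomain(email) {
  const at = email.lastIndexOf("@");
  const host = new URL(`http://${email.slice(at + 1)}`).hostname;
  return `${email.slice(0, at)}@${host}`;
}

console.log(isValidUnicodeEmail("пользователь@example.com")); // true
console.log(isValidUnicodeEmail("user@例え.jp"));              // true
console.log(isValidUnicodeEmail("müller@example.de"));        // true
console.log(toAsciiDomain("user@例え.jp"));                    // domain comes back in xn-- form
```

After toAsciiDomain, an address with an ASCII local part can go through the plain ASCII pattern again, which is handy if the rest of your stack isn't Unicode-ready.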
A fully RFC-5322 compliant regex is like 6000 characters. The spec allows quoted strings with spaces, inline comments, all sorts of chaos. In eight years I've never seen a real user with "john doe"@example.com as their actual email.
If one shows up someday I'll handle it manually. Not building a regex monster for a 0.001% case.
Regex catches format errors and typos. Verification emails catch fake addresses. Don't be too strict or users get frustrated.
Built a typo suggester that asks "Did you mean gmail.com?" when someone types gmial.com. Catches mistakes before they bounce. Clients love it because it reduces support tickets.
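A minimal sketch of how such a suggester could work, assuming a short hard-coded list of popular domains and plain Levenshtein distance. COMMON_DOMAINS, editDistance, and suggestDomain are made-up names for illustration, not the code from that project.

```javascript
// Assumed list of domains worth suggesting; a real one would be longer.
const COMMON_DOMAINS = ["gmail.com", "yahoo.com", "outlook.com", "hotmail.com"];

// Levenshtein distance via dynamic programming, keeping one row at a time.
function editDistance(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Returns the closest known domain within maxDistance edits, or null
// when the domain is already known or nothing is close enough.
function suggestDomain(email, maxDistance = 2) {
  const at = email.lastIndexOf("@");
  if (at === -1) return null;
  const domain = email.slice(at + 1).toLowerCase();
  if (COMMON_DOMAINS.includes(domain)) return null;

  let best = null;
  let bestDistance = maxDistance + 1;
  for (const candidate of COMMON_DOMAINS) {
    const distance = editDistance(domain, candidate);
    if (distance < bestDistance) {
      best = candidate;
      bestDistance = distance;
    }
  }
  return best;
}

console.log(suggestDomain("user@gmial.com")); // gmail.com
console.log(suggestDomain("user@gmail.com")); // null
```

Keep it a suggestion, not a correction: a legitimate domain that happens to sit one letter away from a popular one will trigger it too, so the user should be able to ignore the prompt.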
For other languages, the same pattern works everywhere: Python uses re.match() (re.fullmatch() is safer, since $ in Python also matches just before a trailing newline), PHP uses preg_match(), Java needs extra backslashes as usual. Test everything with a collection of weird emails before shipping. I keep 47 test cases in a file from past production incidents. Five minutes of testing beats Friday evening hotfixes.
I use ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$ for most projects. It's not perfect, but it catches obvious typos without rejecting valid emails. Good enough for 99% of cases.
For quick form validation? Regex is fine. For anything serious, use a library like validator.js. Trust me, they've already handled the edge cases you haven't thought of.
If your pattern rejects valid emails, it's probably too strict. I've made this mistake a dozen times. Check if you're allowing plus signs (user+tag@), long TLDs like .photography, and subdomains. Test with real emails!
A regex match doesn't prove the address exists. Regex just checks format. The email could be fake or mistyped. If it matters, send a verification email. That's the only way to know it actually works.
In JavaScript it's simple: /^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$/.test(email). Returns true or false. I've got this pattern memorized at this point.
A regex that matches every RFC 5322 address is technically possible, but the pattern would be insane. The spec allows quoted strings, comments, all sorts of weird stuff. Just use a simple pattern and handle the rare exceptions manually.
Founder of CodeUtil. Web developer building tools I actually use. When I'm not coding, I experiment with productivity techniques (with mixed success).