Worse than RegEx: comments in mail addresses

Postby arjan » Tue Jan 27, 2015 11:04 pm UTC

While trying to find the right XKCD to show someone regular expressions are the worst, I got to this page first: http://ex-parrot.com/~pdw/Mail-RFC822-Address.html

which says: "The regular expression does not cope with comments in email addresses. The RFC allows comments to be arbitrarily nested. A single regular expression cannot cope with this. The Perl module pre-processes email addresses to remove comments before applying the mail regular expression."

So a single regular expression cannot even cope with nested comments? I find that a bit hard to believe. Harder to believe is the fact that the RFC (no pun) actually ALLOWS for comments!

From the RFC:
"The comment construct permits message originators to add text which will be useful for human readers, but which will be ignored by the formal semantics. Comments should be retained while the message is subject to interpretation according to this standard. However, comments must NOT be included in other cases, such as during protocol exchanges with mail servers."

If I understand correctly, you can locally have perfectly valid addresses like "john(the (unofficial) boss!)@(shitty anyway)mycompany.com(really shitty)" where Outlook should gladly accept that as a valid address, but should not use those comments when talking to the SMTP server.

No wonder the battle between Internet/SMTP and FidoNet was lost by the latter. Nobody thought about comments in mail addresses!

Postby phlip » Wed Jan 28, 2015 1:03 am UTC

arjan wrote:So a single regular expression cannot even cope with nested comments? I find that a bit hard to believe.

In general, regular expressions can't deal with nested anything. Certain regex systems have extensions to allow recursion, and then those can do nesting, but they're few and far between... but still, strictly speaking, languages that have nesting are not regular.
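To illustrate why nesting calls for a counter rather than a classical regular expression, here is a minimal Python sketch of comment stripping. The function name `strip_comments` is made up for this example, and it deliberately ignores quoted strings and backslash escapes, which a real RFC 822 parser would also have to handle.

```python
def strip_comments(address: str) -> str:
    """Remove (possibly nested) parenthesised comments from an address.

    Simplified illustration: quoted strings and backslash escapes are
    not handled here.
    """
    depth = 0   # how many comments we are currently inside
    kept = []
    for ch in address:
        if ch == "(":
            depth += 1
        elif ch == ")":
            if depth == 0:
                raise ValueError("unbalanced ')' in address")
            depth -= 1
        elif depth == 0:
            kept.append(ch)   # keep only text outside every comment
    if depth != 0:
        raise ValueError("unclosed '(' in address")
    return "".join(kept)


print(strip_comments(
    "john(the (unofficial) boss!)@(shitty anyway)mycompany.com(really shitty)"
))
# prints: john@mycompany.com
```

The `depth` counter is exactly the piece of state a finite automaton cannot keep for arbitrarily deep nesting, which is why a preprocessing pass like this is a common workaround.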

arjan wrote:Harder to believe is the fact that the RFC (no pun) actually ALLOWS for comments!

IIRC, RFC822 is one of those standards that came into place significantly after the thing they were standardising was already in wide use. So the standard had to accept a lot of weird things that software was already doing.

Honestly, the more interesting thing to me that they codified in the RFC is that if you enter "Some Guy" <someguy@example.com> as your recipient, that whole thing is the email address as far as the RFC is concerned, whereas most people would look at that and say that just someguy@example.com is the address and the "Some Guy" part is just a display name. It's things like that where a lot of the complexity in that regex comes from: being able to match things like this.
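As a small illustration of the display-name/address split, Python's standard library exposes `email.utils.parseaddr`, which separates the two parts. This is only a sketch of the idea; it says nothing about how the Perl regex in the linked page works internally.

```python
from email.utils import parseaddr

# Split an RFC 822 style mailbox into (display name, addr-spec).
display_name, addr_spec = parseaddr('"Some Guy" <someguy@example.com>')

print(display_name)  # Some Guy
print(addr_spec)     # someguy@example.com
```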

arjan wrote:If I understand correctly, you can locally have perfectly valid addresses like "john(the (unofficial) boss!)@(shitty anyway)mycompany.com(really shitty)" where Outlook should gladly accept that as a valid address, but should not use those comments when talking to the SMTP server.

To be clear, it would be stripping out the comments only during the SMTP part of the interaction... it would still be there in the message header (and your boss would totally be able to see it). The same thing happens with those "display name" segments. That is, the interaction could look something like:

Code:

> 220 mycompany.com ESMTP
MAIL FROM:<arjan@mycompany.com>
> 250 ok
RCPT TO:<john@mycompany.com>
> 250 ok
DATA
> 354 ok
From: "Arjan" <arjan@mycompany.com>
To: "Ugh, this guy again" <john(the (unofficial) boss!)@(shitty anyway)mycompany.com(really shitty)>
Subject: Trying out email comments!
Content-Type: text/plain
Date: Thu, 1 Jan 1970 00:00:00

Hi John,
Playing around with this newfangled "email" thing!
.
> 250 ok, message sent

(Though, since you mention Outlook, there's a good chance the message won't be touching SMTP at all, and will all be done via Exchange's own personal weirdness...)

Code:

enum ಠ_ಠ {°□°╰=1, °Д°╰, ಠ益ಠ╰};
void ┻━┻︵​╰(ಠ_ಠ ⚠) {exit((int)⚠);}
[he/him/his]

Postby PM 2Ring » Wed Jan 28, 2015 2:39 am UTC

@arjan : Now you've gone & made me all nostalgic for Fidonet...

I found it rather bizarre when I first learned that comments are permitted in email addresses; I'm still not very impressed with the concept.

FWIW, there's a classic post on Stack Overflow regarding parsing HTML with Regular Expressions, reproduced below for your edification. :)
Spoiler:
You can't parse [X]HTML with regex. Because HTML can't be parsed by regex. Regex is not a tool that can be used to correctly parse HTML. As I have answered in HTML-and-regex questions here so many times before, the use of regex will not allow you to consume HTML. Regular expressions are a tool that is insufficiently sophisticated to understand the constructs employed by HTML. HTML is not a regular language and hence cannot be parsed by regular expressions. Regex queries are not equipped to break down HTML into its meaningful parts. so many times but it is not getting to me. Even enhanced irregular regular expressions as used by Perl are not up to the task of parsing HTML. You will never make me crack. HTML is a language of sufficient complexity that it cannot be parsed by regular expressions. Even Jon Skeet cannot parse HTML using regular expressions. Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp. Parsing HTML with regex summons tainted souls into the realm of the living. HTML and regex go together like love, marriage, and ritual infanticide. The <center> cannot hold it is too late. The force of regex and HTML together in the same conceptual space will destroy your mind like so much watery putty. If you parse HTML with regex you are giving in to Them and their blasphemous ways which doom us all to inhuman toil for the One whose Name cannot be expressed in the Basic Multilingual Plane, he comes. HTML-plus-regexp will liquify the n​erves of the sentient whilst you observe, your psyche withering in the onslaught of horror. 
Rege̿̔̉x-based HTML parsers are the cancer that is killing StackOverflow it is too late it is too late we cannot be saved the trangession of a chi͡ld ensures regex will consume all living tissue (except for HTML which it cannot, as previously prophesied) dear lord help us how can anyone survive this scourge using regex to parse HTML has doomed humanity to an eternity of dread torture and security holes using regex as a tool to process HTML establishes a breach between this world and the dread realm of c͒ͪo͛ͫrrupt entities (like SGML entities, but more corrupt) a mere glimpse of the world of reg​ex parsers for HTML will ins​tantly transport a programmer's consciousness into a world of ceaseless screaming, he comes, the pestilent slithy regex-infection wil​l devour your HT​ML parser, application and existence for all time like Visual Basic only worse he comes he comes do not fi​ght he com̡e̶s, ̕h̵i​s un̨ho͞ly radiańcé destro҉ying all enli̍̈́̂̈́ghtenment, HTML tags lea͠ki̧n͘g fr̶ǫm ̡yo​͟ur eye͢s̸ ̛l̕ik͏e liq​uid pain, the song of re̸gular exp​ression parsing will exti​nguish the voices of mor​tal man from the sp​here I can see it can you see ̲͚̖͔̙î̩́t̲͎̩̱͔́̋̀ it is beautiful t​he final snuffing of the lie​s of Man ALL IS LOŚ͖̩͇̗̪̏̈́T ALL I​S LOST the pon̷y he comes he c̶̮omes he comes the ich​or permeates all MY FACE MY FACE ᵒh god no NO NOO̼O​O NΘ stop the an​*̶͑̾̾​̅ͫ͏̙̤g͇̫͛͆̾ͫ̑͆l͖͉̗̩̳̟̍ͫͥͨe̠̅s ͎a̧͈͖r̽̾̈́͒͑e n​ot rè̑ͧ̌aͨl̘̝̙̃ͤ͂̾̆ ZA̡͊͠͝LGΌ ISͮ̂҉̯͈͕̹̘̱ TO͇̹̺ͅƝ̴ȳ̳ TH̘Ë͖́̉ ͠P̯͍̭O̚​N̐Y̡ H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ

Postby ucim » Mon Feb 02, 2015 11:14 pm UTC

At the risk of summoning Cthulhu, clearly there is a subset of HTML code that is amenable to parsing by regex, and if you know that the HTML in question belongs to this subset, you should be fine with it. Is there, however, a way to determine whether or not a particular piece of HTML belongs in this set? It's ok if it rejects HTML that could have been parsed by a clever enough regex, but it's not ok if it mistakenly admits one of the Great Horrors.

(If it is possible to do this, then is it also possible to do this with regex itself?)

Because then you could do:

input (&HTML);
if (check_for_evil(HTML) === no_evil)
{ parse_with_regex(HTML);
}
else
{ throw (bobcat);
}

Jose
Order of the Sillies, Honoris Causam - bestowed by charlie_grumbles on NP 859 * OTTscar winner: Wordsmith - bestowed by yappobiscuts and the OTT on NP 1832 * Ecclesiastical Calendar of the Order of the Holy Contradiction * Heartfelt thanks from addams and from me - you really made a difference.

Postby karhell » Tue Feb 03, 2015 10:28 am UTC

The thing is: your check_for_evil method is going to need to parse the HTML anyway (and many languages already have tools designed to do just that), so why not take advantage of that parsing to do what the regex was going to do, but in a single pass?

That said, if the HTML you're looking at has a well-defined structure that you know beforehand, then regexes are perfectly fine.
AluisioASG wrote:191 years ago, the great D. Pedro I drew his sword and said: "Indent thy code or die!"
lmjb1964 wrote:We're weird but it's okay.
ColletArrow, katakissa, iskinner, thunk, GnomeAnne, Quantized, and any other Blitzers, have fun on your journey!

Postby ucim » Tue Feb 03, 2015 1:17 pm UTC

karhell wrote:The thing is: your check_for_evil method is going to need to parse the HTML anyway (and many languages already have tools designed to do just that), so why not take advantage of that parsing to do what the regex was going to do, but in a single pass?
You don't have to fully parse it in order to determine good or evil (depending on how you define it). This means that a check_for_evil could in theory be written to be much faster than a full parse.

In theory of course there's no difference between theory and practice. In practice there is.

Jose

Postby Sizik » Tue Feb 03, 2015 3:37 pm UTC

I think the easiest way to check whether it could be regexable would be to make sure that there are no self-nested tags, e.g. a <div> within a <div>.
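A rough sketch of that check in Python, using the standard library's `html.parser.HTMLParser` to track which tags are currently open. The class and function names are invented for this example, and the test is deliberately one-sided: HTML that relies on implicit closing (say, two `<p>` tags in a row) gets flagged as self-nested even though a browser would cope with it.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag, so they cannot contain anything.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class SelfNestingDetector(HTMLParser):
    """Records whether any tag is opened while the same tag is still open."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.self_nested = False

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if tag in self.open_tags:
            self.self_nested = True
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag, if there is one.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass


def has_self_nested_tags(html: str) -> bool:
    detector = SelfNestingDetector()
    detector.feed(html)
    detector.close()
    return detector.self_nested


print(has_self_nested_tags("<div><p>hi</p></div>"))      # False
print(has_self_nested_tags("<div><div>hi</div></div>"))  # True
```

Note that the check itself leans on a real tokenizer rather than a regex, which is more or less karhell's objection.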
she/they
gmalivuk wrote:
King Author wrote:If space (rather, distance) is an illusion, it'd be possible for one meta-me to experience both body's sensory inputs.
Yes. And if wishes were horses, wishing wells would fill up very quickly with drowned horses.

Postby EvanED » Tue Feb 03, 2015 7:03 pm UTC

I think the easiest way to determine whether it's regexable is to check whether there are any < symbols; if there aren't, then you can parse the HTML. That's the fun of one-sided tests!

(I'm being pretty facetious here, but I'm also trying to make a point. Even a regex for "there are no nested tags" would be much uglier than is worth it, unless you have a very small number of tags.)

Postby Thesh » Tue Feb 03, 2015 7:18 pm UTC

That's why someone needs to invent some sort of SGML or XML parser that handles nesting.
Summum ius, summa iniuria.

Postby ucim » Tue Feb 03, 2015 7:52 pm UTC

EvanED wrote:(I'm being pretty facetious here, but I'm also trying to make a point. Even a regex for "there are no nested tags" would be much uglier than is worth it, unless you have a very small number of tags.)
Well, the test doesn't have to be a regex. However, then the parser doesn't either, and I suppose you defeat the purpose of using regex.

What would be the purpose of using a regex rather than a function?

Jose

Postby Yakk » Tue Feb 03, 2015 9:07 pm UTC

RegEx is great when you are like "I want a raw text search, but wait, maybe a bit more".

Then, if it really is just a bit more, use RegEx.

Use it as a slight improvement over a normal case sensitive/insensitive text search.

Anything beyond that, you are using your familiarity with a tool to justify treating things as targets for the tool. If only there was a more pithy way of saying that.
One of the painful things about our time is that those who feel certainty are stupid, and those with any imagination and understanding are filled with doubt and indecision - BR

Last edited by JHVH on Fri Oct 23, 4004 BCE 6:17 pm, edited 6 times in total.

Postby ucim » Tue Feb 03, 2015 11:54 pm UTC

Yakk wrote:Anything beyond that, you are using your familiarity with a tool to justify treating things as targets for the tool. If only there was a more pithy way of saying that.
Like using a jackhammer to cut a doorway in a treehouse?

What would you recommend, then, for validating an email address? Right now I am using an ugly regex:
Spoiler:
I've seen uglier

Code:

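// Local part: a run of letters, digits, _ % +, an optional dot, then up to 63 more characters.
// Domain: one or more dot-terminated labels, then a 2-4 letter TLD (or museum/travel).
$regex="/^(?:[A-Za-z0-9_%+]+\.?)[A-Za-z0-9_%+-]{0,63}"
   ."@".
   "(?:(?:[A-Za-z0-9]\.)|(?:[A-Za-z0-9][A-Za-z0-9-]{0,61}[A-Za-z0-9]\.))+(?:[A-Za-z]{2,4}|museum|travel)$/";
// preg_match returns 1 on a match, 0 on no match, and false on error.
if (preg_match($regex, $email))
{  $input_ok=TRUE;
}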

after checking length and stripping tags. I know it rejects the following good emails:
&'*+-./=?^_{}~@other-valid-characters-in-local.net
mixed-1234-in-{+^}-local@sld.net
local@sld.newTLD
punycode-numbers-in-tld@sld.xn--3e0b707e
"quoted"@sld.com
"\e\s\c\a\p\e\d"@sld.com
"quoted-at-sign@sld.org"@sld.com
"escaped\"quote"@sld.com
"back\slash"@sld.com
bracketed-IP-instead-of-domain@[127.0.0.1]

But I don't really care. It's for validating input from a form, and anybody trying tricks like that should be fed a bobcat for breakfast.
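For anyone who wants to poke at the pattern outside PHP, here is a hedged Python port. The names `PATTERN` and `accepted` are invented for this sketch, and `re.fullmatch` stands in for the `^...$` anchors, so behaviour around a trailing newline may differ slightly from `preg_match`.

```python
import re

# Python port of the PHP pattern above; fullmatch replaces the ^...$ anchors.
PATTERN = re.compile(
    r"(?:[A-Za-z0-9_%+]+\.?)[A-Za-z0-9_%+-]{0,63}"
    r"@"
    r"(?:(?:[A-Za-z0-9]\.)|(?:[A-Za-z0-9][A-Za-z0-9-]{0,61}[A-Za-z0-9]\.))+"
    r"(?:[A-Za-z]{2,4}|museum|travel)"
)


def accepted(email: str) -> bool:
    """True if the whole string matches the ported pattern."""
    return PATTERN.fullmatch(email) is not None


samples = [
    "fred@fred.com",
    "local@sld.newTLD",
    '"quoted"@sld.com',
    "bracketed-IP-instead-of-domain@[127.0.0.1]",
    "first.middle.last@sld.com",
]
for sample in samples:
    verdict = "accepted" if accepted(sample) else "rejected"
    print(f"{sample!r}: {verdict}")
```

Only the first sample is accepted. The last one is not on the list above, but it is rejected too, because the local part of the pattern allows at most one dot.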

And yes, I'm going to have to stay on top of all those new TLDs being cooked up. Might as well just open up the field completely. :/

Got a better way?

Jose

Postby Thesh » Wed Feb 04, 2015 12:03 am UTC

Summum ius, summa iniuria.

Postby EvanED » Wed Feb 04, 2015 4:26 am UTC

For most applications, I find convincing the argument that the best way to validate an email address is to... send email to it. (That is, the usual confirmation-email step.)

Postby ucim » Wed Feb 04, 2015 3:24 pm UTC

Thanks Thesh. Where can I find the logic behind this function? It validates nul and anyoldplace as valid TLDs. Granted, the regex lets nul through, but at least I know what I'm admitting. With the PHP filter, I don't. (I would have expected at least a current TLD lookup to occur for a maintained PHP function).

As for sending mail to it, delivery of good email is not guaranteed (hotmail, I'm looking at you), and bouncing of bad email is not guaranteed. In any case, neither is timely.

eta: the filter misses
"Ugh, this guy again" <john(the (unofficial) boss!)@(shitty anyway)mycompany.com(really shitty)>
and flags it as bogus. Well, it should be bogus, but according to this thread, it is supposed to be valid.

It also boguses
Fred <fred@fred.com>
alfred@alfred.com <Alfred>


which is what I think I would want.... actually, on second thought, I'd like a trinary output. "Pure", "valid but contaminated", "bogus".
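A trinary check along those lines could be sketched as follows in Python. Everything here is an assumption made for illustration: the `Verdict` names, the deliberately loose `BARE` pattern, and the simplification that comments never appear inside quoted strings. It is not what the PHP filter does.

```python
import re
from enum import Enum


class Verdict(Enum):
    PURE = "pure"
    CONTAMINATED = "valid but contaminated"
    BOGUS = "bogus"


# Loose pattern for a bare local@domain address (illustrative only).
BARE = re.compile(r"[A-Za-z0-9._%+-]+@(?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,}")
# Optional display name followed by an <angle-bracketed> address.
ANGLE = re.compile(r"[^<>]*<([^<>]+)>")


def strip_comments(text):
    """Drop nested (comments); return None if the parentheses are unbalanced."""
    kept, depth = [], 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            if depth == 0:
                return None
            depth -= 1
        elif depth == 0:
            kept.append(ch)
    return "".join(kept) if depth == 0 else None


def classify(raw: str) -> Verdict:
    raw = raw.strip()
    if BARE.fullmatch(raw):
        return Verdict.PURE
    # Peel off a display name, then any comments, and look again.
    match = ANGLE.fullmatch(raw)
    candidate = strip_comments(match.group(1) if match else raw)
    if candidate is not None and BARE.fullmatch(candidate.strip()):
        return Verdict.CONTAMINATED
    return Verdict.BOGUS


print(classify("fred@fred.com").value)               # pure
print(classify("Fred <fred@fred.com>").value)        # valid but contaminated
print(classify("alfred@alfred.com <Alfred>").value)  # bogus
```

With this sketch the commented "Ugh, this guy again" address from earlier in the thread comes out as "valid but contaminated", so a form could accept it while storing only the cleaned-up address.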

Jose

Postby EvanED » Wed Feb 04, 2015 6:08 pm UTC

ucim wrote:(I would have expected at least a current TLD lookup to occur for a maintained PHP function).
But why? (See below.)

As a concrete example of a problem, consider someone using PHP, I dunno, 5.5. ICANN comes along and adds a new TLD. The person on 5.5 doesn't upgrade because it's working for him. Now people with the new TLD can't sign up on the site of this person.

ucim wrote:As for sending mail to it, delivery of good email is not guaranteed (hotmail, I'm looking at you), and bouncing of bad email is not guaranteed. In any case, neither is timely.
Here's "my" argument.

What is validation protecting against? People typoing their address, or people deliberately entering bad emails? Tight validation helps with almost neither. If I typo my address, I'm almost certain to typo it to something else that is syntactically valid, e.g., evaan@example.com. Maybe you could typo it as evan@example.con or something, but you're not going to catch most typos. If you are trying to protect against malice, it's trivial for someone who doesn't want to give you a real address to provide a fake-but-syntactically-fine address that will thus pass your syntactic validation.

If you are collecting email addresses because you want to send emails to them, the only way to determine if you actually can (and that it'll go to the right person) is to send an email to it. If you are not collecting email addresses because you want to send emails to them, then why are you collecting them?
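One way the "send an email to it" step is often wired up is with a token that ties the confirmation link to a single address. The Python sketch below covers only the token part; `SERVER_SECRET`, `make_token` and `confirm` are hypothetical names, and actually sending the mail is left out.

```python
import hashlib
import hmac
import secrets

# Hypothetical server-side secret; a real deployment would load it from
# configuration instead of generating a fresh one at import time.
SERVER_SECRET = secrets.token_bytes(32)


def make_token(email: str) -> str:
    """Derive a token that ties a confirmation link to one address."""
    return hmac.new(SERVER_SECRET, email.encode("utf-8"), hashlib.sha256).hexdigest()


def confirm(email: str, token: str) -> bool:
    """True only if the token was issued for exactly this address."""
    return hmac.compare_digest(make_token(email), token)


token = make_token("evan@example.com")
print(confirm("evan@example.com", token))   # True
print(confirm("evaan@example.com", token))  # False
```

A typo such as evaan@example.com passes any syntactic check, but whoever receives that mail cannot confirm the address the user actually meant.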

Postby ucim » Wed Feb 04, 2015 6:45 pm UTC

EvanED wrote:What is validation protecting against?
Primarily somebody inserting evil into an address, such as a script or something I never thought of. It's part of "never trust user data". It is also a(n imperfect) way to ensure that the address is at least complete. Can't catch all errors, but at least I can get some.

EvanED wrote:If you are collecting email addresses because you want to send emails to them, the only way to determine if you actually can (and that it'll go to the right person) is to send an email to it. If you are not collecting email addresses because you want to send emails to them, then why are you collecting them?
In the case that I want to send them emails, they have to confirm by replying before I'll set up the account. But in the case of their supplying contact info to the public, this is impractical. Ditto telephone numbers - I'm not going to call people to ensure the phone rings and I don't think people want me to do that!

Jose

Postby Xanthir » Thu Feb 05, 2015 11:02 pm UTC

ucim wrote:(I would have expected at least a current TLD lookup to occur for a maintained PHP function).

There's not really any such thing as "the list of TLDs". Even before ICANN opened up the list for anyone with a few tens of thousands of dollars, the list still changed regularly due to the list of country TLDs constantly being fiddled with.
(defun fibs (n &optional (a 1) (b 1)) (take n (unfold '+ a b)))

Postby bittyx » Sat Feb 07, 2015 8:34 am UTC

There's also this: http://blog.mailgun.com/free-email-vali ... web-forms/ - I've never used it, as I haven't had a need for it, but I've heard about it, and it sounds okay. In the first comment on that article they also mention that they do not store any email addresses, but you'd still probably have to read through their Privacy Policy first, to be sure.

EDIT: Assuming you're using Composer (and you really should be!), there are a lot of results when searching for an email validator. This package has 100K+ installs: https://github.com/egulias/EmailValidator and it seems to be thoroughly researched, and even checks the DNS records for the domain in the email address.

Also, somewhat related.

