Bug report
The modern email parser (email._header_value_parser, used by email.policy.default and email.headerregistry) decodes an RFC 2047 encoded-word that appears inside an addr-spec local-part, and for an obsolete local-part it then re-parses the decoded text. RFC 2047 §5 forbids encoded-words in an addr-spec, and re-parsing decoded text corrupts the result — most visibly when the encoded-word decodes to a character that is special in the address grammar (e.g. =40 decodes to @).
Reproducers
Reproduced on current main (3.16.0a0); the code is unchanged since the parser was introduced, so all maintained versions are affected. The legacy compat32 parser does not decode encoded-words during address parsing and is unaffected.
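For contrast with the legacy path, here is a minimal sketch that parses the same header under policy.compat32 and then splits it with email.utils.parseaddr(). It only illustrates that the legacy parser keeps the encoded-word as literal text; the variable names are illustrative and not part of the report.

```python
from email import message_from_string, policy
from email.utils import parseaddr

RAW_TO = "=?utf-8?q?admin?=@example.com"

# Under compat32 a header value comes back as the raw source string.
legacy_msg = message_from_string(f"To: {RAW_TO}\n\n", policy=policy.compat32)
legacy_value = legacy_msg["to"]

# parseaddr() splits the address but does not decode RFC 2047 encoded-words.
display_name, addr_spec = parseaddr(legacy_value)

print(repr(legacy_value))
print(repr(addr_spec))
```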
An encoded-word in the local-part is silently decoded (already a violation of RFC 2047 §5):
>>> from email import message_from_string, policy
>>> m = message_from_string("To: =?utf-8?q?admin?=@example.com\n\n", policy=policy.default)
>>> m['to'].addresses[0].username
'admin'
When the encoded-word decodes to a bare special, the re-parse misaligns and the address is silently corrupted to the null address:
>>> # =40 is the Q-encoding of '@'
>>> m = message_from_string("To: =?utf-8?q?a=40b?=c@host.com\n\n", policy=policy.default)
>>> m['to'].addresses[0]
Address(display_name='', username='', domain='')
The same input through headerregistry.Address leaks an internal parse error:
>>> from email.headerregistry import Address
>>> Address(addr_spec='=?utf-8?q?a=40b?=c@host.com')
Traceback (most recent call last):
...
email.errors.HeaderParseError: Invalid Domain
Cause
get_local_part() parses an obsolete local-part by rendering the already-parsed tokens back to text with str() and re-parsing that text:
obs_local_part, value = get_obs_local_part(str(local_part) + value)
str(local_part) decodes any encoded-word, so the text handed back to get_obs_local_part() differs from the source. When the decoded text contains a special such as @, the re-parse stops in the middle of the rendered prefix, leaving spurious decoded characters in the remainder and corrupting the address (and, downstream, raising Invalid Domain).
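The misalignment can be modelled without touching the parser internals. The sketch below is a simplified illustration, not the parser's code: it decodes the encoded-word with the public email.header.decode_header(), concatenates the decoded text with the rest of the source (mirroring str(local_part) + value), and uses str.partition() as a stand-in for a local-part scan that stops at the first @.

```python
from email.header import decode_header

ENCODED_WORD = "=?utf-8?q?a=40b?="
REST_OF_SOURCE = "c@host.com"  # source text that follows the encoded-word

# Decode the encoded-word, as a str() rendering of the token would.
(raw_bytes, charset), = decode_header(ENCODED_WORD)
decoded = raw_bytes.decode(charset)

# Simplified model of the re-parse input: decoded prefix + remaining source.
reparsed_input = decoded + REST_OF_SOURCE

# Stand-in for the local-part scan: stop at the first '@'.
local_part, _, remainder = reparsed_input.partition("@")

# The first '@' lies inside the decoded prefix, not in the original source.
stops_inside_prefix = reparsed_input.index("@") < len(decoded)

print(decoded)         # a@b
print(reparsed_input)  # a@bc@host.com
print(local_part)      # a
print(remainder)       # bc@host.com
```

In this model the scan consumes only "a", and the leftover "bc@host.com" begins with characters that never appeared undecoded in the source, which matches the corruption described above.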
Per RFC 2047 §5 an encoded-word may only appear in place of text, inside a comment, or in place of a word within a phrase (the display-name) — never inside an addr-spec. So the local-part should not be decoded at all.
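As an illustration of that rule, the following is a minimal sketch of a pre-check that flags an addr-spec whose local-part contains something shaped like an encoded-word. The helper name local_part_has_encoded_word and the regular expression are hypothetical, are deliberately looser than the RFC 2047 grammar, and are not part of the email package or of the proposed fix.

```python
import re

# Loose RFC 2047 encoded-word shape: =?charset?encoding?encoded-text?=
ENCODED_WORD_RE = re.compile(r"=\?[^?\s]+\?[bBqQ]\?[^?\s]*\?=")


def local_part_has_encoded_word(addr_spec: str) -> bool:
    """Hypothetical helper: True if the text before the last '@'
    contains something shaped like an encoded-word."""
    local_part, sep, _domain = addr_spec.rpartition("@")
    if not sep:
        # No '@' at all: treat the whole string as the local-part.
        local_part = addr_spec
    return ENCODED_WORD_RE.search(local_part) is not None


print(local_part_has_encoded_word("=?utf-8?q?admin?=@example.com"))  # True
print(local_part_has_encoded_word("=?utf-8?q?a=40b?=c@host.com"))    # True
print(local_part_has_encoded_word("admin@example.com"))              # False
```

A caller on an affected version could, under these assumptions, use such a check to reject the input before handing it to the modern parser, instead of relying on the decoded result.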
Relationship to GH-136063
Found while working on GH-136063: get_local_part is one of the quadratic-complexity spots there. The fix re-parses the obsolete local-part from the original source text instead of from the decoded str() rendering, which also removes this corruption: an encoded-word in a local-part is then simply reported as an invalid address rather than decoded and misparsed.
Linked PRs