email: RFC 2047 encoded-word in an addr-spec local-part corrupts address parsing #152519

@serhiy-storchaka

Bug report

The modern email parser (email._header_value_parser, used by email.policy.default and email.headerregistry) decodes an RFC 2047 encoded-word that appears inside an addr-spec local-part, and for an obsolete local-part it then re-parses the decoded text. RFC 2047 §5 forbids encoded-words in an addr-spec, and re-parsing decoded text corrupts the result. The corruption is most visible when the encoded-word decodes to a character that is special in the address grammar (e.g. =40 decodes to @).

Reproducers

Reproduced on current main (3.16.0a0); the code is unchanged since the parser was introduced, so all maintained versions are affected. The legacy compat32 parser does not decode encoded-words during address parsing and is unaffected.
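For contrast, a minimal sketch of the legacy behaviour: under policy.compat32 a header lookup returns the raw source string, so the encoded-word stays in its encoded form. This only illustrates the header lookup and does not exercise any address parsing.

```python
from email import message_from_string, policy

# Same header as in the reproducers, parsed with the legacy compat32 policy.
source = "To: =?utf-8?q?admin?=@example.com\n\n"
legacy = message_from_string(source, policy=policy.compat32)

# compat32 hands back the header value as a plain str; the encoded-word
# is left exactly as it appeared in the source.
legacy_to = legacy["to"]
print(repr(legacy_to))
```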

An encoded-word in the local-part is silently decoded (already a violation of RFC 2047 §5):

>>> from email import message_from_string, policy
>>> m = message_from_string("To: =?utf-8?q?admin?=@example.com\n\n", policy=policy.default)
>>> m['to'].addresses[0].username
'admin'

When the encoded-word decodes to a bare special, the re-parse misaligns and the address is silently corrupted to the null address:

>>> # =40 is the Q-encoding of '@'
>>> m = message_from_string("To: =?utf-8?q?a=40b?=c@host.com\n\n", policy=policy.default)
>>> m['to'].addresses[0]
Address(display_name='', username='', domain='')

The same input through headerregistry.Address leaks an internal parse error:

>>> from email.headerregistry import Address
>>> Address(addr_spec='=?utf-8?q?a=40b?=c@host.com')
Traceback (most recent call last):
  ...
email.errors.HeaderParseError: Invalid Domain

Cause

get_local_part() parses an obsolete local-part by rendering the already-parsed tokens back to text with str() and re-parsing that text:

obs_local_part, value = get_obs_local_part(str(local_part) + value)

str(local_part) decodes any encoded-word, so the text handed back to get_obs_local_part() differs from the source. When the decoded text contains a special such as @, the re-parse stops in the middle of the rendered prefix, leaving spurious decoded characters in the remainder and corrupting the address (and, downstream, raising Invalid Domain).
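The mismatch between source and rendered text can be illustrated without touching the private parser. The sketch below uses the public email.header.decode_header() to decode only the encoded-word from the reproducer and then appends the rest of the source. It approximates the text that the re-parse sees and is not the parser's actual code path.

```python
from email.header import decode_header

encoded_word = "=?utf-8?q?a=40b?="   # =40 is the Q-encoding of '@'
rest_of_source = "c@host.com"

# decode_header() returns a list of (payload, charset) pairs; a lone
# encoded-word yields a single pair holding its decoded bytes.
[(payload, charset)] = decode_header(encoded_word)
decoded = payload.decode(charset)

# Approximation of the text handed to the re-parse: decoded prefix + remainder.
rendered = decoded + rest_of_source
print(decoded)              # a@b
print(rendered)             # a@bc@host.com
print(rendered.count("@"))  # 2
```

The source contains a single @, but the rendered text contains two, so a parser that stops at the first special ends the local-part inside the decoded prefix.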

Per RFC 2047 §5 an encoded-word may only appear in place of text, inside a comment, or in place of a word within a phrase (the display-name). It may never appear inside an addr-spec, so the local-part should not be decoded at all.
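Until the parser enforces this, a caller could screen input before handing it to the email package. The following is a rough sketch only: local_part_has_encoded_word is a hypothetical helper, not part of the email package, the regular expression is a simplified approximation of the encoded-word shape, and the input is assumed to be a bare addr-spec with no display-name or angle brackets.

```python
import re

# Simplified shape of an RFC 2047 encoded-word: =?charset?encoding?encoded-text?=
ENCODED_WORD = re.compile(r"=\?[^?\s]+\?[bBqQ]\?[^?\s]*\?=")


def local_part_has_encoded_word(addr_spec: str) -> bool:
    """Hypothetical helper: report an encoded-word before the last '@'.

    Expects a bare addr-spec. If there is no '@' at all, the local-part
    is taken to be empty and the result is False.
    """
    local_part, _, _ = addr_spec.rpartition("@")
    return ENCODED_WORD.search(local_part) is not None


for candidate in (
    "admin@example.com",
    "=?utf-8?q?admin?=@example.com",
    "=?utf-8?q?a=40b?=c@host.com",
):
    print(candidate, local_part_has_encoded_word(candidate))
```

Addresses flagged by such a check can be rejected up front, instead of relying on how the parser handles them.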

Relationship to GH-136063

Found while working on GH-136063: get_local_part is one of the quadratic-complexity spots there. The fix re-parses the obsolete local-part from the original source text instead of from the decoded str() rendering, which also removes this corruption: an encoded-word in a local-part is then simply reported as an invalid address rather than decoded and misparsed.
