[tor-dev] Proposal 285: Directory documents should be standardized as UTF-8

Discussion:

Nick Mathewson

2017-11-13 18:51:27 UTC

Filename: 285-utf-8.txt
Title: Directory documents should be standardized as UTF-8
Author: Nick Mathewson
Created: 13 November 2017
Status: Open

1. Summary and motivation

People frequently want to include non-ASCII text in their router
descriptors. The Contact line is a favorite place to do this, but in
principle the platform line would also be pretty logical.

Unfortunately, there's no specified way to encode non-ASCII in our
directory documents.

Fortunately, almost everybody who does it uses UTF-8 anyway.

As we move towards Rust support in Tor, we gain another motivation
for standardizing on UTF-8, since Rust's native strings strongly prefer
UTF-8.

So, in this proposal, we describe a migration path to having all
directory documents be fully UTF-8.

2. Proposal

First, we should have Tor relays reject ContactInfo lines (and any
other lines copied directly into router descriptors) that are not
UTF-8.

At the same time, we should have authorities reject any router
descriptors or extrainfo documents that are not valid UTF-8.
Simultaneously, we can have all Tor instances reject all
non-directory-descriptor directory documents that are not UTF-8,
since none should exist today.

Finally, once the authorities have updated, we should have all Tor
instances reject all directory documents that are not UTF-8. (We
should not take this step until the authorities have upgraded, or
else the behavior of updated and non-updated clients could be
distinguished.)
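
As a rough illustration of the check described in this section, here is a
minimal Rust sketch. It assumes the document (or a single ContactInfo line)
is available as a raw byte slice. The function name
`first_invalid_utf8_offset` is hypothetical and is not part of Tor.

```rust
use std::str;

/// Returns `None` if `doc` is valid UTF-8, otherwise the byte offset of the
/// first invalid sequence. Hypothetical helper, not Tor's actual code.
fn first_invalid_utf8_offset(doc: &[u8]) -> Option<usize> {
    match str::from_utf8(doc) {
        Ok(_) => None,
        // `valid_up_to` is the length of the longest valid UTF-8 prefix.
        Err(e) => Some(e.valid_up_to()),
    }
}
```

A relay or authority could reject the input whenever this returns `Some`,
and use the offset in the log message. A Latin-1 encoded contact line is a
typical input that would be rejected this way.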

2.1. Hidden service descriptors' encrypted bodies

For the encrypted bodies of hidden service descriptors, we cannot
reject them at the authority level, and so we need to take a slightly
different approach to prevent client fingerprinting attacks.

First, we should make Tor instances start warning about any hidden
service descriptors whose bodies, post-decryption, contain non-utf-8
plaintext. At the same time, we add a consensus parameter to
indicate that hidden service descriptors with non-utf-8 plantexts
should be rejected entirely: "reject-encrypted-non-utf-8". If that
parameter is set to 1, then hidden service clients will not only
warn, but reject the descriptors.

Once the vast majority of clients are running versions that support
the "reject-encrypted-non-utf-8" parameter, that parameter can be set
to 1.
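
The warn-then-reject behavior above could be sketched as follows. This is
only an illustration under stated assumptions: the `Verdict` type and the
`check_plaintext` function are hypothetical names, and the sketch assumes
the caller passes 0 when the "reject-encrypted-non-utf-8" parameter is
absent from the consensus.

```rust
/// What a client does with a decrypted hidden service descriptor body.
#[derive(Debug, PartialEq)]
enum Verdict {
    Accept,
    WarnOnly,
    Reject,
}

/// `reject_param` stands for the value of the "reject-encrypted-non-utf-8"
/// consensus parameter (assumed to be 0 when the parameter is not set).
fn check_plaintext(plaintext: &[u8], reject_param: i32) -> Verdict {
    if std::str::from_utf8(plaintext).is_ok() {
        Verdict::Accept
    } else if reject_param == 1 {
        // Second phase: the parameter is set, so refuse the descriptor.
        Verdict::Reject
    } else {
        // First phase: log a warning but keep using the descriptor.
        Verdict::WarnOnly
    }
}
```

Because every supporting client reads the same consensus value, they all
switch from warning to rejecting together, which is what keeps them from
being distinguished from one another.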

teor

2017-11-13 22:28:44 UTC

How many current descriptors will be rejected as non-UTF-8?

Post by Nick Mathewson
As we move towards Rust support in Tor, we gain another motivation
for standardizing on UTF-8, since Rust's native strings strongly prefer
UTF-8.
So, in this proposal, we describe a migration path to having all
directory documents be fully UTF-8.
2. Proposal
First, we should have Tor relays reject ContactInfo lines (and any
other lines copied directly into router descriptors) that are not
UTF-8.

How do we define UTF-8?

Do we exclude all invalid byte sequences?
Do we exclude all invalid code points (some libraries don't)?
https://en.m.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences

Do we reject unassigned or reserved code points?
Do we reject private use code points?
https://en.m.wikipedia.org/wiki/Unicode#General_Category_property

How do we avoid tying ourselves to a particular version of Unicode?
(By accepting reserved code points? Some libraries don't do this.)

Will we allow a byte order mark?
(We can't during the transition, it doesn't parse as ASCII.
And we probably shouldn't for any verbatim lines, because they
are copied into the middle of the descriptor.)

How do we carry forward existing ASCII restrictions into UTF-8?

We will need to update the directory spec to acknowledge that
contact and platform lines may be parsed as UTF-8 or
ASCII-including-arbitrary-bytes-except-NUL, and that they are
terminated by single-byte newlines regardless.

How do we deal with format confusion attacks?

UTF-8 has a few alternative whitespace characters. These could
be used in an attack that confuses either humans viewing the file,
or automated software:

If a human uses a UTF-8 compatible viewer or editor, it likely shows
Unicode newlines and ASCII newlines in an identical way. Similarly,
it may show Unicode spaces and ASCII spaces in the same way.
This may confuse the human reader.

Similarly, if automated software parses using a Unicode whitespace
or newline character class, it will mis-parse directory documents.
(Our Rust protover code looks for ASCII spaces, so it appears to
be fine.)
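
To make the mis-parsing risk concrete, here is a small Rust sketch that
tokenizes the same line in two ways. Both function names are hypothetical.
The standard library's `split_whitespace` splits on every character with
the Unicode White_Space property, while splitting on `' '` only recognizes
the ASCII space.

```rust
/// Splits on the ASCII space only, as a strict directory parser would.
fn split_ascii_space(line: &str) -> Vec<&str> {
    line.split(' ').collect()
}

/// Splits on any character with the Unicode White_Space property.
fn split_unicode_whitespace(line: &str) -> Vec<&str> {
    line.split_whitespace().collect()
}
```

For a line such as "contact Alice<U+2003>Bob", where U+2003 is EM SPACE,
the first function yields two tokens and the second yields three. Two
parsers that disagree like this would see different documents.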

Note that we already have this issue with line feeds and carriage
returns, which I thought we had solved by banning carriage returns
in directory documents. But it appears we allow "any printing ASCII
character". (We will have to edit this to include Unicode.)

https://gitweb.torproject.org/torspec.git/tree/dir-spec.txt#n218

Post by Nick Mathewson
At the same time, we should have authorities reject any router
descriptors or extrainfo documents that are not valid UTF-8.
Simultaneously, we can have all Tor instances reject all
non-directory-descriptor directory documents that are not UTF-8,
since none should exist today.

If we apply the existing restrictions in dir-spec, which require
non-directory-descriptor directory documents to be ASCII, they will
also be UTF-8.

Isn't it confusing to say "UTF-8", when what we really mean is "ASCII"?
Do we expect to migrate these to non-ASCII UTF-8 at some point?

Also, does "non-directory-descriptor directory documents" mean we
can reject non-UTF-8 microdescriptors? I think we should.

Does the NS consensus contain any lines that are copied verbatim from
descriptors?

Post by Nick Mathewson
Finally, once the authorities have updated, we should have all Tor
instances reject all directory documents that are not UTF-8. (We
should not take this step until the authorities have upgraded, or
else the behavior of updated and non-updated clients could be
distinguished.)
2.1. Hidden service descriptors' encrypted bodies
For the encrypted bodies of hidden service descriptors, we cannot
reject them at the authority level, and so we need to take a slightly
different approach to prevent client fingerprinting attacks.
First, we should make Tor instances start warning about any hidden
service descriptors whose bodies, post-decryption, contain non-utf-8
plaintext. At the same time, we add a consensus parameter to
indicate that hidden service descriptors with non-utf-8 plantexts

typo: plaintexts

We also can't reject bridge descriptors at the authority level.
(Bridge clients download bridge descriptors directly from bridges.)
Do we need bridge clients to also use this consensus parameter?

T

Nick Mathewson

2018-01-09 17:34:06 UTC

Hi, Teor, and sorry for the long delay! You had a lot of good
questions on this proposal, and I didn't know how to answer them all.
So in hopes of making progress here, I'm taking wild guesses and
asking for help in making the wild guesses better :)

Post by teor
How many current descriptors will be rejected as non-UTF-8?

I think that when last I checked, the number was something like 3.

Post by teor
How do we define UTF-8?

I tried to do so as follows:

We define the allowable set of UTF-8 as:
* Encoding the codepoints U+0001 through U+10FFFF,
* but excluding the codepoints U+D800 through U+DFFF,
* each encoded with the shortest possible encoding,
* without any BOM.

Are there other restrictions we should make? If so, how should we phrase them?
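
The definition above can be sketched in Rust as follows. The function name
`is_allowed_utf8` is hypothetical. The sketch also assumes that "without
any BOM" means "no leading U+FEFF"; rejecting U+FEFF anywhere in the input
would be a stricter reading.

```rust
/// Checks `bytes` against the rules listed above. Hypothetical helper.
fn is_allowed_utf8(bytes: &[u8]) -> bool {
    // `from_utf8` accepts only well-formed UTF-8: shortest-form encodings
    // of U+0000..=U+10FFFF with the surrogates U+D800..=U+DFFF excluded.
    let text = match std::str::from_utf8(bytes) {
        Ok(t) => t,
        Err(_) => return false,
    };
    // The allowed range starts at U+0001, so U+0000 (NUL) is excluded.
    if text.contains('\0') {
        return false;
    }
    // Assumption: "without any BOM" means no leading U+FEFF.
    !text.starts_with('\u{FEFF}')
}
```

With this check, overlong encodings (for example the bytes C0 AF),
encoded surrogates (ED A0 80), and values above U+10FFFF (F4 90 80 80) are
all rejected by the `from_utf8` step, and only the NUL and BOM rules need
extra code.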

[...]

Post by teor
How do we carry forward existing ASCII restrictions into UTF-8?

I don't understand this question.

Post by teor
We will need to update the directory spec to acknowledge that
contact and platform lines may be parsed as UTF-8 or
ASCII-including-arbitrary-bytes-except-NUL, and that they are
terminated by single-byte newlines regardless.

Ack.

Post by teor
How do we deal with format confusion attacks?
UTF-8 has a few alternative whitespace characters. These could
be used in an attack that confuses either humans viewing the file,
or automated software:
If a human uses a UTF-8 compatible viewer or editor, it likely shows
Unicode newlines and ASCII newlines in an identical way. Similarly,
it may show Unicode spaces and ASCII spaces in the same way.
This may confuse the human reader.

Right. I don't see an obvious attack here, but we should keep it in mind.

Do you have a different suggestion of what to do here?

Post by teor
Similarly, if automated software parses using a Unicode whitespace
or newline character class, it will mis-parse directory documents.
(Our Rust protover code looks for ASCII spaces, so it appears to
be fine.)
Note that we already have this issue with line feeds and carriage
returns, which I thought we had solved by banning carriage returns
in directory documents. But it appears we allow "any printing ASCII
character". (We will have to edit this to include Unicode.)

Also let's consider all the nonprinting ASCII: it's already a
potential display problem if you're using a bad editor, or whatever.

Post by teor
https://gitweb.torproject.org/torspec.git/tree/dir-spec.txt#n218

Post by Nick Mathewson
At the same time, we should have authorities reject any router
descriptors or extrainfo documents that are not valid UTF-8.
Simultaneously, we can have all Tor instances reject all
non-directory-descriptor directory documents that are not UTF-8,
since none should exist today.

Post by teor
If we apply the existing restrictions in dir-spec, which require
non-directory-descriptor directory documents to be ASCII, they will
also be UTF-8.
Isn't it confusing to say "UTF-8", when what we really mean is "ASCII"?
Do we expect to migrate these to non-ASCII UTF-8 at some point?

I think having non-ASCII in extrainfos is a reasonable possibility.
I'm not so sure about the others: there could be reasons in the
future.

My rationale for declaring everything to be UTF-8 was that it seemed
more reasonable to have a single set of rules for parsing everything
than to have different rules for different documents.

Post by teor
Also, does "non-directory-descriptor directory documents" mean we
can reject non-UTF-8 microdescriptors? I think we should.

I think so.

Post by teor
Does the NS consensus contain any lines that are copied verbatim from
descriptors?

I don't think so.

[...]

Post by Nick Mathewson
should be rejected entirely: "reject-encrypted-non-utf-8". If that
parameter is set to 1, then hidden service clients will not only
warn, but reject the descriptors.
Once the vast majority of clients are running versions that support
the "reject-encrypted-non-utf-8" parameter, that parameter can be set
to 1.

Post by teor
We also can't reject bridge descriptors at the authority level.
(Bridge clients download bridge descriptors directly from bridges.)
Do we need bridge clients to also use this consensus parameter?

I added an extra section for this, basically saying "bridge clients
should do that too":

2.2. Bridge descriptors

Since clients download bridge descriptors directly from the bridges, they
also need a two-phase plan as for hidden service descriptors above. Here
we take the same approach as in section 2.1 above, except using the
parameter "reject-bridge-descriptor-non-utf-8".

teor

2018-01-10 00:19:54 UTC

Post by Nick Mathewson

I think that when last I checked, the number was something like 3.

* Encoding the codepoints U+0001 through U+10FFFF,
* but excluding the codepoints U+D800 through U+DFFF,

These are called "Unicode Scalar Values".
https://www.unicode.org/glossary/#unicode_scalar_value

Let's reference that.

Post by Nick Mathewson
* each encoded with the shortest possible encoding.
* without any BOM
Are there other restrictions we should make? If so, how should we phrase them?

These seem fine, and not tied to a particular unicode version.

But I don't know enough about Unicode to know if there is anything else we should
specify.

I know how we'd do this in C (raw bytes with a check before parsing), and I think
we can do this in Rust using char:
https://doc.rust-lang.org/1.0.0/unicode/char/

Here are some other things we might want to document:

Unassigned Code Points

Accepting arbitrary unassigned unicode code points may cause issues for some
parsers, because as far as I am aware, parsers typically only handle a particular
unicode version. We should note this in the spec.

The potential attack here is that Tor accepts a newly introduced character, and
a downstream parser rejects it. But that's not Tor's problem.

The right way for parsers to handle this is to replace unknown characters with
an appropriate replacement character. (Unicode has rules for this.) Or possibly
throw an error. We can't make this decision for them: it depends on the goals of
the parser.

Equality and Normalisation

We should also make sure that equality is specified as byte-for-byte equality.
This means that several different byte sequences could be visually similar, and
even have identical normalised forms, but we would treat them as different.

Unicode has several levels of normalisation:
https://en.wikipedia.org/wiki/Unicode_equivalence#Normalization
We should not require any of them in our inputs.
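
As a small illustration of byte-for-byte equality, here is a Rust sketch
with a hypothetical helper `same_bytes`. Rust's built-in `==` on `&str`
already compares bytes; the helper only makes that explicit.

```rust
/// Byte-for-byte equality, with no Unicode normalisation applied.
fn same_bytes(a: &str, b: &str) -> bool {
    a.as_bytes() == b.as_bytes()
}
```

Under this rule, "Zo\u{00EB}" (precomposed, 4 bytes) and "Zoe\u{0308}"
(letter plus combining diaeresis, 5 bytes) compare as different, even
though most viewers render them identically and they share a normalised
form.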

Again, normalisation may be a potential issue for parsers. Again, we can't
decide how they will want to handle it, but we should document it.

Also, if we change our minds about this in future, we can make tor relays
normalise the contents of their descriptors, and the authority implementation
will continue to work. And then we can make authorities reject non-normalised
inputs a few releases later.

Post by Nick Mathewson
[...]

Post by teor
How do we carry forward existing ASCII restrictions into UTF-8?

Post by Nick Mathewson
I don't understand this question.

I think it was intended as a general question.
Then I wrote some specific questions.

Post by Nick Mathewson

Ack.

Right. I don't see an obvious attack here, but we should keep it in mind.
Do you have a different suggestion of what to do here?

No, I really think this is like the potential parser bugs: not our problem.
People should get a better editor. And editors should get better.

Post by Nick Mathewson

Also let's consider all the nonprinting ASCII: it's already a
potential display problem if you're using a bad editor, or whatever.

Yes. Just like we can't decide how editors or parsers handle bad ASCII,
we can't decide how they handle bad (or new) Unicode.

T

--
Tim Wilson-Brown (teor)

teor2345 at gmail dot com
PGP C855 6CED 5D90 A0C5 29F6 4D43 450C BA7F 968F 094B
ricochet:ekmygaiu4rzgsk6n
------------------------------------------------------------------------

Alex Xu

2018-01-10 01:36:22 UTC

Quoting teor (2018-01-10 00:19:54)

Post by teor
These are called "Unicode Scalar Values".
https://www.unicode.org/glossary/#unicode_scalar_value
Let's reference that.

"Unicode Scalar Value" includes U+0, which I think we probably want to
exclude.

Post by teor

Post by Nick Mathewson
* each encoded with the shortest possible encoding.
* without any BOM
Are there other restrictions we should make? If so, how should we phrase them?

These seem fine, and not tied to a particular unicode version.
But I don't know enough about Unicode to know if there is anything else we should
specify.

Skimming through
https://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt, I think it
might be good to additionally forbid the code points listed at the end:
U+nFFF{E,F} for n = 0..10, and U+FDD0 through U+FDEF.
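
A check for those code points might look like the Rust sketch below. The
function names are hypothetical, and the sketch reads "n = 0..10" as
hexadecimal, that is, planes 0 through 16. These are the code points that
Unicode calls noncharacters.

```rust
/// True for the 66 Unicode noncharacters: U+FDD0..=U+FDEF and the last two
/// code points of each of the 17 planes (U+nFFFE and U+nFFFF).
fn is_noncharacter(c: char) -> bool {
    let cp = c as u32;
    (0xFDD0..=0xFDEF).contains(&cp) || (cp & 0xFFFE) == 0xFFFE
}

/// True if any character of `text` is a noncharacter.
fn contains_noncharacter(text: &str) -> bool {
    text.chars().any(is_noncharacter)
}
```

This set is fixed by the Unicode stability policy, so forbidding it would
not tie the spec to a particular Unicode version.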

chelsea komlo

2017-11-24 21:05:30 UTC

It is great that we are identifying places to improve support for Rust
in Tor.

Along this same line of thinking, are there other places in Tor where we
will need to move to supporting UTF-8? For example, should the statefile
be UTF-8 also?

_______________________________________________
tor-dev mailing list
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev

Nick Mathewson

2018-01-09 17:19:33 UTC

Post by chelsea komlo
It is great that we are identifying places to improve support for Rust in
Tor.
Along this same line of thinking, are there other places in Tor where we
will need to move to supporting UTF-8? For example, should the statefile be
UTF-8 also?

I think we could safely say that the torrc and state files need to be
UTF-8, though that wouldn't need a design proposal. We'd want to do a
two-phase transition, though: first warning, then disallowing.

We could also, at some point, specify that the control protocol needs
to be UTF-8 too, I think. That one would need a proposal, and a
transition plan.

--
Nick