Discussion:
[tor-dev] Proposal 281: downloading microdescriptors in bulk
Nick Mathewson
2017-08-11 17:36:00 UTC
Permalink
Filename: 281-bulk-md-download.txt
Title: Downloading microdescriptors in bulk
Author: Nick Mathewson
Created: 11-Aug-2017
Status: Draft

1. Introduction

This proposal describes a ways to download more microdescriptors
at a time, using fewer bytes.

Right now, to download N microdescriptors, the client must send
about 44*N bytes in its HTTP request. Because clients can request
microdescriptors in any combination, the directory caches cannot
pre-compress responses to these requests, and need to use less
space-efficient on-the-fly compression algorithms.

Under this proposal, clients simply say "Send me the
microdescriptors I need", given what I know.

2. Combined microdescriptor downloads

2.1. By diff

If a client has a consensus with base64 sha3-256 digest X, and it
previously had a consensus with base64 sha3-256 digests Y then
it may request all the microdescriptors listed in X but not Y,
by asking for the resource:
/tor/micro/diff/X/Y

Clients SHOULD only ask for this resource compressed.

Caches MUST NOT answer this request unless they recognize the
consensus with digest X, and digest Y.
digest Y. If answering, caches MUST reply with all of the
microdescriptors that the cache holds that were listed by
consensus X, and MUST omit all the microdescriptors that were
omitted listed in consensus Y.

2.2. By consensus:

If a client has fewer than NMNM% of the microdescriptors listed in a
consensus X, it should fetch the resource
/tor/micro/full/X

Clients SHOULD only ask for this resource compressed.

Caches MUST NOT answer this request unless they recognize the
consensus with digest X. They should send all the microdescriptors
they have that are listed in that consensus.

2.3. When to make these requests

Clients should decide to use this format in preference to the
old download-by-digest format if the consensus X lists their
preferred directory cache as using a new DirCache subprotocol
version. (See 5 below.)

3. Performance analysis

This is a back-of-the-envelope analysis using a month's worth of
consensus documents, and a randomly chosen sample of
microdescriptors.


On average, about 0.5% of the microdescriptors change between any
two consensuses. Call it 50. That means 50*43 bytes == 2150
bytes to request the microdescriptors. It means ~24530 bytes of
microdescriptors downloaded, compressed to ~13687 bytes by zstd.

With this proposal, we're down to 86 bytes for the request, and we
can precompute the compressed output, making it save to use lzma2,
getting a compressed result more like 13362.

It appears that this change would save about 15% for incremental
microdescriptor downloads, most of that coming from the reduction
in request size.

For complete downloads, a complete set of microdescriptors is about
7700 microdesciptors long. That makes the total number of bytes
for the requests 7700*43 == 331100 bytes. The response, if
compressed with lzma instead of zstd, would fall from 1659682 to
1587804 bytes, for a total savings of 20%.


5. Compatibility

Caches supporting this download protocol need to advertise
support of a new DirCache subprotocol version.
teor
2017-08-21 04:42:40 UTC
Permalink
Post by Nick Mathewson
Filename: 281-bulk-md-download.txt
Title: Downloading microdescriptors in bulk
Author: Nick Mathewson
Created: 11-Aug-2017
Status: Draft
1. Introduction
This proposal describes a ways to download more microdescriptors
at a time, using fewer bytes.
Right now, to download N microdescriptors, the client must send
about 44*N bytes in its HTTP request. Because clients can request
microdescriptors in any combination, the directory caches cannot
pre-compress responses to these requests, and need to use less
space-efficient on-the-fly compression algorithms.
Under this proposal, clients simply say "Send me the
microdescriptors I need", given what I know.
2. Combined microdescriptor downloads
2.1. By diff
If a client has a consensus with base64 sha3-256 digest X, and it
previously had a consensus with base64 sha3-256 digests Y then
it may request all the microdescriptors listed in X but not Y,
/tor/micro/diff/X/Y
Clients SHOULD only ask for this resource compressed.
Caches MUST NOT answer this request unless they recognize the
consensus with digest X, and digest Y.
digest Y.
Extra "digest Y. "
Post by Nick Mathewson
If answering, caches MUST reply with all of the
microdescriptors that the cache holds that were listed by
consensus X, and MUST omit all the microdescriptors that were
omitted listed in consensus Y.
What happens if the consensus versions are different?
In particular, what happens if the microdesc algorithms are different
in these consensus versions?

(What should happen is that the diff is larger than normal, because
most microdesc hashes have changed. We should have a test for this.)
Post by Nick Mathewson
If a client has fewer than NMNM% of the microdescriptors listed in a
consensus X, it should fetch the resource
/tor/micro/full/X
Clients SHOULD only ask for this resource compressed.
Caches MUST NOT answer this request unless they recognize the
consensus with digest X. They should send all the microdescriptors
they have that are listed in that consensus.
2.3. When to make these requests
Clients should decide to use this format in preference to the
old download-by-digest format if the consensus X lists their
preferred directory cache as using a new DirCache subprotocol
version. (See 5 below.)
Don't clients have 3 preferred directory caches?

What about fallback directory mirrors?

We don't care about diff/X/Y - there is no previous consensus.
But knowing when a fallback supports full/X could be handy.
Or do we deliberately want to use the legacy protocol to
bootstrap, so a single cache can't lie to us?

What about bridge clients?
Can they find out from the bridge descriptor?
Post by Nick Mathewson
3. Performance analysis
This is a back-of-the-envelope analysis using a month's worth of
consensus documents, and a randomly chosen sample of
microdescriptors.
On average, about 0.5% of the microdescriptors change between any
two consensuses. Call it 50. That means 50*43 bytes == 2150
bytes to request the microdescriptors. It means ~24530 bytes of
microdescriptors downloaded, compressed to ~13687 bytes by zstd.
With this proposal, we're down to 86 bytes for the request, and we
can precompute the compressed output, making it save to use lzma2,
getting a compressed result more like 13362.
It appears that this change would save about 15% for incremental
microdescriptor downloads, most of that coming from the reduction
in request size.
For complete downloads, a complete set of microdescriptors is about
7700 microdesciptors long. That makes the total number of bytes
for the requests 7700*43 == 331100 bytes. The response, if
compressed with lzma instead of zstd, would fall from 1659682 to
1587804 bytes, for a total savings of 20%.
5. Compatibility
Caches supporting this download protocol need to advertise
support of a new DirCache subprotocol version.
T
--
Tim Wilson-Brown (teor)

teor2345 at gmail dot com
PGP C855 6CED 5D90 A0C5 29F6 4D43 450C BA7F 968F 094B
ricochet:ekmygaiu4rzgsk6n
------------------------------------------------------------------------
Nick Mathewson
2017-08-28 15:32:21 UTC
Permalink
[...]
Post by teor
Post by Nick Mathewson
Caches MUST NOT answer this request unless they recognize the
consensus with digest X, and digest Y.
digest Y.
Extra "digest Y. "
Fixed, thanks!
Post by teor
Post by Nick Mathewson
If answering, caches MUST reply with all of the
microdescriptors that the cache holds that were listed by
consensus X, and MUST omit all the microdescriptors that were
omitted listed in consensus Y.
What happens if the consensus versions are different?
In particular, what happens if the microdesc algorithms are different
in these consensus versions?
(What should happen is that the diff is larger than normal, because
most microdesc hashes have changed. We should have a test for this.)
Right. I've tried to clarify. Now it says "(For the purposes of this proposal,
microdescriptors are "the same" if they are textually identical
and have the same digest.)"


[...]
Post by teor
Post by Nick Mathewson
2.3. When to make these requests
Clients should decide to use this format in preference to the
old download-by-digest format if the consensus X lists their
preferred directory cache as using a new DirCache subprotocol
version. (See 5 below.)
Don't clients have 3 preferred directory caches?
Edited to say additionally:

When a client has some preferred directory caches that support
this subprotocol and some that do not, it chooses one at random,
and uses these requests if that one supports this subprotocol.
Post by teor
What about fallback directory mirrors?
We don't care about diff/X/Y - there is no previous consensus.
But knowing when a fallback supports full/X could be handy.
Or do we deliberately want to use the legacy protocol to
bootstrap, so a single cache can't lie to us?
When using fallback mirrors, the client downloads the consensus before
downloading microdescriptors. Having downloaded the consensus, it
should know whether the cache supports this protocol.
Post by teor
What about bridge clients?
Can they find out from the bridge descriptor?
Yup; the subprotocol information is listed there too.


Thanks for the review!
--
Nick
Loading...