[tor-dev] Proposal 280: Privacy-Preseving Statistics with Privcount in Tor

Discussion:

Nick Mathewson

2017-08-07 17:47:33 UTC

Hi! Tim helped me write this this draft last week, and I shared it
with the PrivCount authors. I've already gotten some good comments
from Aaron, which I'll repost in a followup message, with his
permission.

================================

Filename: 280-privcount-in-tor.txt
Title: Privacy-Preseving Statistics with Privcount in Tor
Author: Nick Mathewson, Tim Wilson-Brown
Created: 02-Aug-2017
Status: Draft

0. Acknowledgments

Tariq Elahi, George Danezis, and Ian Goldberg designed and implemented
the PrivEx blinding scheme. Rob Jansen and Aaron Johnson extended
PrivEx's differential privacy guarantees to multiple counters in
PrivCount:

https://github.com/privcount/privcount/blob/master/README.markdown#research-background

Rob Jansen and Tim Wilson-Brown wrote the majority of the experimental
PrivCount code, based on the PrivEx secret-sharing variant. This
implementation includes contributions from the PrivEx authors, and
others:

https://github.com/privcount/privcount/blob/master/CONTRIBUTORS.markdown

1. Introduction and scope

PrivCount is a privacy-preserving way to collect aggregate statistics
about the Tor network without exposing the statistics from any single
Tor relay.

This document describes the behavior of the in-Tor portion of the
PrivCount system. It DOES NOT describe the counter configurations,
or any other parts of the system. (These will be covered in separate
proposals.)

2. PrivCount overview

Here follows an oversimplified summary of PrivCount, with enough
information to explain the Tor side of things. The actual operation
of the non-Tor components is trickier than described below.

All values in the scheme below are 64-bit unsigned integers; addition
and subtraction are modulo 2^64.

In PrivCount, a Data Collector (in this case a Tor relay) shares
numeric data with N different Tally Reporters. (A Tally Reporter
performs the summing and unblinding roles of the Tally Server and Share
Keeper from experimental PrivCount.)

All N Tally Reporters together can reconstruct the original data, but
no (N-1)-sized subset of the Tally Reporters can learn anything about
the data.

(In reality, the Tally Reporters don't reconstruct the original data
at all! Instead, they will reconstruct a _sum_ of the original data
across all participating relays.)

To share data, for each value X to be shared, the relay generates
random values B_1 though B_n, and shares each B_i secretly with a
single Tally Reporter. The relay then publishes Y = X + SUM(B_i) + Z,
where Z is a noise value taken at random from a gaussian distribution.
The Tally Reporters can reconstruct X+Z by securely computing SUM(B_i)
across all contributing Data Collectors. (Tally Reporters MUST NOT
share individual B_i values: that would expose the underlying relay
totals.)

In order to prevent bogus data from corrupting the tally, the Tor
relays and the Tally Reporters perform multiple "instances" of this
algorithm, randomly sampling in each relays. The relay sends multiple
Y values for each measurement, built with different sets of B_i.
These "instances" are numbered in order from 1 to R.

So that the system will still produce results in the event of a single
Tally Reporter failure, these instances are distributed across multiple
subsets of Tally Reporters.

Below we describe a data format for this.

3. The document format

This document format builds on the line-based directory format used
for other tor documents, described in Tor's dir-spec.txt.

Using this format, we describe two kinds of documents here: a
"counters" document that publishes all the Y values, and a "blinding"
document that describes the B_i values. But see "An optimized
alternative" below.

The "counters" document has these elements:

"privctr-dump-format" SP VERSION SP SigningKey

[At start, exactly once]

Describes the version of the dump format, and provides an ed25519
signing key to identify the relay. The signing key is encoded in
base64 with padding stripped. VERSION is "alpha" now, but should
be "1" once this document is finalized.

[[[TODO: Do we need a counter version as well?

Noise is distributed across a particular set of counters,
to provide differential privacy guarantees for those counters.
Reducing noise requires a break in the collection.
Adding counters is ok if the noise on each counter
monotonically increases. (Removing counters always reduces
noise.)

We also need to work out how to handle instances with mixed
Tor versions, where some Data Collectors report different
counters to other Data Collectors. (The blinding works if we
substitute zeroes for missing counters on Tally Reporters.
But we also need to add noise in this case.)

-teor
]]]

"starting-at" SP IsoTime

[Exactly once]

The start of the time period when the statistics here were
collected.

"ending-at" SP IsoTime

[Exactly once]

The end of the time period when the statistics here were
collected.

"num-instances" SP Number

[Exactly once]

The number of "instances" that the relay used (see above.)

"tally-reporter" SP Identifier SP Key SP InstanceNumbers

[At least twice]

The curve25519 public key of each Tally Reporter that the relay
believes in. (If the list does not match the list of
participating tally reporters, they won't be able to find the
relay's values correctly.) The identifiers are non-space,
non-nul character sequences. The Key values are encoded in
base64 with padding stripped; they must be unique within each
counters document. The InstanceNumbers are comma-separated lists
of decimal integers from 0 to (num-instances - 1), in ascending
order.

Keyword ":" SP Int SP Int SP Int ...

[Any number of times]

The Y values for a single measurement. There are num-instances
such Y values for each measurement. They are 64-bit unsigned
integers, expressed in decimal.

The "Keyword" denotes which measurement is being shared. Keyword
MAY be any sequence of characters other than colon, nul, space,
and newline, though implementators SHOULD avoid getting too
creative here. Keywords MUST be unique within a single document.
Tally Reporters MUST handle unrecognized keywords. Keywords MAY
appear in any order.

It is safe to send the blinded totals for each instance to every
Tally Reporter. To unblind the totals, a Tally Reporter needs:
* a blinding document from each relay in the instance, and
* the per-counter blinding sums from the other Tally Reporters
in their instance.

[[[TODO: But is it safer to create a per-instance counters
document? -- teor]]]

The semantics of individual measurements are not specified here.

"signature" SP Signature

[At end, exactly once]

The Ed25519 signature of all the fields in the document, from the
first byte, up to but not including the "signature" keyword here.
The signature is encoded in base64 with padding stripped.

The "blinding" document has these elements:

"privctr-secret-offsets" SP VERSION SP SigningKey

[At start, exactly once.]

The VERSION and SigningKey parameters are the same as for
"privctr-dump-format".

"instances" SP Numbers

[Exactly once]

The instances that this Tally Reporter handles.
They are given as comma-separated decimal integers, as in the
"tally-reporter" entry in the counters document. They MUST
match the instances listed in the counters document.

[[[TODO: this is redundant. Specify the constraint instead? --teor]]]

"num-counters" SP Number

[Exactly once]

The number of counters that the relay used in its counters
document. This MUST be equal to the number of keywords in the
counters document.

[[[TODO: this is redundant. Specify the constraint instead? --teor]]]

"tally-reporter-pubkey" SP Key

[Exactly once]

The curve25519 public key of the tally reporter who is intended
to receive an decrypt this document. The key is base64-encoded
with padding stripped.

"count-document-digest" SP "sha3" Digest NL
"-----BEGIN ENCRYPTED DATA-----" NL
Data
"-----END ENCRYPTED DATA-----" NL

[Exactly once]

The SHA3-256 digest of the count document corresponding to this
blinding document. The digest is base64-encoded with padding
stripped. The data encodes the blinding values (See "The
Blinding Values") below, and is encrypted to the tally reporter's
public key using the hybrid encryption algorithm described below.

"signature" SP Signature

[At end, exactly once]

The Ed25519 signature of all the fields in the document, from the
first byte, up to but not including the "signature" keyword here.
The signature is encoded in base64 with padding stripped.

4. The Blinding Values

The "Data" field of the blinding documents above, when decrypted,
yields a sequence of 64-bit binary values, encoded in network
(big-endian) order. There are C * R such values, where C is the number
of keywords in the count document, and R is the number of instances
that the Tally Reporter participates in. The client generates all of
these values uniformly at random.

For each keyword in the count document, in the order specified by the
count document, the decrypted data holds R*8 bytes for the specified
instance of that keyword's blinded counter.

For example: if the count document lists the keywords "b", "x", "g",
and "a" (in that order), and lists instances "0", and "2", then the
decrypted data will hold the blinding values in this order:
b, instance 0
b, instance 2
x, instance 0
x, instance 2
g, instance 0
g, instance 2
a, instance 0
a, instance 2

4. Implementation Notes

A relay should, when starting a new round, generate all the blinding
values and noise values in advance. The relay should then use these
values to compute Y_0 = SUM(B_i) + Z for each instance of each
counter. Having done this, the relay MUST encrypt the blinding values
to the public key of each tally reporter, and wipe them from memory.

5. The hybrid encryption algorithm

We use a hybrid encryption scheme above, where items can be encrypted
to a public key. We instantiate it as follows, using curve25519
public keys.

To encrypt a plaintext M to a public key PK1
1. the sender generates a new ephemeral keypair sk2, PK2.
2. The sender computes the shared diffie hellman secret
SEED = (sk2 * PK1).

3. The sender derives 64 bytes of key material as
SHAKE256(TEXT | SEED)[...64]
where "TEXT" is "Expand curve25519 for privcount encryption".

The first 32 bytes of this is an aes key K1;
the second 32 bytes are a mac key K2.

4. The sender computes a ciphertext C as AES256_CTR(K1, M)

5. The sender computes a MAC as
SHA3_256([00 00 00 00 00 00 00 20] | K2 | C)

6. The hybrid-encrypted text is PK2 | MAC | C.

6. An optimized alternative

As an alternative, the sequences of blinding values is NOT transmitted
to the tally reporters. Instead the client generates a single
ephemeral keypair sk_c, PK_c, and places the public key in its counts
document. It does this each time a new round begins.

For each tally reporter with public key PK_i, the client then does
the handshake sk_c * PK_i to compute SEED_i.

The client then generates the blinding values for that tally reporter
as SHAKE256(SEED_i)[...R*C*8].

After initializing the counters to Y_0, the client can discard the
blinding values and sk_c.

Later, the tally reporters can reconstruct the blinding values as
SHAKE256(sk_i * PK_c)[...]

This alternative allows the client to transmit only a single public
key, when previously it would need to transmit a complete set of
blinding factors for each tally reporter. Further, the alternative
does away with the need for blinding documents altogether. It is,
however, more sensitive to any defects in SHAKE256 than the design
above. Like the rest of this design, it would need rethinking if we
want to expand this scheme to work with anonymous data collectors,
such as Tor clients.

Nick Mathewson

2017-08-07 17:50:37 UTC

Permalink

[reposting this message with permission. It is a reply that I sent to
Aaron, where I quoted an email from him about this proposal. Tim and
Aaron had additional responses, which I'll let them quote here or not
as they think best.]

On Sat, Aug 5, 2017 at 1:38 PM, Aaron Johnson
<***@nrl.navy.mil> wrote:
[...]

- There are a couple of documents in PrivCount that are missing: the deployment document and the configuration document. These set up things like the identities/public keys of the parties, the planned time of the measurements, the statistics to be computed, the noise levels to use. They were required to be agreed on by all parties. These values must be agreed upon by all parties (in some cases, such as disagreement about noise, the security/privacy guarantees could otherwise fail). How do you plan to replace these?

So, I hadn't planned to remove these documents, so much as to leave
them out of scope for this proposal. Right now, in the code, there's
no actual way to configure any of these things.

Thinking aloud:

I think we should engineer that piece by piece. We already have the
consensus directory system as a way to communicate information that
needs to be securely updated, and where everybody needs to update at
once, so I'd like to reuse that to the extent that it's appropriate.

For some parts of it, I think we can use versions and named sets. For
other parts, we want to be flexible, so that we can rotate keys
frequently, react to tally reporters going offline, and so on. There
may need to be more than one distribution mechanism for this metainfo.

These decisions will also be application-dependent: I've been thinking
mainly of "always-on" applications, like network metrics, performance
measurement, anomaly-detection [*], and so on. But I am probably
under-engineering for
"time-limited" applications like short-term research experiments.

- I believe that instead of dealing with Tally Reporter (TR) failures using multiple subsets, you could instead simply use (t,n) secret sharing, which would survive any t-1 failures (but also allow any subset of size t to determine the individual DC counts). The DC would create one blinding value B and then use Shamir secret sharing to send a share of B to each TR. To aggregate, each TR would first add together its shares, which would yield a share of the sum of the blinding values from all DCs. Then the TRs could simply reconstruct that sum publicly, which, when subtracted from the public, blinded, noisy, counts would reveal the final noisy sum. This would be more efficient than having each TR publish multiple potential inputs to different subsets of TRs.

So, I might have misunderstood the purpose here : I thought that the
instances were to handle misbehaving DCs as well as malfunctioning
TRs.

- Storing at the DC the blinded values encrypted to the TRs seems to violate forward privacy in that if during the measurement the adversary compromises a DC and then later (even after the final release) compromises the key of a TR, the adversary could determine the state of the DC’s counter at the time of compromise. The also applies to the optimization in Sec. 6 where the blinding values where a shared secret is hashed to produce the blinding values.

Well, the adversary would need to compromise the key of _every_ TR in
at least one instance, or they couldn't recover the actual counters.

I guess we could, as in the original design (IIUC), send the encrypted
blinding values (or public DH key in sec 6) immediately from the DC
when it generates them, and then throw them away client-side. Now the
adversary would need to break into all the TRs while they were holding
these encrypted blinding values.

Or, almost equivalently, I think we could make the TR public
encryption keys only get used for one round. That's good practice in
general, and it's a direction I generally like.

And of course, DCs should use a forward-secure TLS for talking to the
TRs, so that an eavesdropper doesn't learn anything.

[*] One anomaly detection mechanism I've been thinking of is to look
at different "protocol-warn" log messages. These log messages
indicate that some third party is not complying with the protocol.
They're usually logged at info, since there's nothing an operator can
do about them, but it would be good for us to get notification if some
of them spike all of a sudden.

teor

2017-09-11 23:44:38 UTC

Permalink

Post by Nick Mathewson
[reposting this message with permission. It is a reply that I sent to
Aaron, where I quoted an email from him about this proposal. Tim and
Aaron had additional responses, which I'll let them quote here or not
as they think best.]

[Re-posting this edited thread with permission. It's a conversation that
continues on from the last re-post.]

Post by Nick Mathewson

Post by Nick Mathewson
...

So, I might have misunderstood the purpose here : I thought that the
instances were to handle misbehaving DCs as well as malfunctioning
TRs.

The mechanism you described (having each DC report different encrypted counters for different subsets of TRs) doesnât handle failed (i.e. crashed) DCs. To handle failed DCs in the scheme you describe (with the blinding values started encrypted in a document), you can just have the TRs agree on which DCs succeeded at the end of the measurement and only use blinding values from those DCs. So you donât need multiple TR subsets to handle failed DCs.

Each *subset* of DCs reports to a subset of the TRs.
This deals with malicious and outlying DC values, as well as failed DCs.
And it deals with failed TRs as well.

This seems unnecessary and inefficient. DC failures can be handled by the TRs at the end. TR failures can be handled using Shamir secret sharing.

1. The TRs (aka the SKs) only need to be online long enough to receive their blinding values, add them, and send the sum out. Therefore a measurement can recover from a failed TR if its blinding values are persistently stored somewhere and if *at any point* the TR can be restarted.

In the event of key compromise, or operator trust failure, or operator
opt-out, the TR can never be restarted (securely).

If these are real concerns, then you should use Shamir secret sharing across the TRs. Honestly, they seem unlikely to me, and the cost of missing one round of statistics seems low. However, the cost of dealing with them is also low, and so you might as well do it!

2. Handling DC failures is trivial. As mentioned above, the TRs simply wait until the end to determine which DCs succeeded and should have their blinding values included in the sum.

How would you do this securely?
Any scheme I think of allows a malicious TR to eliminate particular
relays.

A malicious TR can in any case eliminate a particular relay by destroying the outputs of any subsets containing that relay. Destroying an output is done by using a random value as the blinding value, making the output random (and likely obviously so). The privacy comes from the differentially private noise, and because TRs wonât agree on subsets that would reduce the added noise below the desired amount, the adversary couldnât break privacy by eliminate particular relays. Moreover, if you wanted, you could use a secure broadcast (e.g. the Dolev-Strong protocol) to enable the TRs to agree on the union of DCs that any one of the TRs received the counters documents from. Such a secure broadcast in used in PrivCount to get consensus on the the deployment and configuration documents.

Also, one thing I forgot to mention in my last email is that you have removed the Tally Server, which is an untrusted entity that essentially acts as a public bulletin board. Without such a collection point, who obtains the outputs of the TRs and computes the final result?

We'll work with Tor metrics to decide on a mechanism for taking the
counts from each TR subset, and turning them into a final count.
This would probably be some kind of median, possibly discarding
nonsensical values first.

If you plan to release multiple values from different DC subsets to handle nonsensical values, then you will have to increase the noise to handle the additional statistics. This can be done just as with handling DC failures: TRs agree on several DC subsets from among the DCs that didnât fail and then release Ã¥ blinding value sum for each subset. Note that DCs actually only need to send one set of blinding values and one set of counters to the TRs.

- Storing at the DC the blinded values encrypted to the TRs seems to violate forward privacy in that if during the measurement the adversary compromises a DC and then later (even after the final release) compromises the key of a TR, the adversary could determine the state of the DCâs counter at the time of compromise. The also applies to the optimization in Sec. 6 where the blinding values where a shared secret is hashed to produce the blinding values.

Post by Nick Mathewson
Well, the adversary would need to compromise the key of _every_ TR in
at least one instance, or they couldn't recover the actual counters.

Thatâs true.

Post by Nick Mathewson
I guess we could, as in the original design (IIUC), send the encrypted
blinding values (or public DH key in sec 6) immediately from the DC
when it generates them, and then throw them away client-side. Now the
adversary would need to break into all the TRs while they were holding
these encrypted blinding values.

Right, that is the original design and would provide a bit more forward security than in the current spec.

Post by Nick Mathewson
Or, almost equivalently, I think we could make the TR public
encryption keys only get used for one round. That's good practice in
general, and it's a direction I generally like.

That would work, too.

Post by Nick Mathewson
[*] One anomaly detection mechanism I've been thinking of is to look
at different "protocol-warn" log messages. These log messages
indicate that some third party is not complying with the protocol.
They're usually logged at info, since there's nothing an operator can
do about them, but it would be good for us to get notification if some
of them spike all of a sudden.

Really interesting idea! Rob and I are interested in looking for attacks on the Tor network using metrics as well. This kind of anomaly reminds of the RELAY_EARLY attack that you wrote a detector for.

Karsten Loesing

2017-09-12 18:27:12 UTC

Permalink

Hi Tim,

unfortunately, nobody from the metrics team can attend today's proposal
280 discussion in a few hours.

That's why we decided to provide some written feedback here.

We didn't find anything problematic in the proposal from the view of Tor
metrics.

This is due to the narrow scope covering only the communication protocol
between tally servers and relays, as we understand it.

All topics related to deriving counts, calculating final results, and
anything else that could affect currently running metrics code are
explicitly excluded or not mentioned.

If we misunderstood the scope and there is actually a part that covers
current or future metrics code, please let us know, and we'll check that
again.

Thanks for working on privacy-preserving statistics in Tor!

All the best,
Karsten

Post by teor