Discussion:
[tor-dev] Proposal: The move to two guard nodes
Mike Perry
2018-03-31 06:52:51 UTC
In-line below for ease of comment. Also available at:
https://gitweb.torproject.org/user/mikeperry/torspec.git/tree/proposals/xxx-two-guard-nodes.txt?h=twoguards

===========================

Filename: xxx-two-guard-nodes.txt
Title: The move to two guard nodes
Author: Mike Perry
Created: 2018-03-22
Supersedes: Proposal 236

0. Background

Back in 2014, Tor moved from three guard nodes to one guard node[1,2,3].

We made this change primarily to limit points of observability of entry
into the Tor network for clients and onion services, as well as to
reduce the ability of an adversary to track clients as they move from
one internet connection to another by their choice of guards.


1. Proposed changes

1.1. Switch to two guards per client

When this proposal becomes effective, clients will switch to using
two guard nodes. The guard node selection algorithms of Proposal 271
will remain unchanged. Instead of having one primary guard "in use",
Tor clients will always use two.

This will be accomplished by setting the guard-n-primary-guards-to-use
consensus parameter to 2, as well as guard-n-primary-guards to 2.
(Section 3.1 covers the reason for both parameters). This is equivalent
to using the torrc option NumEntryGuards=2, which can be used for
testing behavior prior to the consensus update.

1.2. Enforce Tor's path restrictions across this guard layer

In order to ensure that Tor can always build circuits using two guards
without resorting to a third, they must be chosen such that a path
satisfying Tor's path restrictions can still be built with at least one
of them, regardless of the other nodes in the path.

In other words, we must ensure that both guards are not chosen from the
same /16 or the same node family. In this way, Tor will always be able to
build a path using these guards, preventing the use of a third guard.
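
As a rough illustration of the check described in 1.2, here is a minimal
Python sketch. The Relay record, its field names, and the rule that family
membership must be declared by both relays are assumptions made for this
example; they are not Tor's actual data structures or code.

```python
import ipaddress
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Relay:
    # Hypothetical, simplified relay record (illustration only).
    fingerprint: str
    ipv4: str
    # Fingerprints this relay declares as its family (assumed representation).
    family: frozenset = field(default_factory=frozenset)


def same_slash16(a: Relay, b: Relay) -> bool:
    """True if both IPv4 addresses fall inside the same /16 network."""
    net_a = ipaddress.ip_network(f"{a.ipv4}/16", strict=False)
    return ipaddress.ip_address(b.ipv4) in net_a


def same_family(a: Relay, b: Relay) -> bool:
    """Assumption: two relays are family only if each lists the other."""
    return b.fingerprint in a.family and a.fingerprint in b.family


def conflicts(a: Relay, b: Relay) -> bool:
    """True if the two relays could not appear together in one circuit."""
    return (
        a.fingerprint == b.fingerprint
        or same_slash16(a, b)
        or same_family(a, b)
    )


def valid_guard_pair(g1: Relay, g2: Relay) -> bool:
    """Section 1.2: the two guards must not share a /16 or a family."""
    return not conflicts(g1, g2)
```

A client following this sketch would reject a candidate second guard for
which valid_guard_pair() returns False and try the next candidate instead.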


2. Discussion

2.1. Why two guards?

The main argument for switching to two guards is that because of Tor's
path restrictions, we're already using two guards, but we're using them
in a suboptimal and potentially dangerous way.

Tor's path restrictions enforce the condition that the same node cannot
appear twice in the same circuit, nor can nodes from the same /16 subnet
or node family be used in the same circuit.

Tor's paths are also built such that the exit node is chosen first and
held fixed during guard node choice, as are the IP, HSDIR, and RPs for
onion services. This means that whenever one of these nodes happens to
be the guard[4], or be in the same /16 or node family as the guard, Tor
will build that circuit using a second "primary" guard, as per proposal
271[7].

Worse still, the choice of RP, IP, and exit can all be controlled by an
adversary (to varying degrees), enabling them to force the use of a
second guard at will.

Because this happens somewhat infrequently in normal operation, a fresh
TLS connection will typically be created to the second "primary" guard,
and that TLS connection will be used only for the circuit for that
particular request. This property makes all sorts of traffic analysis
attacks easier, because this TLS connection will not benefit from any
multiplexing.

This is more serious than traffic injection via an already in-use
guard because the lack of multiplexing means that the data retention
level required to gain information from this activity is very low, and
may exist for other reasons. To gain information from this behavior, an
adversary needs only connection 5-tuples + timestamps, as opposed to
detailed timeseries data that is polluted by other concurrent activity
and padding.

In the most severe form of this attack, the adversary can take a suspect
list of Tor client IP addresses (or the list of all Guard node IP addresses)
and observe when secondary Tor connections are made to them while the
adversary cycles through all guards as RPs for connections to an onion
service. This adversary does not require collusion on the part of observers
beyond the ability to provide 5-tuple connection logs (which ISPs may retain
for reasons such as netflow accounting, IDS, or DoS protection systems).

A fully passive adversary can also make use of this behavior. Clients
unlucky enough to pick guard nodes in heavily used /16s or in large node
families will tend to make use of a second guard more frequently even
without effort from the adversary. In these cases, the lack of
multiplexing also means that observers along the path to this secondary
guard gain more information per observation.

2.2. Why not MOAR guards?

We do not want to increase the number of observation points for client
activity into the Tor network[1]. We merely want better multiplexing for
the cases where this already happens.

2.3. Can you put some numbers on that?

The Changing of the Guards[13] paper studies this from a few different
angles, but one of the crucially missing graphs is how long a client
can expect to run with N guards before it chooses a malicious guard.

However, we do have tables in section 3.2.1 of proposal 247 that cover
this[14]. There are three tables there: one for a 1% adversary, one for
a 5% adversary, and one for a 10% adversary. You can see the probability
of adversary success for one and two guards in terms of the number of
rotations needed before the adversary's node is chosen. Not surprisingly,
the two guard adversary gets to compromise clients roughly twice as
quickly, but the timescales are still rather large even for the 10%
adversary: they have only a 50% chance of success after 4 rotations, which
will take about 14 months with Tor's 3.5 month guard rotation.
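
The figures above can be roughly reproduced with a simplified model. The
Python sketch below assumes that every rotation is a fresh, independent
pick of each guard and that the adversary's share of guard selection
probability is a fixed fraction. It ignores churn and is not the
calculation behind the proposal 247 tables; the function names are made
up for this example.

```python
def compromise_probability(adversary_fraction: float,
                           guards_per_client: int,
                           rotations: int) -> float:
    """Chance that at least one adversary guard was picked so far."""
    picks = guards_per_client * rotations
    return 1.0 - (1.0 - adversary_fraction) ** picks


def rotations_until(threshold: float,
                    adversary_fraction: float,
                    guards_per_client: int) -> int:
    """Smallest number of rotations at which the chance reaches threshold."""
    if not (0.0 < adversary_fraction <= 1.0 and 0.0 <= threshold < 1.0):
        raise ValueError("need 0 < adversary_fraction <= 1 and 0 <= threshold < 1")
    n = 1
    while compromise_probability(adversary_fraction, guards_per_client, n) < threshold:
        n += 1
    return n


ROTATION_MONTHS = 3.5  # guard rotation period quoted in the text above

if __name__ == "__main__":
    for guards in (1, 2):
        n = rotations_until(0.5, 0.10, guards)
        print(guards, "guard(s):", n, "rotations,", n * ROTATION_MONTHS, "months")
```

Under these assumptions, a 10% adversary first crosses the 50% mark at the
4th rotation with two guards (about 14 months) and at the 7th rotation
with one guard.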

2.4. What about guard fingerprinting?

More guards also means more fingerprinting[8]. However, even one guard
may be enough to fingerprint a user who moves around in the same area,
if that guard is low bandwidth or there are not many Tor users in that
area.

Furthermore, our use of separate directory guards (and three of them)
means that we're not really changing the situation much with the
addition of another regular guard. Right now, directory guard use alone
is enough to track all Tor users across the entire world.

While the directory guard problem could be fixed[12] (and should be
fixed), it is still the case that another mechanism should be used for
the general problem of guard-vs-location management[9].


3. Alternatives

There are two other solutions that also avoid the use of a secondary
guard in the path restriction case.

3.1. Eliminate path restrictions entirely

If Tor decided to stop enforcing the /16 and node family restrictions,
and also allowed the guard node to be chosen twice in the path, then
under normal conditions, it should retain the use of its primary guard.

This approach is not as extreme as it seems on its face. In fact, it is hard
to come up with arguments against removing these restrictions. Tor's
/16 restriction is of questionable utility against monitoring, and it can
be argued that since only good actors use node family, it gives influence
over path selection to bad actors in ways that are worse than the benefit
it provides to paths through good actors[10,11].

However, while removing path restrictions will solve the immediate
problem, it will not address other instances where Tor temporarily opts
to use a second guard due to congestion, OOM, or failure of its primary
guard, and we're still running into bugs where this can be adversarially
controlled or just happen randomly[5].

While using two guards means twice the surface area for these types of
bugs, it also means that instances where they happen simultaneously on
both guards (thus forcing a third guard) are much less likely than with
just one guard. (In the passive adversary model, suppose that the first
guard fails at any point with probability P1 and the second with
probability P2. If we assume that such passive failures are independent
events, both guards would fail concurrently with probability P1*P2. Even
if the events are correlated, the maximum chance of concurrent failure
is still MIN(P1,P2).)
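
To make the concurrent-failure argument concrete, here is a small Python
sketch. It takes per-guard failure probabilities p1 and p2 (hypothetical
inputs, not measured values) and returns the independent-case estimate
together with the worst-case bound under arbitrary correlation.

```python
def concurrent_failure_bounds(p1: float, p2: float) -> tuple:
    """Return (independent estimate, worst-case bound) for both guards failing."""
    for p in (p1, p2):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    independent = p1 * p2      # failures assumed independent
    worst_case = min(p1, p2)   # upper bound even if fully correlated
    return independent, worst_case
```

For example, with p1 = 0.05 and p2 = 0.02 the independent estimate is
about 0.001, while the correlated worst case is 0.02, which is still no
worse than the less reliable guard failing alone.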

Note that for this analysis to hold, we have to ensure that nodes that
are at RESOURCELIMIT or otherwise temporarily unresponsive do not cause
us to consider other primary guards beyond the two we have chosen.
This is accomplished by setting guard-n-primary-guards to 2 (in addition
to setting guard-n-primary-guards-to-use to 2). With this parameter
set, the proposal 271 algorithm will avoid considering more than our two
guards, unless *both* are down at once.

3.2. No Guard-flagged nodes as exit, RP, IP, or HSDIRs

Similar to 3.1, we could instead forbid the use of Guard-flagged nodes
for the exit, IP, RP, and HSDIR positions.

This solution has two problems: First, like 3.1, it also does not handle
the case where resource exhaustion could force the use of a second
guard. Second, it requires clients to upgrade to the new behavior and
stop using Guard flagged nodes before it can be deployed.


4. The future is confluxed

An additional benefit of using a second guard is that it enables us to
eventually use conflux[6].

Conflux works by giving circuits a 256-bit cookie that is sent to the
exit/RP, and circuits that are then built to the same exit/RP with the
same cookie can then be fused together. Throughput estimates are used to
balance traffic between these circuits, depending on their performance.
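
The cookie-and-balancing idea can be sketched in a few lines of Python.
This is only a toy model: the class names, the "legs" terminology, and the
choice to split traffic in direct proportion to the throughput estimates
are assumptions for this example, not the scheme specified in the conflux
paper[6] or any Tor code.

```python
import secrets
from dataclasses import dataclass, field


@dataclass
class Leg:
    # One circuit to the shared exit/RP, entering through a given guard.
    guard: str
    throughput_estimate: float  # hypothetical unit, e.g. cells per second


@dataclass
class ConfluxSet:
    # 32 random bytes = the 256-bit cookie shared with the exit/RP.
    cookie: bytes = field(default_factory=lambda: secrets.token_bytes(32))
    legs: list = field(default_factory=list)

    def link(self, presented_cookie: bytes, leg: Leg) -> bool:
        """Fuse a new circuit into the set only if it presents the same cookie."""
        if not secrets.compare_digest(presented_cookie, self.cookie):
            return False
        self.legs.append(leg)
        return True

    def traffic_shares(self) -> dict:
        """Assumed policy: split traffic in proportion to estimated throughput."""
        total = sum(leg.throughput_estimate for leg in self.legs)
        if total <= 0:
            return {}
        return {leg.guard: leg.throughput_estimate / total for leg in self.legs}
```

With one leg estimated at 300 and another at 100, this policy would send
three quarters of the traffic over the first leg and one quarter over the
second.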

We have unfortunately signaled to the research community that conflux is
not worth pursuing, because of our insistence on a single guard. While
not relevant to this proposal (indeed, conflux requires its own proposal
and also concurrent research), it is worth noting that whichever way we
go here, the door remains open to conflux because of its utility against
similar issues.

If our conflux implementation includes packet acking, then circuits can
still survive the loss of one guard node due to DoS, OOM, or other
failures because the second half of the path will remain open and
usable (see the probability of concurrent failure arguments in Section
3.1).

If exits remember this cookie for a short period of time after the last
circuit is closed, the technique can be used to protect against
DoS/OOM/guard downtime conditions that take down both guard nodes or
destroy many circuits to confirm both guard node choices. In these
cases, circuits could be rebuilt along an alternate path and resumed
without end-to-end circuit connectivity loss. This same technique will
also make things like ephemeral bridges (i.e. Snowflake/Flashproxy) more
usable, because bridge uptime will no longer be so crucial to usability.
It will also improve mobile usability by allowing us to resume
connections after mobile Tor apps are briefly suspended, or if the user
switches between cell and wifi networks.

Furthermore, it is likely that conflux will also be useful against traffic
analysis and congestion attacks. Since the load balancing is dynamic and
hard to predict by an external observer and also increases overall
traffic multiplexing, traffic correlation and website traffic
fingerprinting attacks will become harder, because the adversary can no
longer be sure what percentage of the traffic they have seen (depending
on their position and other potential concurrent activity). Similarly,
it should also help dampen congestion attacks, since traffic will
automatically shift away from a congested guard.



References:

1. https://blog.torproject.org/improving-tors-anonymity-changing-guard-parameters
2. https://trac.torproject.org/projects/tor/ticket/12206
3. https://gitweb.torproject.org/torspec.git/tree/proposals/236-single-guard-node.txt
4. https://trac.torproject.org/projects/tor/ticket/14917
5. https://trac.torproject.org/projects/tor/ticket/25347#comment:14
6. https://www.cypherpunks.ca/~iang/pubs/conflux-pets.pdf
7. https://gitweb.torproject.org/torspec.git/tree/proposals/271-another-guard-selection.txt
8. https://trac.torproject.org/projects/tor/ticket/9273#comment:3
9. https://tails.boum.org/blueprint/persistent_Tor_state/
10. https://trac.torproject.org/projects/tor/ticket/6676#comment:3
11. https://bugs.torproject.org/15060
12. https://trac.torproject.org/projects/tor/ticket/10969
13. https://www.freehaven.net/anonbib/cache/wpes12-cogs.pdf
14. https://gitweb.torproject.org/torspec.git/tree/proposals/247-hs-guard-discovery.txt#n179
--
Mike Perry
Nick Mathewson
2018-04-03 18:30:25 UTC
Post by Mike Perry
https://gitweb.torproject.org/user/mikeperry/torspec.git/tree/proposals/xxx-two-guard-nodes.txt?h=twoguards
===========================
Filename: xxx-two-guard-nodes.txt
Title: The move to two guard nodes
Author: Mike Perry
Created: 2018-03-22
Supersedes: Proposal 236
Added as proposal 291!
George Kadianakis
2018-04-10 15:33:58 UTC
Post by Mike Perry
https://gitweb.torproject.org/user/mikeperry/torspec.git/tree/proposals/xxx-two-guard-nodes.txt?h=twoguards
===========================
Filename: xxx-two-guard-nodes.txt
Title: The move to two guard nodes
Author: Mike Perry
Created: 2018-03-22
Supersedes: Proposal 236
<snip>
3.1. Eliminate path restrictions entirely
If Tor decided to stop enforcing /16, node family, and also allowed the
guard node to be chosen twice in the path, then under normal conditions,
it should retain the use of its primary guard.
This approach is not as extreme as it seems on face. In fact, it is hard
to come up with arguments against removing these restrictions. Tor's
/16 restriction is of questionable utility against monitoring, and it can
be argued that since only good actors use node family, it gives influence
over path selection to bad actors in ways that are worse than the benefit
it provides to paths through good actors[10,11].
However, while removing path restrictions will solve the immediate
problem, it will not address other instances where Tor temporarily opts
to use a second guard due to congestion, OOM, or failure of its primary
guard, and we're still running into bugs where this can be adversarially
controlled or just happen randomly[5].
Hello Mike,

IMO we should not portray removing the above path restrictions as
something extreme, until we have good evidence that those path
restrictions offer something positive in the cases we are
examining. Personally, I see this proposal's effect of making Sybil
attacks twice as fast (section 2.3) as an equally radical result.

That said, I feel that this proposal is valuable and I'm not trying to
say that I don't like this proposal, or that I don't buy the
arguments. I'm trying to say that I don't know how to weigh the
tradeoffs here so that I gain confidence, because I'm not sure how
people are trying to attack Tor clients right now.

The way I see it is that if we adopt this proposal:
+ We are better defended against active attacks like congestion attacks
and OOM/DoS attacks.
+ We improve network health by reducing congestion to certain guards.
- Sybil attacks can be performed two times more quickly.

IMO, we should not rush this decision for 034, given that it's a
consensus parameter change that can happen instantaneously. However, we
should do the following soon:

1) Accept that there is no single best guard topology, and fix our
codebase to work well with either one guard or two guards, so that we
are ready for when we flip the switch. Perhaps we can fix
#25753/#25705/etc. in a way that works well both now and in the
2-guard future?

2) Investigate our current prop#271 codebase and make sure that the
paragraph below will work as intended if we do this proposal.

3) Involve more people in this (Roger, NRL, etc.) and have them think
about this, to gain more confidence.

Do you think this approach is too slow or backwards?
Post by Mike Perry
Note that for this analysis to hold, we have to ensure that nodes that
are at RESOURCELIMIT or otherwise temporarily unresponsive do not cause
us to consider other primary guards beyond the two we have chosen.
This is accomplished by setting guard-n-primary-guards to 2 (in addition
to setting guard-n-primary-guards-to-use to 2). With this parameter
set, the proposal 271 algorithm will avoid considering more than our two
guards, unless *both* are down at once.
OK, the above paragraph is basically the juice of this proposal! I spent
all day today investigating how this would work! The results are very
positive, but also not 100% straightforward because of the various
intricacies of prop#271.

[First of all, there is no way to simulate the above topology using the
config file because if you set NumEntryGuards=2 in your torrc, Tor will
set up 4 primary guards because of the way get_n_primary_guards()
works. So I hacked my Tor client to *have* 2 primary guards
(guard-n-primary-guards), and *use* 2 primary guards
(guard-n-primary-guards-to-use).]

The good part: This topology works exactly how the proposal wants it to
work. Because of the way primary guards work, you will have 2 primary
guards, and if one of them goes down you will always use the other
primary, instead of falling back to a third guard. That's excellent, but
it also abuses the primary guard feature: in a good way, but not in the
way we were intending it to be used.
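
A toy Python model of the behavior described above may help. It is only a
sketch: guards_in_use() and its arguments are made up for this example,
and it ignores most of prop#271 (confirmed guards, retry timers, the
UPDATE_WAITING logic). The n_primary and n_to_use arguments stand in for
guard-n-primary-guards and guard-n-primary-guards-to-use.

```python
def guards_in_use(sampled, reachable, n_primary=2, n_to_use=2):
    """Return the guards a client would use in this simplified model.

    sampled   -- guard names in priority order (highest first)
    reachable -- set of guard names that currently accept connections
    """
    primaries = sampled[:n_primary]
    usable = [g for g in primaries if g in reachable]
    if usable:
        # Stay within the primary set as long as any primary is reachable.
        return usable[:n_to_use]
    # Only when every primary is down do we look further down the list.
    fallback = [g for g in sampled[n_primary:] if g in reachable]
    return fallback[:1]


if __name__ == "__main__":
    sampled = ["g1", "g2", "g3", "g4"]
    print(guards_in_use(sampled, {"g2", "g3", "g4"}))               # g1 down
    print(guards_in_use(sampled, {"g2", "g3", "g4"}, n_primary=3))  # g1 down
```

In this model, with two primaries and g1 unreachable, the client stays on
g2 alone. With three primaries it would start using g3 as well, which is
the transition to a third guard that the proposal wants to avoid.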

Here are the side-effects from this abuse:

- By reducing the number of primaries from three to two, it's more
likely that all primaries can be down at a given time. Prop#271 was
written with an inherent assumption that one of the primaries will
always be reachable, because when all of them are down the code goes
into an "oh shit! bad reachability!" mode which was mainly designed
for network-down scenarios (like no-internet-land, or tunnels).

I'm referring to the UPDATE_WAITING section of prop#271 and
entry_guards_upgrade_waiting_circuits() in our codebase which takes
care of this situation. This behavior will basically delay circuits on
non-primary guards until a primary guard goes online. You can test
this behavior by blocking connections to all your primaries using
iptables. I did this today, and while Tor worked fine after some time,
there were delays and broken circuits. It's very likely we can
optimize this behavior if we want, so this is not really a blocker for
this proposal, but something we should think about and experiment
with...

We might also want to consider writing code to block clients from
skipping to lower-priority primary guards if higher-priority primary
guards are still reachable and guard-n-primary-guards-to-use > 1, so
that we can have more primary guards than we need without skipping
them when one of them goes down. That would allow us to get both the
effect of prop#291 while maintaining the original use of primary guards.

- If we set the number of primary guards to 2 and we leave
NumDirectoryGuards at 3, then NumDirectoryGuards will not work as
intended, and we will actually always use our two primary guards for
dirinfo as long as one of them is reachable. This is not a huge
problem, and might be a feature, but not the way we were intending to
use NumDirectoryGuards (see #13908 and
https://lists.torproject.org/pipermail/tor-dev/2014-May/006820.html).

Other than the above side-effects, Tor worked fine all day and only
connected to the primary guards, even when I blocked connections to one
of them. It was actually quite nice to see!

---

Hope this was useful and let me know if you have questions!
George Kadianakis
2018-04-12 16:25:39 UTC
Post by Mike Perry
https://gitweb.torproject.org/user/mikeperry/torspec.git/tree/proposals/xxx-two-guard-nodes.txt?h=twoguards
===========================
Filename: xxx-two-guard-nodes.txt
Title: The move to two guard nodes
Author: Mike Perry
Created: 2018-03-22
Supersedes: Proposal 236
<snip>
3.1. Eliminate path restrictions entirely
If Tor decided to stop enforcing /16, node family, and also allowed the
guard node to be chosen twice in the path, then under normal conditions,
it should retain the use of its primary guard.
This approach is not as extreme as it seems on face. In fact, it is hard
to come up with arguments against removing these restrictions. Tor's
/16 restriction is of questionable utility against monitoring, and it can
be argued that since only good actors use node family, it gives influence
over path selection to bad actors in ways that are worse than the benefit
it provides to paths through good actors[10,11].
However, while removing path restrictions will solve the immediate
problem, it will not address other instances where Tor temporarily opts
to use a second guard due to congestion, OOM, or failure of its primary
guard, and we're still running into bugs where this can be adversarially
controlled or just happen randomly[5].
Seems like the above paragraph is our main argument against removing
path restrictions.

Might be worth pointing out that if congestion/OOM attacks are in our
threat model against the current single guard design, then the same
adversary can force prop#291 to open a connection to the *third* guard
by first doing an OOM/congestion attack against one of your first two
guards, and then pushing you to your third guard using a path
restriction attack (#14917).

Thought that I should mention that because it might be an argument for
both moving to two guards and also lifting some path restrictions...
Roger Dingledine
2018-04-13 06:04:09 UTC
Post by Mike Perry
The main argument for switching to two guards is that because of Tor's
path restrictions, we're already using two guards, but we're using them
in a suboptimal and potentially dangerous way.
Tor's path restrictions enforce the condition that the same node cannot
appear twice in the same circuit, nor can nodes from the same /16 subnet
or node family be used in the same circuit.
Tor's paths are also built such that the exit node is chosen first and
held fixed during guard node choice, as are the IP, HSDIR, and RPs for
onion services. This means that whenever one of these nodes happens to
be the guard[4], or be in the same /16 or node family as the guard, Tor
will build that circuit using a second "primary" guard, as per proposal
271[7].
Worse still, the choice of RP, IP, and exit can all be controlled by an
adversary (to varying degrees), enabling them to force the use of a
second guard at will.
I agree with you that we should do something about this bug, where Tor
clients will switch to a rarely used guard in some situations. Our fix
from ticket #14917 was not a good fix. More on that below in Section 3.1.
Post by Mike Perry
Not surprisingly,
the two guard adversary gets to compromise clients roughly twice as
quickly, but the timescales are still rather large even for the 10%
adversary: they only have 50% chance of success after 4 rotations, which
will take about 14 months with Tor's 3.5 month guard rotation.
Three thoughts here:

(A) You're right, 14 months doesn't sound bad here.

(B) This calculation was ignoring churn, right? That is, guards going
away before you wanted to rotate from them. So another way to phrase that
would be "once eight of your guards have gone away, you're in bad shape"?
Looking at it that way, it seems like two guards is more than twice
as scary as one, since *either* of them going away moves you one step
closer on the path. Not the end of the world, but worth noticing. And
maybe partially solvable by your "when one of your two goes away, stick
to the remaining one" design; more on that below.

(C) Similarly, we should be sure to remember the network adversary
here too. I don't know a simple way to reason about it well. Using more
guards over time could be *less* than twice as scary, because sometimes
the network paths overlap so you don't expose as much new surface area
as you might have. And using more guards over time could be *more*
than twice as scary, if the question is whether your traffic ever goes
over that one bad place, since you have an exponentially low chance to
*never* pick a guard where your traffic to/from that guard travels over
the bad place. It really depends on your location, the guard locations,
the Internet topology, and a bunch of other confusing factors.
Post by Mike Perry
Furthermore, our use of separate directory guards (and three of them)
means that we're not really changing the situation much with the
addition of another regular guard. Right now, directory guard use alone
is enough to track all Tor users across the entire world.
Shit, you're right. The guard set fingerprint issue remains right now,
because we never solved the directory guard side of it. :(
Post by Mike Perry
While the directory guard problem could be fixed[12] (and should be
fixed), it is still the case that another mechanism should be used for
the general problem of guard-vs-location management[9].
The part that freaks me out about all the designs I've seen here is the
attack where the local adversary advertises a series of local wireless
addresses, first to make you keep generating new guard contexts (similar
to forcing quick guard rotation), or second to guess-and-check whether
you've already got a guard context for some wireless address in the next
city over. Maybe it can be solved by proper UI ("we'll just delegate
the decision to the user"), but hoo boy. But that's a separate proposal
fortunately. :)
Post by Mike Perry
3.1. Eliminate path restrictions entirely
I'm increasingly a fan of this option, the more I read these threads.

Let's examine the two attacker assumptions behind two of the attacks
we're worried about.

Attack one: the client's local ISP collects coarse netflow logs, and these
logs aren't detailed enough to allow a traffic volume detection attack on
an existing long-lived TLS flow, so the connection to that first guard
is safe; but a connection to that second guard will be unusual and not
multiplexed and at exactly the time of the adversary-controlled circuit
that triggered it, so that second guard, because it is used so rarely,
is dangerous to use.

Attack two: if the client uses its guard as the first hop of its circuit
and also the adversary-requested fourth hop, then the guard can do
pairwise traffic correlation attacks on all of its circuits and realize
that these two circuits it has are really two pieces of the same circuit.

This second attack seems weird to me. One reason is because in attack
one we're brushing aside the traffic analysis as hard, whereas in attack
two we're assuming it's trivial and perfect. But the simpler reason is:
if your guard is going to participate in a traffic correlation attack
against you, then it could just as easily team up with some other relay
that the adversary picked. That is, avoiding reusing your guard on the
other end of the circuit isn't going to save you if your guard is out
to get you.

Part of why it's hard to compare these two attacks directly is because
one is a client-side-observer adversary and the other is a relay-level
adversary.

Let's look at "attack one" from a relay-level-adversary perspective:
if your first guard is bad, you're screwed already. But if that second
guard might be bad, you really want to do anything you can do to not
reach out to it even once.

And "attack two" from the client-side-observer-level-adversary
perspective: well, if the attacker is watching the *client*, there's
no visible hint that it's reusing its guard later in the path -- and
that's the whole point. But if the attacker is watching the *relay*, then
suddenly we don't have as much diversity of traffic location as we thought
we had. That is, even if your relay is nice, somebody watching the relay's
network could do the pairwise correlation attacks we described earlier.

Another part of what bothers me about attack two -- the one where the
adversary gives you your fourth hop -- is that the adversary has *other*
hops in their side of the circuit, and you don't even know about them.
What if they chose your guard for their middle hop? Or for *their*
guard? There's nothing you can do about those cases, because you can't
know that they're happening. My conclusion is that if we can't solve
significant instances of this attack, we should be wary of paying a
large price to solve only a piece of it.
Post by Mike Perry
If Tor decided to stop enforcing /16, node family, and also allowed the
guard node to be chosen twice in the path, then under normal conditions,
it should retain the use of its primary guard.
To be clear, the design I've been considering here is simply allowing
reuse between the guard hop and the final hop, when it can't be avoided. I
don't mean to allow the guard (or its family) to show up as all four
hops in the path. Is that the same as what you meant, or did you mean
something more thorough?

I think "can't be avoided" means HSDir, IP, RP -- which I note are all
onion service related circuits.

I'd like to hear more about the "cleverly crafted exit policy" attack, and
I wonder if we can't solve that differently. For example, if it's about
making you do a request to a port that only one exit relay allows, and
ha ha whoops your guard was on the same /16 as that exit relay... maybe
it's time for the dir auths to not advertise super rare ports? This was
one of the topics in the users-get-routed paper too.

One non-starter idea would be to move onion-service-related Tors to two
guards, and leave other Tors at one guard. It's a non-starter because of
course advertising which you are to your local network is no good. But
that idea gave me a different perspective on this discussion: I wonder
how much this design decision comes down to making all Tors use two
guards in order to protect the onion-service-related Tors, which are
the only ones who actually need it?
Post by Mike Perry
This approach is not as extreme as it seems on face. In fact, it is hard
to come up with arguments against removing these restrictions. Tor's
/16 restriction is of questionable utility against monitoring, and it can
be argued that since only good actors use node family, it gives influence
over path selection to bad actors in ways that are worse than the benefit
it provides to paths through good actors[10,11].
Yep.

One remaining feature for MyFamily though is that relay operators can say
"No, even though I run these eight relays, I'm not in a position to do
traffic correlation attacks on users, because I told the users to not
put me in that position." This angle of the feature is about protecting
relays, not about protecting clients.
Post by Mike Perry
However, while removing path restrictions will solve the immediate
problem, it will not address other instances where Tor temporarily opts
to use a second guard due to congestion, OOM, or failure of its primary
guard, and we're still running into bugs where this can be adversarially
controlled or just happen randomly[5].
I continue to think we need to fix these. I'm glad to see that George
has been putting some energy into looking more at them. The bugs that
we don't understand are especially worrying, since it's hard to know
how bad they are. Moving to two guards might put a bit of a bandaid on
the issues, but it can't be our long-term plan for fixing them.
Post by Mike Perry
Note that for this analysis to hold, we have to ensure that nodes that
are at RESOURCELIMIT or otherwise temporarily unresponsive do not cause
us to consider other primary guards beyond the two we have chosen.
This is accomplished by setting guard-n-primary-guards to 2 (in addition
to setting guard-n-primary-guards-to-use to 2). With this parameter
set, the proposal 271 algorithm will avoid considering more than our two
guards, unless *both* are down at once.
I like this general idea of not immediately replacing guards so long as
you have a working one. In fact, we used to do something similar back
in the day:
https://blog.torproject.org/improving-tors-anonymity-changing-guard-parameters
says (emphasis mine)
"""
Tor 0.2.3's entry guard behavior is "choose three guards, ***adding
another one if two of those three go down*** but going back to the
original ones if they come back up, and also throw out (aka rotate)
a guard 4-8 weeks after you chose it."
"""

There are still some fiddly decisions to make here. For example, as you
say we probably shouldn't replace a guard just because we failed to
connect to one of our guards once. We might decide that it's time to add
a new second guard if the consensus tells us that one of them is down
(so we have confirmation that it isn't down for just us, it's down for
everybody). Or we might decide to wait on adding a new one even if it
really is down, because maybe it'll come back soon. But how long do
we wait? And if, while we're down to one, we encounter one of these
situations where the requested fourth hop overlaps with our remaining
guard, what do we do?

In fact, here's a hopefully useful insight that I've just realized:
you're not concerned about one guard vs two guards, you're concerned
about *transitioning* between guards. It's that moment when you're
starting to use a new guard, if the attacker can observe that you're
doing it, and especially if the attacker can make you do it, that is
vulnerable. And starting with two guards can help, in that it postpones
the time until you're forced to transition, and maybe also because if
we do it right it can make the transition less visible.

But I wonder if we're looking at this backwards, and the primary
question we should be asking is "How can we protect the transition between
guards?" Then one of the potential answers to consider is "Maybe we should
start out with two guards rather than just one." Framing it that way,
are there more options that we should consider too? For example, removing
the ability of the non-local attacker to trigger a transition? Then
there would still be visibility of a transition, but the (non-local)
attacker can't impact the timing of the transition. How much does that
solve? Need to think more.
Post by Mike Perry
3.2. No Guard-flagged nodes as exit, RP, IP, or HSDIRs
Similar to 3.1, we could instead forbid the use of Guard-flagged nodes
for the exit, IP, RP, and HSDIR positions.
This solution has two problems: First, like 3.1, it also does not handle
the case where resource exhaustion could force the use of a second
guard. Second, it requires clients to upgrade to the new behavior and
stop using Guard flagged nodes before it can be deployed.
I'm not much of a fan of this approach (it seems so inelegant!), but
I find the two problems that you identified to be unsatisfying for
ruling it out. I wonder if we can find some stronger arguments against
this approach?

Otherwise I might find myself starting to like it. :)

One stronger argument might be: "the attacker can always use Guard-flagged
nodes for other hops on its half of the circuit, and you wouldn't even
be able to know that it's doing it, so if the goal is to never have a
circuit with your guard both at your end and also reused elsewhere in
the circuit, sorry you can't achieve that goal, so stop messing stuff
up while trying to achieve what can only ever be a partial solution."
Post by Mike Perry
4. The future is confluxed
An additional benefit of using a second guard is that it enables us to
eventually use conflux[6].
I think the performance benefits are the main arguments in favor
of doing two guards. In fact, I still think that it's mainly a
performance-vs-safety tradeoff.

I agree with George that moving to two guards now so that we can maybe
do Conflux later is doing it the wrong way round. Since it's so easy
to switch to two guards, that should be one of the very easy steps in
moving to Conflux when we do, and taking the safety hit now in exchange
for the potential performance benefit later doesn't seem best.

But there's another performance argument we shouldn't forget: if you have
two guards, you're much more likely to have at least one guard that's
adequately fast. Right now some of the guards are fast (relative to
others), and some are slow (relative to others). If you get one of the
lower-end guards, your Tor performance is sad -- for months! We tried
to mitigate that issue when we switched to one guard, by raising the
required bandwidth to get the Guard flag, so there would be no truly
terrible guards. But still, some guards are more equal than others.

This issue came up especially in the context of the December/January CPU
overload attacks, where some guards were overwhelmed by circuit creation
requests, and if you had a happy guard, lucky you, but if you had a sad
guard, you might as well delete your Tor Browser and try again.

Now, in an ideal world we should come up with fixes for all of those other
issues, for example by taking the Guard flag away from relays that can't
be great guards. But in the world we live in right now, we can relieve
some of that pressure-to-be-perfect by giving people two guards.

But if we're only going on a performance vs safety basis, I don't see a
huge rush to trade off safety until we have a better handle on what sort
of performance benefits we'd actually get, and until we've compared to
other low-hanging performance fruit.

In summary:

(1) I think we should fix the bug from #14917 where the attacker can
push us off our guard just by naming our guard as the HSDir/IP/RP,
and I think we should fix it by being willing to reuse our guard when
it can't be avoided. That step will resolve some, but not all, of the
pressure about moving to two guards. Then

(2) Hopefully the above discussion has helped us move forward on the
remaining reasons for switching to two guards. To me the two biggest
questions left to resolve are (a) how best to protect the vulnerable
transition to a new guard, and if two guards is the best idea we've got
for that, and (b) how big an issue is it really that having only one
guard can sometimes give you a low-performance guard, and if two guards
is the best idea we've got for that one too.

--Roger
Mike Perry
2018-04-11 11:15:44 UTC
Post by Roger Dingledine
Post by Mike Perry
3.1. Eliminate path restrictions entirely
I'm increasingly a fan of this option, the more I read these threads.
Let's examine the two attacker assumptions behind two of the attacks
we're worried about.
Attack one: the client's local ISP collects coarse netflow logs, and these
logs aren't detailed enough to allow a traffic volume detection attack on
an existing long-lived TLS flow, so the connection to that first guard
is safe; but a connection to that second guard will be unusual and not
multiplexed and at exactly the time of the adversary-controlled circuit
that triggered it, so that second guard, because it is used so rarely,
is dangerous to use.
Attack two: if the client uses its guard as the first hop of its circuit
and also the adversary-requested fourth hop, then the guard can do
pairwise traffic correlation attacks on all of its circuits and realize
that these two circuits it has are really two pieces of the same circuit.
This second attack seems weird to me. One reason is because in attack
one we're brushing aside the traffic analysis as hard, whereas in attack
if your guard is going to participate in a traffic correlation attack
against you, then it could just as easily team up with some other relay
that the adversary picked. That is, avoiding reusing your guard on the
other end of the circuit isn't going to save you if your guard is out
to get you.
I agree. I am not concerned about attack two. But we're not choosing
between just these two attacks.
Post by Roger Dingledine
To be clear, the design I've been considering here is simply allowing
reuse between the guard hop and the final hop, when it can't be avoided. I
don't mean to allow the guard (or its family) to show up as all four
hops in the path. Is that the same as what you meant, or did you mean
something more thorough?
By all path restrictions I mean for the last hop of the circuit and the
first (though vanguards would be simpler if we got rid of them for other
hops, too). But I do mean all restrictions, not just guard node choice.
The adversary also gets to force you to use a second network path
whenever they want via the /16 and node family restrictions. And it
happens naturally all the time.

We're not using one guard in the current Tor. We're using two, and the
second one is only used for unmultiplexed activity. That is one property
I don't like about our "let's pretend to use one guard" status quo.

The second thing I don't like is that one guard is fragile, which
enables confirmation attacks when it can be made to go down.
Post by Roger Dingledine
I think "can't be avoided" means HSDir, IP, RP -- which I note are all
onion service related circuits.
I'd like to hear more about the "cleverly crafted exit policy" attack, and
I wonder if we can't solve that differently. For example, if it's about
making you do a request to a port that only one exit relay allows, and
ha ha whoops your guard was on the same /16 as that exit relay... maybe
it's time for the dir auths to not advertise super rare ports? This was
one of the topics in the users-get-routed paper too.
Yes that is the one I was talking about.

However, another way to do this type of exit rotation attack is to cause
a client to look up a DNS name where you control the resolver, and keep
timing out on the DNS response. The client will then retry the stream
request with a new exit. The same thing can also be done by timing out
the TCP handshake to a server you control. Both of these attacks can be
done with only the ability to inject an img tag into a page.

You repeat this until an exit is chosen that is in the same /16 or
family as the guard, and then the client uses a second network path for
an unmultiplexed request at a time you control.
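
The restriction check that this attack leans on can be sketched roughly as follows. This is an illustration only, not Tor's actual code: the relay dictionaries and the names `same_slash16`, `conflicts`, and `retries_until_conflict` are hypothetical, and the family test is simplified (it does not require that both relays list each other).

```python
import ipaddress


def same_slash16(ip_a: str, ip_b: str) -> bool:
    """Return True if two IPv4 addresses fall in the same /16."""
    net_a = ipaddress.ip_network(f"{ip_a}/16", strict=False)
    return ipaddress.ip_address(ip_b) in net_a


def conflicts(guard: dict, exit_: dict) -> bool:
    """Simplified path restriction: same relay, same /16, or shared family."""
    if guard["id"] == exit_["id"]:
        return True
    if same_slash16(guard["ip"], exit_["ip"]):
        return True
    # Assumption: either side naming the other counts as a family match.
    return exit_["id"] in guard["family"] or guard["id"] in exit_["family"]


def retries_until_conflict(guard: dict, exit_sequence: list):
    """Count the exits tried before one conflicts with the guard.

    Returns None if no exit in the sequence conflicts.
    """
    for attempt, exit_ in enumerate(exit_sequence, start=1):
        if conflicts(guard, exit_):
            return attempt
    return None
```

Each forced retry gives the adversary another draw from the exit list, so the only question is how many retries it takes before `conflicts` returns True for the client's guard.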
Post by Roger Dingledine
One non-starter idea would be to move onion-service-related Tors to two
guards, and leave other Tors at one guard. It's a non-starter because of
course advertising which you are to your local network is no good. But
that idea gave me a different perspective on this discussion: I wonder
how much this design decision comes down to making all Tors use two
guards in order to protect the onion-service-related Tors, which are
the only ones who actually need it?
Our path restrictions also cause normal exiting clients to use a second
guard for unmultiplexed activity, at adversary controlled times, or just
periodically at random.
Post by Roger Dingledine
Post by Mike Perry
However, while removing path restrictions will solve the immediate
problem, it will not address other instances where Tor temporarily opts
to use a second guard due to congestion, OOM, or failure of its primary
guard, and we're still running into bugs where this can be adversarially
controlled or just happen randomly[5].
I continue to think we need to fix these. I'm glad to see that George
has been putting some energy into looking more at them. The bugs that
we don't understand are especially worrying, since it's hard to know
how bad they are. Moving to two guards might put a bit of a bandaid on
the issues, but it can't be our long-term plan for fixing them.
We're choosing fixes for these bugs that enable an adversary to deny
service to clients at a particular guard, *without* letting those
clients move to a second guard. This enables confirmation attacks, and
these confirmation attacks can be extended to guard discovery attacks by
DoSing guards one at a time until an onion service fails.

Bringing back CREATE_FAST could help with this piece, I suppose, but it
doesn't solve OOM attacks...
Post by Roger Dingledine
Post by Mike Perry
Note that for this analysis to hold, we have to ensure that nodes that
are at RESOURCELIMIT or otherwise temporarily unresponsive do not cause
us to consider other primary guards beyond the two we have chosen.
This is accomplished by setting guard-n-primary-guards to 2 (in addition
to setting guard-n-primary-guards-to-use to 2). With this parameter
set, the proposal 271 algorithm will avoid considering more than our two
guards, unless *both* are down at once.
I like this general idea of not immediately replacing guards so long as
you have a working one. In fact, we used to do something similar back
in the day:
https://blog.torproject.org/improving-tors-anonymity-changing-guard-parameters
says (emphasis mine)
"""
Tor 0.2.3's entry guard behavior is "choose three guards, ***adding
another one if two of those three go down*** but going back to the
original ones if they come back up, and also throw out (aka rotate)
a guard 4-8 weeks after you chose it."
"""
There are still some fiddly decisions to make here. For example, as you
say we probably shouldn't replace a guard just because we failed to
connect to one of our guards once. We might decide that it's time to add
a new second guard if the consensus tells us that one of them is down
(so we have confirmation that it isn't down for just us, it's down for
everybody). Or we might decide to wait on adding a new one even if it
really is down, because maybe it'll come back soon. But how long do
we wait? And if, while we're down to one, we encounter one of these
situations where the requested fourth hop overlaps with our remaining
guard, what do we do?
If I were to drop everything to build the Tor I think should exist, I
would do the following:

1. Use two guards, replacing them only when both are unreachable, or
when one leaves the consensus.
2. Make path restrictions not as strict (for cases like the one above).
3. Use conflux (which also needs less strict/no path restrictions)
4. Build it on QUIC.
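
Item 1 above can be sketched as a small decision function. This is a sketch under stated assumptions, not the proposal 271 algorithm: the function name `guards_to_replace` and its arguments are hypothetical, and it assumes that when both guards are unreachable, both become candidates for replacement.

```python
def guards_to_replace(guards, reachable, in_consensus):
    """Decide which of our two guards (if any) should be replaced.

    guards:       list of the two guard identifiers currently in use
    reachable:    set of identifiers we can currently connect to
    in_consensus: set of identifiers listed in the current consensus
    """
    # A guard that has left the consensus is replaced on its own.
    gone = [g for g in guards if g not in in_consensus]
    if gone:
        return gone
    # Otherwise replace only when *both* guards are unreachable.
    if not any(g in reachable for g in guards):
        return list(guards)
    # One working guard is enough: keep the current pair.
    return []
```

With this rule, a single guard that is congested or briefly down does not trigger a visible move to a new guard.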

I would do them in that order because I think we get the most benefit
from #1, and we get some benefit from #2 still (as you point out above).

You keep focusing on the performance aspects of conflux, but that is not
the argument I am making. My arguments for conflux in Section 4 are
about resilience to congestion, downtime, circuit killing, and DoS, as
well as traffic analysis resistance. I see the performance benefits as
secondary.

(I also think the best arguments for QUIC are in the reliability
direction, because fixed queues mean no adversary-provoked OOMing.)
Post by Roger Dingledine
you're not concerned about one guard vs two guards, you're concerned
about *transitioning* between guards. It's that moment when you're
starting to use a new guard, if the attacker can observe that you're
doing it, and especially if the attacker can make you do it, that is
vulnerable. And starting with two guards can help, in that it postpones
the time until you're forced to transition, and maybe also because if
we do it right it can make the transition less visible.
The transition aspect is a big piece of it, but I think we're also
running into a fragility problem, which makes the transition signal very
loud in many cases.
Post by Roger Dingledine
But I wonder if we're looking at this backwards, and the primary
question we should be asking is "How can we protect the transition between
guards?" Then one of the potential answers to consider is "Maybe we should
start out with two guards rather than just one." Framing it that way,
are there more options that we should consider too? For example, removing
the ability of the non-local attacker to trigger a transition? Then
there would still be visibility of a transition, but the (non-local)
attacker can't impact the timing of the transition. How much does that
solve? Need to think more.
One guard is inherently more fragile than two, and no matter what we do,
it means that there will be a risk of attacks that can confirm guard
choice, because the downtime during this transition can never be hidden
without at least some redundancy.
Post by Roger Dingledine
(1) I think we should fix the bug from #14917 where the attacker can
push us off our guard just by naming our guard as the HSDir/IP/RP,
and I think we should fix it by being willing to reuse our guard when
it can't be avoided. That step will resolve some, but not all, of the
pressure about moving to two guards. Then
Without removing all path restrictions that apply to first and last hop,
we're still actually using two guards, and using them at times that the
adversary gets to control if they want, or just randomly otherwise.
Post by Roger Dingledine
(2) Hopefully the above discussion has helped us move forward on the
remaining reasons for switching to two guards. To me the two biggest
questions left to resolve are (a) how best to protect the vulnerable
transition to a new guard, and if two guards is the best idea we've got
for that, and (b) how big an issue is it really that having only one
guard can sometimes give you a low-performance guard, and if two guards
is the best idea we've got for that one too.
Transitions will always be noisy with one guard, because it is fragile
to DoS, congestion, OOM, circuit failure, onionskin overload, etc etc
etc. How can you provide resiliency under arbitrary and partial failure
without any redundancy?
--
Mike Perry
Roger Dingledine
2018-04-18 08:27:51 UTC
Post by Mike Perry
Post by Roger Dingledine
To be clear, the design I've been considering here is simply allowing
reuse between the guard hop and the final hop, when it can't be avoided. I
don't mean to allow the guard (or its family) to show up as all four
hops in the path. Is that the same as what you meant, or did you mean
something more thorough?
By all path restrictions I mean for the last hop of the circuit and the
first (though vanguards would be simpler if we got rid of them for other
hops, too).
Can you lay out for us the things to think about in the Vanguard design?
Last I checked there were quite a few Vanguard design variants, ranging
from "two vanguards per guard, tree style" to some sort of mesh.

In particular, it would be convenient if there is a frontrunner design
that really would benefit from relaxing many path restrictions, and a
frontrunner design that is not so tied together to the path restriction
question.
Post by Mike Perry
But I do mean all restrictions, not just guard node choice.
The adversary also gets to force you to use a second network path
whenever they want via the /16 and node family restrictions.
Can you give us a specific example here, for this phrase "network
path"? When you say "second network path" are you thinking in the
Vanguard world?
Post by Mike Perry
We're not using one guard in the current Tor. We're using two, and the
second one is only used for unmultiplexed activity. That is one property
I don't like about our "let's pretend to use one guard" status quo.
Right, I agree.
Post by Mike Perry
Post by Roger Dingledine
I'd like to hear more about the "cleverly crafted exit policy" attack
another way to do this type of exit rotation attack is to cause
a client to look up a DNS name where you control the resolver, and keep
timing out on the DNS response. The client will then retry the stream
request with a new exit. The same thing can also be done by timing out
the TCP handshake to a server you control. Both of these attacks can be
done with only the ability to inject an img tag into a page.
You repeat this until an exit is chosen that is in the same /16 or
family as the guard, and then the client uses a second network path for
an unmultiplexed request at a time you control.
Hm! Yes, this is a yucky one. (I don't think just an img tag would be
enough, because Tor will try a few circuits and then give up. You'd need
some sort of javascript or refresh chain or the like that generates new
addresses and tries them in succession. But that's totally feasible.)

This one is also yucky because we could also imagine a different way to
pick your path, where when you're selecting your exit, you avoid choosing
exits which would conflict with your guard, and thus you'll never be
pushed off of your guard. But then the destination website can do this
same attack over time and notice which exit you never try to use. So
this is a case where to blend in best, we *need* to be willing to use
all of the potential exits.

But since normal exit circuits are three hops, if we simply relax the
path restrictions, we could be making a circuit of the form "A - B - A",
which would not only stand out as weird to B, but actually right now a
relay in B's position will refuse such a circuit. Bad news all around.

The three fixes that come to mind are

(A) "Have two guards": so you can pick any exit you like, and then just
use the guard that doesn't conflict with the exit you picked.

(B) "Add a bonus hop when needed": First relax the /16 and family
restrictions, so the remaining issue is reuse of your guard. Then if
you find that you just chose your guard as your exit, insert an extra
hop in the middle of that circuit.

(C) "Exits can't be Guards": First relax the /16 and family restrictions,
so the remaining issue is reuse of your guard. Then notice that due
to exit scarcity, guards aren't actually used in the exit position
anyway. Then enforce that rule (so they can't be in the future either).

All three of these choices have downsides. But all three of them look
like improvements over the current situation -- because of how crappy
the current situation is.

(Rejected option (D): "Just start allowing it": Relax the /16 and
family restrictions, and also relax the rule where relays refuse a
circuit that goes right back where it came from. Giving the middle node
that much information about the circuit just wigs me out.)
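
Fixes (A) and (B) can be sketched as follows. This is an illustration under assumptions, not Tor's path selection code: the names `pick_guard_for_exit` and `build_path` are hypothetical, relays are plain strings, and the `conflicts` callable stands in for whatever restriction check remains in force.

```python
def pick_guard_for_exit(guards, exit_, conflicts):
    """Fix (A): use the first of our guards that does not conflict with the exit.

    Returns None if every guard conflicts, in which case the caller
    still needs some other fallback.
    """
    for guard in guards:
        if not conflicts(guard, exit_):
            return guard
    return None


def build_path(guard, middle, exit_, bonus_hop):
    """Fix (B): insert an extra hop when the guard was also chosen as the exit.

    Without the extra hop the path would be "A - B - A", which the
    relay in B's position would notice and refuse.
    """
    if guard == exit_:
        return [guard, middle, bonus_hop, exit_]
    return [guard, middle, exit_]
```

Note that (A) alone returns None when the client is down to one working guard and that guard is picked as the exit, so some form of (B) or (C) is still needed for that case.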

Also, notice that I think Mike's proposed design will turn out to be some
combination of "A" and also something like "B" or "C", because even if
you start with two guards, if you don't add a new guard right when your
first guard goes down, you might find yourself in the situation where
you have one working guard, and you pick it as your exit, and now you
need to do *something*.
Post by Mike Perry
Our path restrictions also cause normal exiting clients to use a second
guard for unmultiplexed activity, at adversary controlled times, or just
periodically at random.
Just to make sure I understand: at least on the current network,
that's because of the /16 rule and the family rule, and not because of
the "if the exit you picked turns out to be your guard too, move to a
different guard" rule, because exits aren't normally used for guards on
our current network?

On more examination though, that's not something to rely on with our
current design, since I bet there are weird edge cases like a relay
loses its Guard flag, but it's still your Guard so you keep using it
(depending on the advice del año from #17773), but now the weightings
let you pick it for your Exit, and oops.

Another problematic example would be a relay that you picked as your
Guard, and later it opened up its exit policy and became an Exit.

So if I wanted to try to flesh out my "Then enforce that rule" approach
above, we would need to (1) Have dir auths take away the Guard flag from
relays that can be used as Exits, and (2) Make sure that clients know
that if their guards lose the Guard flag, they should treat them as being
no longer guardworthy. I think we're doing that second one right now,
based on my latest reading of #17773, so this would actually be a pretty
easy change. But still, it's not exactly elegant.
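
The two steps of the "Then enforce that rule" approach can be sketched like this. It is a minimal sketch, not dir auth or client code: the function names are hypothetical, and relay flags are modeled as plain sets of strings keyed by a relay identifier.

```python
def strip_guard_from_exits(relay_flags):
    """Step (1), dir auth side: no relay keeps Guard if it also has Exit."""
    return {
        relay: (flags - {"Guard"}) if "Exit" in flags else set(flags)
        for relay, flags in relay_flags.items()
    }


def still_guardworthy(guard_id, consensus_flags):
    """Step (2), client side: a guard that lost the Guard flag is dropped."""
    return "Guard" in consensus_flags.get(guard_id, set())
```

For example, a relay that opens up its exit policy after being picked as a guard would lose the Guard flag in step (1), and the client would stop treating it as a guard in step (2).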
Post by Mike Perry
Post by Roger Dingledine
Post by Mike Perry
However, while removing path restrictions will solve the immediate
problem, it will not address other instances where Tor temporarily opts
to use a second guard due to congestion, OOM, or failure of its primary
guard, and we're still running into bugs where this can be adversarially
controlled or just happen randomly[5].
I continue to think we need to fix these. I'm glad to see that George
has been putting some energy into looking more at them. The bugs that
we don't understand are especially worrying, since it's hard to know
how bad they are. Moving to two guards might put a bit of a bandaid on
the issues, but it can't be our long-term plan for fixing them.
We're choosing fixes for these bugs that enable an adversary to deny
service to clients at a particular guard, *without* letting those
clients move to a second guard. This enables confirmation attacks, and
these confirmation attacks can be extended to guard discovery attacks by
DoSing guards one at a time until an onion service fails.
I would find non-onion-service examples more compelling here, since I
want to avoid falling back into the "well, onion services need special
treatment to be safe, so we have to choose between hurting normal clients
and hurting onion services" trap.

How is this for an alternative scenario to be considering: the attacking
website gives the Tor Browser user some page content that causes the
browser to initiate periodic events. Then it starts congesting guards
one at a time until the events stop arriving.

Are those two scenarios basically equivalent in terms of the confirmation
attacks you are worrying about? I hope yes, and now I can stop getting
distracted by wondering if going to this effort is worth it only to
protect onion services? :)
Post by Mike Perry
You keep focusing on the performance aspects of conflux, but that is not
the argument I am making. My arguments for conflux in Section 4 are
about resilience to congestion, downtime, circuit killing, and DoS, as
well as traffic analysis resistance. I see the performance benefits as
secondary.
I like conflux in theory, but somebody needs to do the other 90%
of the work to make it a concrete thing that we can consider.

I continue to think "Tor should switch to two guards, because one day
we should design and deploy conflux" is a terrible reason to switch to
two guards now.

So I didn't mean to mix the conflux discussion and the performance
discussion. I meant to mostly ignore the conflux discussion (because it
is a future proposal, not this one), while also making sure that we don't
forget the potential performance benefits of having two guards in general.
Post by Mike Perry
Post by Roger Dingledine
But I wonder if we're looking at this backwards, and the primary
question we should be asking is "How can we protect the transition between
guards?" Then one of the potential answers to consider is "Maybe we should
start out with two guards rather than just one." Framing it that way,
are there more options that we should consider too? For example, removing
the ability of the non-local attacker to trigger a transition? Then
there would still be visibility of a transition, but the (non-local)
attacker can't impact the timing of the transition. How much does that
solve? Need to think more.
One guard is inherently more fragile than two, and no matter what we do,
it means that there will be a risk of attacks that can confirm guard
choice, because the downtime during this transition can never be hidden
without at least some redundancy.
How's this for another option: clients have two guards, but they have
a first guard and a backup guard. They do the traffic padding to both
of them, to ensure continuous netflow sessions in their local ISP's
logs. But they try to send most of their traffic over the first guard,
thus avoiding most of the "increased surface area" concerns about using
two guards at once. And we try to reduce the frequency of situations where
they can't use their first guard. But in the "transition" situations
that we decide we need to keep, they use their backup guard, and it's
already available and ready and that netflow session is already active
in the eyes of their ISP.
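
The first-guard/backup-guard idea can be sketched as follows. This is a sketch under assumptions, not an implementation: `GuardConn` and `choose_guard` are hypothetical names, and "usable" is a stand-in for whatever test decides that a guard can build new circuits right now.

```python
from dataclasses import dataclass


@dataclass
class GuardConn:
    name: str
    usable: bool = True  # can this guard build new circuits right now?
    padded: bool = True  # connection is kept open and padded


def choose_guard(first: GuardConn, backup: GuardConn):
    """Prefer the first guard; fall back to the already-open backup."""
    if first.usable:
        return first
    if backup.usable:
        return backup
    return None  # both are unusable; the caller must handle this case
```

Because the backup connection is already open and padded, the fallback branch does not create a new netflow session at the moment of transition.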

This approach isn't conflux (yet), but it's not incompatible with later
changing things so we do conflux.

It also doesn't get us the lower variance of performance that having
two equally used guards would get us. But I am ok with that for now,
at least until somebody has done some performance analysis to show that
we're really suffering now and we would stop suffering then.

It adds load onto the relays, by almost doubling the number of sockets
used by guards for clients, and also by adding more bandwidth load from
the padding cells to/from the backup guard. (How much bandwidth load is
this, per client?)

And it doesn't actually provide as much "real" cover traffic onto the
backup guard in most situations, so somebody who can look more thoroughly
at the traffic flows will still be able to distinguish a transition
event from the first to the backup. Maybe that's a problem? Or maybe
the netflow level adversary that we declared in the threat model can't
do that, and a real attacker would be able to see the traffic details
anyway, so we're fine^W^Wno worse off than before?

Assuming this design meets all of our goals, let's examine two variants
of it to make sure we understand what we're actually trading off. In
particular, consider a design where we maintain (and pad) these two
connections, vs a design where we maintain a connection to our first
guard and then launch a connection to the backup guard on demand. The
downside of keeping the backup connection open is the extra network-wide
socket and bandwidth load on relays, while the downsides of launching
a connection on demand are the risk that a local netflow-level ISP can
see when we transition to using the backup guard, plus the risk that a
remote attacker who can cripple guards will be able to notice the delay
in the "launch on demand case" but could not distinguish the delay in
the "two connections" case.

That second risk doesn't seem so scary to me, since local handshakes
should be a small fraction of the overall time it takes to build and use
a new circuit. But above you say "the downtime during this transition can
never be hidden without at least some redundancy", so if you think this
risk is scary, I'd like to hear more details about why. (Maybe the design
you were concerned about was one where we just freeze in place and fail
when we don't want to use our first guard? I agree, that's a bad design,
and we can do better, for example by "be willing to use the second guard".)

Whereas that first risk does seem plausible to me -- worth trying to
reduce. I think we should start by enumerating as many scary scenarios
as we can (where scary means "currently we would shift away from our
first guard"), and then fix as many of them as we can. Then we should
look at the remaining scenarios where we would switch over to using our
backup guard (like, when our first guard isn't able to build new circuits
for us), and decide if the cost of the additional load on the network is
worth hiding that transition timing from a netflow-level client-side-ISP
adversary. I can see the answer being "yes, it's worth it", but I think it
will be useful to have a good handle on which transition scenarios remain.

--Roger
Mike Perry
2018-04-18 23:31:26 UTC
Post by Roger Dingledine
Post by Mike Perry
Post by Roger Dingledine
To be clear, the design I've been considering here is simply allowing
reuse between the guard hop and the final hop, when it can't be avoided. I
don't mean to allow the guard (or its family) to show up as all four
hops in the path. Is that the same as what you meant, or did you mean
something more thorough?
By all path restrictions I mean for the last hop of the circuit and the
first (though vanguards would be simpler if we got rid of them for other
hops, too).
Can you lay out for us the things to think about in the Vanguard design?
Last I checked there were quite a few Vanguard design variants, ranging
from "two vanguards per guard, tree style" to some sort of mesh.
In particular, it would be convenient if there is a frontrunner design
that really would benefit from relaxing many path restrictions, and a
frontrunner design that is not so tied together to the path restriction
question.
There are two frontrunner forms. One has no path restrictions, the other
would try to perform restriction checks on each layer to ensure that it
is valid and doesn't leak info about other layers or prevent circuit
creation.

They are otherwise the same. Both are mesh; both are tunable in the
number of guards and rotation times in each layer.

I am leaning towards "no restrictions" for vanguards for 0.3.4 because
it is simpler, and it did not strike me that the arguments in favor of
path restrictions justified trying to implement them quickly in a way
that might cause reachability or path influence risks.
Post by Roger Dingledine
Post by Mike Perry
But I do mean all restrictions, not just guard node choice.
The adversary also gets to force you to use a second network path
whenever they want via the /16 and node family restrictions.
Can you give us a specific example here, for this phrase "network
path"? When you say "second network path" are you thinking in the
Vanguard world?
Second path to entry into the Tor network (and a second guard),
regardless of vanguards.
Post by Roger Dingledine
Post by Mike Perry
Post by Roger Dingledine
I'd like to hear more about the "cleverly crafted exit policy" attack
another way to do this type of exit rotation attack is to cause
a client to look up a DNS name where you control the resolver, and keep
timing out on the DNS response. The client will then retry the stream
request with a new exit. The same thing can also be done by timing out
the TCP handshake to a server you control. Both of these attacks can be
done with only the ability to inject an img tag into a page.
You repeat this until an exit is chosen that is in the same /16 or
family as the guard, and then the client uses a second network path for
an unmultiplexed request at a time you control.
The three fixes that come to mind are:
(A) "Have two guards": so you can pick any exit you like, and then just
    use the guard that doesn't conflict with the exit you picked.
(B) "Add a bonus hop when needed": First relax the /16 and family
    restrictions, so the remaining issue is reuse of your guard. Then if
    you find that you just chose your guard as your exit, insert an extra
    hop in the middle of that circuit.
(C) "Exits can't be Guards": First relax the /16 and family restrictions,
    so the remaining issue is reuse of your guard. Then notice that due
    to exit scarcity, guards aren't actually used in the exit position
    anyway. Then enforce that rule (so they can't be in the future either).
All three of these choices have downsides. But all three of them look
like improvements over the current situation -- because of how crappy
the current situation is.
(Rejected option (D): "Just start allowing it": Relax the /16 and
family restrictions, and also relax the rule where relays refuse a
circuit that goes right back where it came from. Giving the middle node
that much information about the circuit just wigs me out.)
Also, notice that I think Mike's proposed design will turn out to be some
combination of "A" and also something like "B" or "C", because even if
you start with two guards, if you don't add a new guard right when your
first guard goes down, you might find yourself in the situation where
you have one working guard, and you pick it as your exit, and now you
need to do *something*.
The one-guard-down case does impact things. But even when this does
happen (which should be rare), it should only be true for a small window
of time before the consensus updates.

The "down" guard should either be temporarily overloaded, or fully down
and kicked off the consensus. I think we should only add a new guard
when one falls out of the consensus, or both are unreachable/unusable.

This is why I think it is OK to take an incremental approach and
start with A, and roll out things like B and C and other restriction
relaxations.

During these edge cases, the most important property that we should
strive to preserve is overall reachability. I don't like situations
where the adversary gains information by certain nodes being overloaded
or down. In my view, trying to make smart decisions to minimize exposure
to more nodes is secondary to overall reachability. (A loss of
reachability allows a *non-network* adversary to gain information about
how clients are using our network. That strikes me as a lower-resource,
more dangerous attack than the unknown risk of possible partial network
observers. In other words, I believe we made the right short-term call
in #14917 in terms of preserving reachability.)
Post by Roger Dingledine
Post by Mike Perry
Our path restrictions also cause normal exiting clients to use a second
guard for unmultiplexed activity, at adversary-controlled times, or just
periodically at random.
Just to make sure I understand: at least on the current network,
that's because of the /16 rule and the family rule, and not because of
the "if the exit you picked turns out to be your guard too, move to a
different guard" rule, because exits aren't normally used for guards on
our current network?
On more examination though, that's not something to rely on with our
current design, since I bet there are weird edge cases like a relay
loses its Guard flag, but it's still your Guard so you keep using it
(depending on the advice from #17773), but now the weightings
let you pick it for your Exit, and oops.
Another problematic example would be a relay that you picked as your
Guard, and later it opened up its exit policy and became an Exit.
I am in favor of preventing guards from being exits. Intuitively, it
means fewer "one stop shop" surveillance points to see both entry and
exit traffic. It also makes flag-based load balancing equations much
simpler, and makes it easier to account for padding overhead.
Post by Roger Dingledine
So if I wanted to try to flesh out my "Then enforce that rule" approach
above, we would need to (1) Have dir auths take away the Guard flag from
relays that can be used as Exits, and (2) Make sure that clients know
that if their guards lose the Guard flag, they should treat them as being
no longer guardworthy. I think we're doing that second one right now,
based on my latest reading of #17773, so this would actually be a pretty
easy change. But still, it's not exactly elegant.
In the world where we keep path restrictions, these would be my rules:
1. Two equal guards, chosen so that they are not in the same /16 or
   family.
2. Choose the members of each vanguard layer such that each layer has at
   least one node from a unique /16 and family.
3. Build paths in a strict order, from the last hop towards the guard. If
   you can't build a path with this ordering, start over with a sampled
   guard. (With rules #1 and #2, this should be very rare and should mean
   that a guard is marked down locally but still marked up in the
   consensus.)
4. No guards as exits (not needed, but do it anyway for other reasons).

Then, under these rules, you decide to use a new primary guard if:
0. A guard leaves the consensus: replace it with a new primary guard.
1. Your two primaries are locally down or unusable (i.e. step #3 above
   fails): temporarily pick a new guard.
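
As an illustration only, rule 1 and the "use the guard that doesn't
conflict with the exit" idea could be sketched as follows. This is not
Tor's implementation: the Relay type, the single family label, and the
helper names are hypothetical, and real family declarations are more
involved than one string per relay.

```python
import ipaddress
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Relay:
    name: str
    ip: str                       # IPv4 address as a string
    family: Optional[str] = None  # hypothetical family label; None if undeclared


def same_slash16(a: Relay, b: Relay) -> bool:
    """True if both relays sit in the same IPv4 /16."""
    net = ipaddress.ip_network(f"{a.ip}/16", strict=False)
    return ipaddress.ip_address(b.ip) in net


def conflicts(a: Relay, b: Relay) -> bool:
    """True if the two relays may not share a path (same relay, /16, or family)."""
    same_family = a.family is not None and a.family == b.family
    return a.name == b.name or same_slash16(a, b) or same_family


def valid_guard_pair(g1: Relay, g2: Relay) -> bool:
    """Rule 1: the two primary guards must not conflict with each other."""
    return not conflicts(g1, g2)


def usable_guards(guards: List[Relay], exit_relay: Relay) -> List[Relay]:
    """Guards that can be combined with an exit that was picked first."""
    return [g for g in guards if not conflicts(g, exit_relay)]


guard1 = Relay("guard1", "10.1.2.3", family="famA")
guard2 = Relay("guard2", "10.2.9.9")
exit_relay = Relay("exit1", "10.1.200.1")  # shares a /16 with guard1

print(valid_guard_pair(guard1, guard2))                               # True
print([g.name for g in usable_guards([guard1, guard2], exit_relay)])  # ['guard2']
```

Here the exit collides with guard1 on the /16 check, so the path is built
through guard2 instead of reaching for a third guard.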
Post by Roger Dingledine
Post by Mike Perry
Post by Roger Dingledine
Post by Mike Perry
However, while removing path restrictions will solve the immediate
problem, it will not address other instances where Tor temporarily opts
to use a second guard due to congestion, OOM, or failure of its primary
guard, and we're still running into bugs where this can be adversarially
controlled or just happen randomly[5].
I continue to think we need to fix these. I'm glad to see that George
has been putting some energy into looking more at them. The bugs that
we don't understand are especially worrying, since it's hard to know
how bad they are. Moving to two guards might put a bit of a bandaid on
the issues, but it can't be our long-term plan for fixing them.
We're choosing fixes for these bugs that enable an adversary to deny
service to clients at a particular guard, *without* letting those
clients move to a second guard. This enables confirmation attacks, and
these confirmation attacks can be extended to guard discovery attacks by
DoSing guards one at a time until an onion service fails.
I would find non-onion-service examples more compelling here, since I
want to avoid falling back into the "well, onion services need special
treatment to be safe, so we have to choose between hurting normal clients
and hurting onion services" trap.
How is this for an alternative scenario to be considering: the attacking
website gives the Tor Browser user some page content that causes the
browser to initiate periodic events. Then it starts congesting guards
one at a time until the events stop arriving.
Are those two scenarios basically equivalent in terms of the confirmation
attacks you are worrying about? I hope yes, and now I can stop getting
distracted by wondering if going to this effort is worth it only to
protect onion services? :)
Yes.
Post by Roger Dingledine
Post by Mike Perry
Post by Roger Dingledine
But I wonder if we're looking at this backwards, and the primary
question we should be asking is "How can we protect the transition between
guards?" Then one of the potential answers to consider is "Maybe we should
start out with two guards rather than just one." Framing it that way,
are there more options that we should consider too? For example, removing
the ability of the non-local attacker to trigger a transition? Then
there would still be visibility of a transition, but the (non-local)
attacker can't impact the timing of the transition. How much does that
solve? Need to think more.
One guard is inherently more fragile than two, and no matter what we do,
it means that there will be a risk of attacks that can confirm guard
choice, because the downtime during this transition can never be hidden
without at least some redundancy.
How's this for another option: clients have two guards, but they have
a first guard and a backup guard. They do the traffic padding to both
of them, to ensure continuous netflow sessions in their local ISP's
logs. But they try to send most of their traffic over the first guard,
thus avoiding most of the "increased surface area" concerns about using
two guards at once. And we try to reduce the frequency of situations where
they can't use their first guard. But in the "transition" situations
that we decide we need to keep, they use their backup guard, and it's
already available and ready and that netflow session is already active
in the eyes of their ISP.
This approach isn't conflux (yet), but it's not incompatible with later
changing things so we do conflux.
It also doesn't get us the lower variance of performance that having
two equally used guards would get us. But I am ok with that for now,
at least until somebody has done some performance analysis to show that
we're really suffering now and we would stop suffering then.
FYI, we actually do have one form of this info in figure 10 of
https://www.freehaven.net/anonbib/cache/wpes12-cogs.pdf

We get the largest performance gains from going from one guard to two,
in terms of reducing the variance (flatness) of that CDF.

Qualitatively, this means way fewer users who try Tor and experience a
very slow Tor, telling their friends that it is too slow and should not
be used. This is a real thing. Web UX folks have found that it happens
with perf variances in the sub-second range with websites.
Post by Roger Dingledine
It adds load onto the relays, by almost doubling the number of sockets
used by guards for clients, and also by adding more bandwidth load from
the padding cells to/from the backup guard. (How much bandwidth load is
this, per client?)
And it doesn't actually provide as much "real" cover traffic onto the
backup guard in most situations, so somebody who can look more thoroughly
at the traffic flows will still be able to distinguish a transition
event from the first to the backup. Maybe that's a problem? Or maybe
the netflow level adversary that we declared in the threat model can't
do that, and a real attacker would be able to see the traffic details
anyway, so we're fine^W^Wno worse off than before?
There are a couple things here that make me think we may still be worse
off.

1. The netflow padding is not designed to simulate client traffic. It is
designed to aggregate client traffic together over time in the
adversary's logs. Instead of seeing a discrete "520KB xfer in this 15
second period, 80KB in that one, and 2300KB in that one, and then
silence for 25 minutes", the adversary records "2900KB traffic total in
this half hour". For this aggregation to help, there really needs to be
other traffic during that half hour. This is why I keep saying that more
concurrent activity is better than only using the second guard
sometimes. (WTF-PAD could do things like you describe above, but we need
to program histograms+state machines for that).

2. Detection of when to switch to this second guard seems complicated
and error-prone, and if it results in unavailability, it is strictly
worse. If it switches to the second guard at the first sign of
RESOURCELIMIT and path selection issues, well, then you're adding a lot
of complexity for how much benefit? (That complexity could also be
manipulated by the adversary.)
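
To make the aggregation in point 1 concrete, here is a toy sketch using
the numbers from that example. It only models the bookkeeping on the
adversary's side (bursts collapsing into one record per window); it does
not model the padding itself or any real netflow exporter. The burst
start times, the window lengths, and the function name are assumptions
made for illustration.

```python
from collections import defaultdict


def aggregate_flows(records, window_s=1800):
    """Sum (start_time_s, kilobytes) records into fixed-length windows.

    Returns a dict mapping window index -> total kilobytes in that window.
    """
    totals = defaultdict(int)
    for start_s, kb in records:
        totals[start_s // window_s] += kb
    return dict(totals)


# Three bursts followed by silence (start times are assumed).
bursts = [(0, 520), (15, 80), (30, 2300)]

# Fine-grained records: each burst is visible on its own.
print(aggregate_flows(bursts, window_s=15))    # {0: 520, 1: 80, 2: 2300}

# One record per half hour: only the total survives.
print(aggregate_flows(bursts, window_s=1800))  # {0: 2900}
```

The half-hour record hides the burst timing, but it says nothing more
than the client's own total unless other traffic lands in the same
window, which is the argument for more concurrent activity.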
Post by Roger Dingledine
Whereas that first risk does seem plausible to me -- worth trying to
reduce. I think we should start by enumerating as many scary scenarios
as we can (where scary means "currently we would shift away from our
first guard"), and then fix as many of them as we can. Then we should
look at the remaining scenarios where we would switch over to using our
backup guard (like, when our first guard isn't able to build new circuits
for us), and decide if the cost of the additional load on the network is
worth hiding that transition timing from a netflow-level client-side-ISP
adversary. I can see the answer being "yes, it's worth it", but I think it
will be useful to have a good handle on which transition scenarios remain.
Well, "fixing" the largest, most frequent, and adversary controlled
classes of these requires:

1. Removing path restrictions.
2. Recognizing DoS attacks and differentiating them from bad network
conditions.

#2 is what worries me. Any solution to #2 that is agile enough to avoid
downtime strikes me as no better than "switch to guard #2 with
probability 1/2 after a RESOURCELIMIT or any other circuit failure"
(which is what the code would do today with two equal guards), and a
hell of a lot more complex (with risk of a downtime signal or adversary
path influence if we get it wrong).
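
For reference, the baseline described above (two equal guards, so any
retry lands on guard #2 with probability 1/2) can be sketched as a
uniform choice. This is a toy model, not Tor's guard selection code; the
guard names and function names are hypothetical.

```python
import random


def pick_guard(guards, rng):
    """Pick one of the equally weighted primary guards for the next circuit."""
    return rng.choice(guards)


def fraction_on_second_guard(trials, seed=0):
    """Estimate how often a circuit attempt lands on the second guard."""
    rng = random.Random(seed)
    guards = ["guard1", "guard2"]
    hits = sum(pick_guard(guards, rng) == "guard2" for _ in range(trials))
    return hits / trials


# With two equal guards this should come out close to 0.5.
print(fraction_on_second_guard(10_000))
```

No failure detection is involved: the switch happens with the same
probability whether or not the first guard is under attack.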
--
Mike Perry