[tor-dev] Proposal: The move to two guard nodes

Added as proposal 291!

George Kadianakis

2018-04-10 15:33:58 UTC

Post by Mike Perry
https://gitweb.torproject.org/user/mikeperry/torspec.git/tree/proposals/xxx-two-guard-nodes.txt?h=twoguards
===========================
Filename: xxx-two-guard-nodes.txt
Title: The move to two guard nodes
Author: Mike Perry
Created: 2018-03-22
Supersedes: Proposal 236
<snip>
3.1. Eliminate path restrictions entirely
If Tor decided to stop enforcing /16, node family, and also allowed the
guard node to be chosen twice in the path, then under normal conditions,
it should retain the use of its primary guard.
This approach is not as extreme as it seems on face. In fact, it is hard
to come up with arguments against removing these restrictions. Tor's
/16 restriction is of questionable utility against monitoring, and it can
be argued that since only good actors use node family, it gives influence
over path selection to bad actors in ways that are worse than the benefit
it provides to paths through good actors[10,11].
However, while removing path restrictions will solve the immediate
problem, it will not address other instances where Tor temporarily opts
to use a second guard due to congestion, OOM, or failure of its primary
guard, and we're still running into bugs where this can be adversarially
controlled or just happen randomly[5].

Hello Mike,

IMO we should not portray removing the above path restrictions as
something extreme, until we have good evidence that those path
restrictions offer something positive in the cases we are
examining. Personally, I see this proposal's effect of making Sybil
attacks twice as fast (section 2.3) as an equally radical
result.

That said, I feel that this proposal is valuable and I'm not trying to
say that I don't like this proposal, or that I don't buy the
arguments. I'm trying to say that I don't know how to weigh the
tradeoffs here so that I gain confidence, because I'm not sure how
people are trying to attack Tor clients right now.

The way I see it is that if we adopt this proposal:
+ We are better defended against active attacks like congestion attacks
and OOM/DoS attacks.
+ We improve network health by reducing congestion to certain guards.
- Sybil attacks can be performed two times more quickly.

IMO, we should not rush this decision for 034, given that it's a
consensus parameter change that can happen instantaneously. However, we
should do the following soon:

1) Accept that there is no single best guard topology, and fix our
codebase to work well with either one guard or two guards, so that we
are ready for when we flip the switch. Perhaps we can fix
#25753/#25705/etc. in a way that works well both now and in the
2-guard future?

2) Investigate our current prop#271 codebase and make sure that the
paragraph below will work as intended if we do this proposal.

3) Involve more people in this (Roger, NRL, etc.) and have them think
about this, to gain more confidence.

Do you think this approach is too slow or backwards?

Post by Mike Perry
Note that for this analysis to hold, we have to ensure that nodes that
are at RESOURCELIMIT or otherwise temporarily unresponsive do not cause
us to consider other primary guards beyond the two we have chosen.
This is accomplished by setting guard-n-primary-guards to 2 (in addition
to setting guard-n-primary-guards-to-use to 2). With this parameter
set, the proposal 271 algorithm will avoid considering more than our two
guards, unless *both* are down at once.

OK, the above paragraph is basically the juice of this proposal! I spent
all day today investigating how this would work! The results are very
positive, but also not 100% straightforward because of the various
intricacies of prop#271.

[First of all, there is no way to simulate the above topology using the
config file because if you set NumEntryGuards=2 in your torrc, Tor will
set up 4 primary guards because of the way get_n_primary_guards()
works. So I hacked my Tor client to *have* 2 primary guards
(guard-n-primary-guards), and *use* 2 primary guards
(guard-n-primary-guards-to-use).]

The good part: This topology works exactly how the proposal wants it to
work. Because of the way primary guards work, you will have 2 primary
guards, and if one of them goes down you will always use the other
primary, instead of falling back to a third guard. That's excellent, but
it also abuses the primary guard feature: in a good way, but not in the
way we intended it to be used.
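
A toy model of the behaviour described above may help. It is only a
sketch, not the prop#271 algorithm: the function name pick_guards and the
guard labels G1..G3 are made up for illustration, and "reachable" is
simply a set handed in by the caller.

```python
def pick_guards(primaries, reachable, n_to_use):
    """Return the primaries a client would use, in priority order.

    Toy model only: keep the first `n_to_use` reachable primaries.
    An empty result stands for "all primaries are down", the case in
    which real Tor has to look at non-primary guards.
    """
    usable = [g for g in primaries if g in reachable]
    return usable[:n_to_use]


# Two primaries, both in use (the setting discussed in this thread).
two = ["G1", "G2"]
print(pick_guards(two, {"G1", "G2"}, 2))  # both primaries
print(pick_guards(two, {"G2"}, 2))        # only G2, no third guard
print(pick_guards(two, set(), 2))         # empty: every primary is down

# Three primaries with one in use: losing G1 moves the client to G2.
three = ["G1", "G2", "G3"]
print(pick_guards(three, {"G2", "G3"}, 1))
```

With only two primaries the "empty" case becomes more likely, which is
the first side-effect listed below.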

Here are the side-effects from this abuse:

- By reducing the number of primaries from three to two, it's more
likely that all primaries can be down at a given time. Prop#271 was
written with an inherent assumption that one of the primaries will
always be reachable, because when all of them are down the code goes
into an "oh shit! bad reachability!" mode which was mainly designed
for network-down scenarios (like no-internet-land, or tunnels).

I'm referring to the UPDATE_WAITING section of prop#271 and
entry_guards_upgrade_waiting_circuits() in our codebase which takes
care of this situation. This behavior will basically delay circuits on
non-primary guards until a primary guard goes online. You can test
this behavior by blocking connections to all your primaries using
iptables. I did this today, and while Tor worked fine after some time,
there were delays and broken circuits. It's very likely we can
optimize this behavior if we want, so this is not really a blocker for
this proposal, but something we should think about and experiment
with...

We might also want to consider writing code to block clients from
skipping to lower-priority primary guards if higher-priority primary
guards are still reachable and guard-n-primary-guards-to-use > 1, so
that we can have more primary guards than we need without skipping
them when one of them goes down. That would allow us to get the
effect of prop#291 while maintaining the original use of primary guards.

- If we set the number of primary guards to 2 and we leave
NumDirectoryGuards at 3, then NumDirectoryGuards will not work as
intended, and we will actually always use our two primary guards for
dirinfo as long as one of them is reachable. This is not a huge
problem, and might be a feature, but not the way we were intending to
use NumDirectoryGuards (see #13908 and
https://lists.torproject.org/pipermail/tor-dev/2014-May/006820.html).

Other than the above side-effects, Tor worked fine all day and only
connected to the primary guards, even when I blocked connections to one
of them. It was actually quite nice to see!

---

Hope this was useful and let me know if you have questions!

George Kadianakis

2018-04-12 16:25:39 UTC

Seems like the above paragraph is our main argument against removing
path restrictions.

Might be worth pointing out that if congestion/OOM attacks are in our
threat model against the current single guard design, then the same
adversary can force prop#291 to open a connection to the *third* guard
by first doing an OOM/congestion attack against one of your first two
guards, and then pushing you to your third guard using a path
restriction attack (#14917).

Thought that I should mention that because it might be an argument for
both moving to two guards and also lifting some path restrictions...

Roger Dingledine

2018-04-13 06:04:09 UTC

Post by Mike Perry
The main argument for switching to two guards is that because of Tor's
path restrictions, we're already using two guards, but we're using them
in a suboptimal and potentially dangerous way.
Tor's path restrictions enforce the condition that the same node cannot
appear twice in the same circuit, nor can nodes from the same /16 subnet
or node family be used in the same circuit.
Tor's paths are also built such that the exit node is chosen first and
held fixed during guard node choice, as are the IP, HSDIR, and RPs for
onion services. This means that whenever one of these nodes happens to
be the guard[4], or be in the same /16 or node family as the guard, Tor
will build that circuit using a second "primary" guard, as per proposal
271[7].
Worse still, the choice of RP, IP, and exit can all be controlled by an
adversary (to varying degrees), enabling them to force the use of a
second guard at will.
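
The effect described above can be sketched in a few lines. This is a
simplified illustration, not Tor's path selection code: the Relay class,
the helper names, and the example addresses are all hypothetical, and the
family check here is one-sided for brevity.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Relay:
    name: str
    ip: str
    family: frozenset = frozenset()  # names of declared family members


def same_slash16(a, b):
    """True if two IPv4 addresses fall in the same /16 network."""
    net_a = ipaddress.ip_network(f"{a}/16", strict=False)
    net_b = ipaddress.ip_network(f"{b}/16", strict=False)
    return net_a == net_b


def conflicts(guard, other):
    """True if the path restrictions forbid `guard` and `other` together."""
    return (
        guard.name == other.name
        or same_slash16(guard.ip, other.ip)
        or other.name in guard.family   # simplified: either side declaring
        or guard.name in other.family   # the family counts as a conflict
    )


def guard_for_circuit(primaries, fixed_last_hop):
    """First primary guard that does not conflict with the fixed last hop."""
    for guard in primaries:
        if not conflicts(guard, fixed_last_hop):
            return guard
    return None


g1 = Relay("g1", "192.0.2.10")
g2 = Relay("g2", "198.51.100.7")
far_exit = Relay("exitA", "203.0.113.5")
near_exit = Relay("exitB", "192.0.77.1")  # same /16 as g1

print(guard_for_circuit([g1, g2], far_exit).name)   # first guard is kept
print(guard_for_circuit([g1, g2], near_exit).name)  # pushed to second guard
```

Because the last hop is fixed before the guard is chosen, whoever
influences the last hop also influences which guard gets used.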

I agree with you that we should do something about this bug, where Tor
clients will switch to a rarely used guard in some situations. Our fix
from ticket #14917 was not a good fix. More on that below in Section 3.1.

Post by Mike Perry
Not surprisingly,
the two guard adversary gets to compromise clients roughly twice as
quickly, but the timescales are still rather large even for the 10%
adversary: they only have 50% chance of success after 4 rotations, which
will take about 14 months with Tor's 3.5 month guard rotation.
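
As a back-of-the-envelope check on these numbers, one can assume that
each guard pick independently lands on an adversary relay with
probability c, and ignore churn. The function names below are
illustrative, and the proposal's own figures may come from a more
detailed model.

```python
import math


def p_compromised(c, guards, rotations):
    """Chance that at least one of guards * rotations picks is adversarial."""
    return 1.0 - (1.0 - c) ** (guards * rotations)


def rotations_for(target, c, guards):
    """Number of rotations (possibly fractional) to reach `target` probability."""
    return math.log(1.0 - target) / (guards * math.log(1.0 - c))


c = 0.10           # adversary's share of guard selection probability
rotation = 3.5     # months per guard rotation, as stated above

for guards in (1, 2):
    p4 = p_compromised(c, guards, 4)
    r50 = rotations_for(0.5, c, guards)
    print(f"{guards} guard(s): P(after 4 rotations) = {p4:.2f}, "
          f"50% after {r50:.1f} rotations "
          f"(~{math.ceil(r50) * rotation:.1f} months, whole rotations)")
```

Under these assumptions the two-guard client crosses 50% during its
fourth rotation, and the one-guard client takes twice as many rotations.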

Three thoughts here:

(A) You're right, 14 months doesn't sound bad here.

(B) This calculation was ignoring churn, right? That is, guards going
away before you wanted to rotate from them. So another way to phrase that
would be "once eight of your guards have gone away, you're in bad shape"?
Looking at it that way, it seems like two guards is more than twice
as scary as one, since *either* of them going away moves you one step
closer on the path. Not the end of the world, but worth noticing. And
maybe partially solvable by your "when one of your two goes away, stick
to the remaining one" design; more on that below.

(C) Similarly, we should be sure to remember the network adversary
here too. I don't know a simple way to reason about it well. Using more
guards over time could be *less* than twice as scary, because sometimes
the network paths overlap so you don't expose as much new surface area
as you might have. And using more guards over time could be *more*
than twice as scary, if the question is whether your traffic ever goes
over that one bad place, since you have an exponentially low chance to
*never* pick a guard where your traffic to/from that guard travels over
the bad place. It really depends on your location, the guard locations,
the Internet topology, and a bunch of other confusing factors.

Post by Mike Perry
Furthermore, our use of separate directory guards (and three of them)
means that we're not really changing the situation much with the
addition of another regular guard. Right now, directory guard use alone
is enough to track all Tor users across the entire world.

Shit, you're right. The guard set fingerprint issue remains right now,
because we never solved the directory guard side of it. :(

Post by Mike Perry
While the directory guard problem could be fixed[12] (and should be
fixed), it is still the case that another mechanism should be used for
the general problem of guard-vs-location management[9].

The part that freaks me out about all the designs I've seen here is the
attack where the local adversary advertises a series of local wireless
addresses, first to make you keep generating new guard contexts (similar
to forcing quick guard rotation), or second to guess-and-check whether
you've already got a guard context for some wireless address in the next
city over. Maybe it can be solved by proper UI ("we'll just delegate
the decision to the user"), but hoo boy. But that's a separate proposal
fortunately. :)

Post by Mike Perry
3.1. Eliminate path restrictions entirely

I'm increasingly a fan of this option, the more I read these threads.

Let's examine the two attacker assumptions behind two of the attacks
we're worried about.

Attack one: the client's local ISP collects coarse netflow logs, and these
logs aren't detailed enough to allow a traffic volume detection attack on
an existing long-lived TLS flow, so the connection to that first guard
is safe; but a connection to that second guard will be unusual and not
multiplexed and at exactly the time of the adversary-controlled circuit
that triggered it, so that second guard, because it is used so rarely,
is dangerous to use.

Attack two: if the client uses its guard as the first hop of its circuit
and also the adversary-requested fourth hop, then the guard can do
pairwise traffic correlation attacks on all of its circuits and realize
that these two circuits it has are really two pieces of the same circuit.

This second attack seems weird to me. One reason is because in attack
one we're brushing aside the traffic analysis as hard, whereas in attack
two we're assuming it's trivial and perfect. But the simpler reason is:
if your guard is going to participate in a traffic correlation attack
against you, then it could just as easily team up with some other relay
that the adversary picked. That is, avoiding reusing your guard on the
other end of the circuit isn't going to save you if your guard is out
to get you.

Part of why it's hard to compare these two attacks directly is because
one is a client-side-observer adversary and the other is a relay-level
adversary.

Let's look at "attack one" from a relay-level-adversary perspective:
if your first guard is bad, you're screwed already. But if that second
guard might be bad, you really want to do anything you can do to not
reach out to it even once.

And "attack two" from the client-side-observer-level-adversary
perspective: well, if the attacker is watching the *client*, there's
no visible hint that it's reusing its guard later in the path -- and
that's the whole point. But if the attacker is watching the *relay*, then
suddenly we don't have as much diversity of traffic location as we thought
we had. That is, even if your relay is nice, somebody watching the relay's
network could do the pairwise correlation attacks we described earlier.

Another part of what bothers me about attack two -- the one where the
adversary gives you your fourth hop -- is that the adversary has *other*
hops in their side of the circuit, and you don't even know about them.
What if they chose your guard for their middle hop? Or for *their*
guard? There's nothing you can do about those cases, because you can't
know that they're happening. My conclusion is that if we can't solve
significant instances of this attack, we should be wary of paying a
large price to solve only a piece of it.

Post by Mike Perry
If Tor decided to stop enforcing /16, node family, and also allowed the
guard node to be chosen twice in the path, then under normal conditions,
it should retain the use of its primary guard.

To be clear, the design I've been considering here is simply allowing
reuse between the guard hop and the final hop, when it can't be avoided. I
don't mean to allow the guard (or its family) to show up as all four
hops in the path. Is that the same as what you meant, or did you mean
something more thorough?

I think "can't be avoided" means HSDir, IP, RP -- which I note are all
onion service related circuits.

I'd like to hear more about the "cleverly crafted exit policy" attack, and
I wonder if we can't solve that differently. For example, if it's about
making you do a request to a port that only one exit relay allows, and
ha ha whoops your guard was on the same /16 as that exit relay... maybe
it's time for the dir auths to not advertise super rare ports? This was
one of the topics in the users-get-routed paper too.

One non-starter idea would be to move onion-service-related Tors to two
guards, and leave other Tors at one guard. It's a non-starter because of
course advertising which you are to your local network is no good. But
that idea gave me a different perspective on this discussion: I wonder
how much this design decision comes down to making all Tors use two
guards in order to protect the onion-service-related Tors, which are
the only ones who actually need it?

Post by Mike Perry
This approach is not as extreme as it seems on face. In fact, it is hard
to come up with arguments against removing these restrictions. Tor's
/16 restriction is of questionable utility against monitoring, and it can
be argued that since only good actors use node family, it gives influence
over path selection to bad actors in ways that are worse than the benefit
it provides to paths through good actors[10,11].

Yep.

One remaining feature for MyFamily though is that relay operators can say
"No, even though I run these eight relays, I'm not in a position to do
traffic correlation attacks on users, because I told the users to not
put me in that position." This angle of the feature is about protecting
relays, not about protecting clients.

Post by Mike Perry
However, while removing path restrictions will solve the immediate
problem, it will not address other instances where Tor temporarily opts
to use a second guard due to congestion, OOM, or failure of its primary
guard, and we're still running into bugs where this can be adversarially
controlled or just happen randomly[5].

I continue to think we need to fix these. I'm glad to see that George
has been putting some energy into looking more at them. The bugs that
we don't understand are especially worrying, since it's hard to know
how bad they are. Moving to two guards might put a bit of a bandaid on
the issues, but it can't be our long-term plan for fixing them.

I like this general idea of not immediately replacing guards so long as
you have a working one. In fact, we used to do something similar back
in the day:
https://blog.torproject.org/improving-tors-anonymity-changing-guard-parameters
says (emphasis mine)
"""
Tor 0.2.3's entry guard behavior is "choose three guards, ***adding
another one if two of those three go down*** but going back to the
original ones if they come back up, and also throw out (aka rotate)
a guard 4-8 weeks after you chose it."
"""

There are still some fiddly decisions to make here. For example, as you
say we probably shouldn't replace a guard just because we failed to
connect to one of our guards once. We might decide that it's time to add
a new second guard if the consensus tells us that one of them is down
(so we have confirmation that it isn't down for just us, it's down for
everybody). Or we might decide to wait on adding a new one even if it
really is down, because maybe it'll come back soon. But how long do
we wait? And if, while we're down to one, we encounter one of these
situations where the requested fourth hop overlaps with our remaining
guard, what do we do?

In fact, here's a hopefully useful insight that I've just realized:
you're not concerned about one guard vs two guards, you're concerned
about *transitioning* between guards. It's that moment when you're
starting to use a new guard, if the attacker can observe that you're
doing it, and especially if the attacker can make you do it, that is
vulnerable. And starting with two guards can help, in that it postpones
the time until you're forced to transition, and maybe also because if
we do it right it can make the transition less visible.

But I wonder if we're looking at this backwards, and the primary
question we should be asking is "How can we protect the transition between
guards?" Then one of the potential answers to consider is "Maybe we should
start out with two guards rather than just one." Framing it that way,
are there more options that we should consider too? For example, removing
the ability of the non-local attacker to trigger a transition? Then
there would still be visibility of a transition, but the (non-local)
attacker can't impact the timing of the transition. How much does that
solve? Need to think more.

Post by Mike Perry
3.2. No Guard-flagged nodes as exit, RP, IP, or HSDIRs
Similar to 3.1, we could instead forbid the use of Guard-flagged nodes
for the exit, IP, RP, and HSDIR positions.
This solution has two problems: First, like 3.1, it also does not handle
the case where resource exhaustion could force the use of a second
guard. Second, it requires clients to upgrade to the new behavior and
stop using Guard flagged nodes before it can be deployed.

I'm not much of a fan of this approach (it seems so inelegant!), but
I find the two problems that you identified to be unsatisfying for
ruling it out. I wonder if we can find some stronger arguments against
this approach?

Otherwise I might find myself starting to like it. :)

One stronger argument might be: "the attacker can always use Guard-flagged
nodes for other hops on its half of the circuit, and you wouldn't even
be able to know that it's doing it, so if the goal is to never have a
circuit with your guard both at your end and also reused elsewhere in
the circuit, sorry you can't achieve that goal, so stop messing stuff
up while trying to achieve what can only ever be a partial solution."

Post by Mike Perry
4. The future is confluxed
An additional benefit of using a second guard is that it enables us to
eventually use conflux[6].

I think the performance benefits are the main arguments in favor
of doing two guards. In fact, I still think that it's mainly a
performance-vs-safety tradeoff.

I agree with George that moving to two guards now so that we can maybe
do Conflux later is doing it the wrong way round. Since it's so easy
to switch to two guards, that should be one of the very easy steps in
moving to Conflux when we do, and taking the safety hit now in exchange
for the potential performance benefit later doesn't seem best.

But there's another performance argument we shouldn't forget: if you have
two guards, you're much more likely to have at least one guard that's
adequately fast. Right now some of the guards are fast (relative to
others), and some are slow (relative to others). If you get one of the
lower-end guards, your Tor performance is sad -- for months! We tried
to mitigate that issue when we switched to one guard, by raising the
required bandwidth to get the Guard flag, so there would be no truly
terrible guards. But still, some guards are more equal than others.

This issue came up especially in the context of the December/January CPU
overload attacks, where some guards were overwhelmed by circuit creation
requests, and if you had a happy guard, lucky you, but if you had a sad
guard, you might as well delete your Tor Browser and try again.

Now, in an ideal world we should come up with fixes for all of those other
issues, for example by taking the Guard flag away from relays that can't
be great guards. But in the world we live in right now, we can relieve
some of that pressure-to-be-perfect by giving people two guards.

But if we're only going on a performance vs safety basis, I don't see a
huge rush to trade off safety until we have a better handle on what sort
of performance benefits we'd actually get, and until we've compared to
other low-hanging performance fruit.

In summary:

(1) I think we should fix the bug from #14917 where the attacker can
push us off our guard just by naming our guard as the HSDir/IP/RP,
and I think we should fix it by being willing to reuse our guard when
it can't be avoided. That step will resolve some, but not all, of the
pressure about moving to two guards. Then

(2) Hopefully the above discussion has helped us move forward on the
remaining reasons for switching to two guards. To me the two biggest
questions left to resolve are (a) how best to protect the vulnerable
transition to a new guard, and if two guards is the best idea we've got
for that, and (b) how big an issue is it really that having only one
guard can sometimes give you a low-performance guard, and if two guards
is the best idea we've got for that one too.

--Roger

Mike Perry

2018-04-11 11:15:44 UTC

Post by Mike Perry
3.1. Eliminate path restrictions entirely

I agree. I am not concerned about attack two. But we're not choosing
between just these two attacks.

Post by Roger Dingledine
To be clear, the design I've been considering here is simply allowing
reuse between the guard hop and the final hop, when it can't be avoided. I
don't mean to allow the guard (or its family) to show up as all four
hops in the path. Is that the same as what you meant, or did you mean
something more thorough?

By all path restrictions I mean for the last hop of the circuit and the
first (though vanguards would be simpler if we got rid of them for other
hops, too). But I do mean all restrictions, not just guard node choice.
The adversary also gets to force you to use a second network path
whenever they want via the /16 and node family restrictions. And it
happens naturally all the time.

We're not using one guard in the current Tor. We're using two, and the
second one is only used for unmultiplexed activity. That is one property
I don't like about our "let's pretend to use one guard" status quo.

The second thing I don't like is that one guard is fragile, which
enables confirmation attacks when it can be made to go down.

Post by Roger Dingledine
I think "can't be avoided" means HSDir, IP, RP -- which I note are all
onion service related circuits.
I'd like to hear more about the "cleverly crafted exit policy" attack, and
I wonder if we can't solve that differently. For example, if it's about
making you do a request to a port that only one exit relay allows, and
ha ha whoops your guard was on the same /16 as that exit relay... maybe
it's time for the dir auths to not advertise super rare ports? This was
one of the topics in the users-get-routed paper too.

Yes that is the one I was talking about.

However, another way to do this type of exit rotation attack is to cause
a client to look up a DNS name where you control the resolver, and keep
timing out on the DNS response. The client will then retry the stream
request with a new exit. The same thing can also be done by timing out
the TCP handshake to a server you control. Both of these attacks can be
done with only the ability to inject an img tag into a page.

You repeat this until an exit is chosen that is in the same /16 or
family as the guard, and then the client uses a second network path for
an unmultiplexed request at a time you control.
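
To get a feel for how many retries such an attack needs, here is a rough
sketch. It assumes each retry picks an exit independently and that a
fixed fraction p_conflict of exit picks land in the guard's /16 or
family; the value 0.01 is purely hypothetical, not a measurement.

```python
def expected_retries(p_conflict):
    """Mean number of exit picks until one conflicts with the guard."""
    return 1.0 / p_conflict


def p_forced_within(p_conflict, retries):
    """Chance that at least one of `retries` exit picks conflicts."""
    return 1.0 - (1.0 - p_conflict) ** retries


p = 0.01  # hypothetical share of exit picks that conflict with the guard
print(f"expected retries: {expected_retries(p):.0f}")
for k in (3, 30, 300):
    print(f"within {k} retries: {p_forced_within(p, k):.2f}")
```

The attacker only needs the client to keep retrying, so any limit on
retries per stream bounds the success chance per injected request.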

Post by Roger Dingledine
One non-starter idea would be to move onion-service-related Tors to two
guards, and leave other Tors at one guard. It's a non-starter because of
course advertising which you are to your local network is no good. But
that idea gave me a different perspective on this discussion: I wonder
how much this design decision comes down to making all Tors use two
guards in order to protect the onion-service-related Tors, which are
the only ones who actually need it?

Our path restrictions also cause normal exiting clients to use a second
guard for unmultiplexed activity, at adversary controlled times, or just
periodically at random.

We're choosing fixes for these bugs that enable an adversary to deny
service to clients at a particular guard, *without* letting those
clients move to a second guard. This enables confirmation attacks, and
these confirmation attacks can be extended to guard discovery attacks by
DoSing guards one at a time until an onion service fails.

Bringing back CREATE_FAST could help with this piece, I suppose, but it
doesn't solve OOM attacks...

I like this general idea of not immediately replacing guards so long as
you have a working one. In fact, we used to do something similar back
in the day:
https://blog.torproject.org/improving-tors-anonymity-changing-guard-parameters
says (emphasis mine)
"""
Tor 0.2.3's entry guard behavior is "choose three guards, ***adding
another one if two of those three go down*** but going back to the
original ones if they come back up, and also throw out (aka rotate)
a guard 4-8 weeks after you chose it."
"""
There are still some fiddly decisions to make here. For example, as you
say we probably shouldn't replace a guard just because we failed to
connect to one of our guards once. We might decide that it's time to add
a new second guard if the consensus tells us that one of them is down
(so we have confirmation that it isn't down for just us, it's down for
everybody). Or we might decide to wait on adding a new one even if it
really is down, because maybe it'll come back soon. But how long do
we wait? And if, while we're down to one, we encounter one of these
situations where the requested fourth hop overlaps with our remaining
guard, what do we do?

If I were to drop everything to build the Tor I think should exist, I
would do the following:

1. Use two guards, replacing them only when both are unreachable, or
when one leaves the consensus.
2. Make path restrictions not as strict (for cases like the one above).
3. Use conflux (which also needs less strict/no path restrictions)
4. Build it on QUIC.
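
Item 1 can be read as a small decision rule. The sketch below is one
possible reading, not an implementation from Tor: the function name is
invented, and replacing *both* guards when both are unreachable is an
assumption, since the text does not say how many to replace.

```python
def guards_to_replace(guards, reachable, consensus):
    """Return the guards that should be replaced under rule 1.

    - A guard that left the consensus is replaced on its own.
    - Otherwise guards are replaced only if none of them is reachable
      (assumption: in that case all of them are replaced).
    """
    gone = [g for g in guards if g not in consensus]
    if gone:
        return gone
    if not any(g in reachable for g in guards):
        return list(guards)
    return []


guards = ["G1", "G2"]
listed = {"G1", "G2", "G3"}

print(guards_to_replace(guards, {"G2"}, listed))        # one down: keep both
print(guards_to_replace(guards, set(), listed))         # both down: replace
print(guards_to_replace(guards, {"G1", "G2"}, {"G2"}))  # G1 left consensus
```

The key property is the first case: a single unreachable guard never
triggers a visible move to a new guard.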

I would do them in that order because I think we get the most benefit
from #1, and we get some benefit from #2 still (as you point out above).

You keep focusing on the performance aspects of conflux, but that is not
the argument I am making. My arguments for conflux in Section 4 are
about resilience to congestion, downtime, circuit killing, and DoS, as
well as traffic analysis resistance. I see the performance benefits as
secondary.

(I also think the best arguments for QUIC are in the reliability
direction, because fixed queues mean no adversary-provoked OOMing.)

Post by Roger Dingledine
you're not concerned about one guard vs two guards, you're concerned
about *transitioning* between guards. It's that moment when you're
starting to use a new guard, if the attacker can observe that you're
doing it, and especially if the attacker can make you do it, that is
vulnerable. And starting with two guards can help, in that it postpones
the time until you're forced to transition, and maybe also because if
we do it right it can make the transition less visible.

The transition aspect is a big piece of it, but I think we're also
running into a fragility problem, which makes the transition signal very
loud in many cases.

Post by Roger Dingledine
But I wonder if we're looking at this backwards, and the primary
question we should be asking is "How can we protect the transition between
guards?" Then one of the potential answers to consider is "Maybe we should
start out with two guards rather than just one." Framing it that way,
are there more options that we should consider too? For example, removing
the ability of the non-local attacker to trigger a transition? Then
there would still be visibility of a transition, but the (non-local)
attacker can't impact the timing of the transition. How much does that
solve? Need to think more.

One guard is inherently more fragile than two, and no matter what we do,
it means that there will be a risk of attacks that can confirm guard
choice, because the downtime during this transition can never be hidden
without at least some redundancy.

Post by Roger Dingledine
(1) I think we should fix the bug from #14917 where the attacker can
push us off our guard just by naming our guard as the HSDir/IP/RP,
and I think we should fix it by being willing to reuse our guard when
it can't be avoided. That step will resolve some, but not all, of the
pressure about moving to two guards. Then

Without removing all path restrictions that apply to first and last hop,
we're still actually using two guards, and using them at times that the
adversary gets to control if they want, or just randomly otherwise.

Post by Roger Dingledine
(2) Hopefully the above discussion has helped us move forward on the
remaining reasons for switching to two guards. To me the two biggest
questions left to resolve are (a) how best to protect the vulnerable
transition to a new guard, and if two guards is the best idea we've got
for that, and (b) how big an issue is it really that having only one
guard can sometimes give you a low-performance guard, and if two guards
is the best idea we've got for that one too.

Transitions will always be noisy with one guard, because it is fragile
to DoS, congestion, OOM, circuit failure, onionskin overload, etc etc
etc. How can you provide resiliency under arbitrary and partial failure
without any redundancy?

--
Mike Perry

Roger Dingledine

2018-04-18 08:27:51 UTC

By all path restrictions I mean for the last hop of the circuit and the
first (though vanguards would be simpler if we got rid of them for other
hops, too).

Can you lay out for us the things to think about in the Vanguard design?
Last I checked there were quite a few Vanguard design variants, ranging
from "two vanguards per guard, tree style" to some sort of mesh.

In particular, it would be convenient if there is a frontrunner design
that really would benefit from relaxing many path restrictions, and a
frontrunner design that is not so tied together to the path restriction
question.

Post by Mike Perry
But I do mean all restrictions, not just guard node choice.
The adversary also gets to force you to use a second network path
whenever they want via the /16 and node family restrictions.

Can you give us a specific example here, for this phrase "network
path"? When you say "second network path" are you thinking in the
Vanguard world?

Post by Mike Perry
We're not using one guard in the current Tor. We're using two, and the
second one is only used for unmultiplexed activity. That is one property
I don't like about our "let's pretend to use one guard" status quo.

Right, I agree.

Post by Roger Dingledine
I'd like to hear more about the "cleverly crafted exit policy" attack

another way to do this type of exit rotation attack is to cause
a client to look up a DNS name where you control the resolver, and keep
timing out on the DNS response. The client will then retry the stream
request with a new exit. The same thing can also be done by timing out
the TCP handshake to a server you control. Both of these attacks can be
done with only the ability to inject an img tag into a page.
You repeat this until an exit is chosen that is in the same /16 or
family as the guard, and then the client uses a second network path for
an unmultiplexed request at a time you control.

Hm! Yes, this is a yucky one. (I don't think just an img tag would be
enough, because Tor will try a few circuits and then give up. You'd need
some sort of javascript or refresh chain or the like that generates new
addresses and tries them in succession. But that's totally feasible.)
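
As a rough illustration of the condition this attack waits for, here is a minimal Python sketch of a simplified /16 and family check. The relay dictionaries, the function names, and the mutual-family assumption are all hypothetical stand-ins for illustration, not Tor's actual data structures or code.

```python
import ipaddress


def same_slash16(addr_a, addr_b):
    """Return True if two IPv4 addresses fall in the same /16."""
    net = ipaddress.ip_network(f"{addr_a}/16", strict=False)
    return ipaddress.ip_address(addr_b) in net


def conflicts(guard, exit_relay):
    """Simplified path restriction check on hypothetical relay dicts."""
    if guard["id"] == exit_relay["id"]:
        return True
    if same_slash16(guard["addr"], exit_relay["addr"]):
        return True
    # Assumption: a family only counts when both relays list each other.
    return (exit_relay["id"] in guard["family"]
            and guard["id"] in exit_relay["family"])


def guard_for_exit(guards, exit_relay):
    """Return the first guard, in preference order, that the exit allows."""
    for guard in guards:
        if not conflicts(guard, exit_relay):
            return guard
    raise LookupError("every guard conflicts with this exit")


primary = {"id": "G1", "addr": "198.51.100.1", "family": set()}
second = {"id": "G2", "addr": "203.0.113.1", "family": set()}
# An exit the adversary steered the client to, in the primary guard's /16.
forced_exit = {"id": "E1", "addr": "198.51.7.9", "family": set()}
chosen = guard_for_exit([primary, second], forced_exit)
print(chosen["id"])  # G2: the client is pushed onto its second guard
```

The point of the sketch is only that the exit choice, which the adversary can influence by forcing retries, decides which guard ends up carrying the circuit.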

This one is also yucky because we could also imagine a different way to
pick your path, where when you're selecting your exit, you avoid choosing
exits which would conflict with your guard, and thus you'll never be
pushed off of your guard. But then the destination website can do this
same attack over time and notice which exit you never try to use. So
this is a case where to blend in best, we *need* to be willing to use
all of the potential exits.

But since normal exit circuits are three hops, if we simply relax the
path restrictions, we could be making a circuit of the form "A - B - A",
which would not only stand out as weird to B, but actually right now a
relay in B's position will refuse such a circuit. Bad news all around.

The three fixes that come to mind are

(A) "Have two guards": so you can pick any exit you like, and then just
use the guard that doesn't conflict with the exit you picked.

(B) "Add a bonus hop when needed": First relax the /16 and family
restrictions, so the remaining issue is reuse of your guard. Then if
you find that you just chose your guard as your exit, insert an extra
hop in the middle of that circuit.

(C) "Exits can't be Guards": First relax the /16 and family restrictions,
so the remaining issue is reuse of your guard. Then notice that due
to exit scarcity, guards aren't actually used in the exit position
anyway. Then enforce that rule (so they can't be in the future either).

All three of these choices have downsides. But all three of them look
like improvements over the current situation -- because of how crappy
the current situation is.
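
To make option (B) concrete, here is a small Python sketch under the assumption that the /16 and family checks are already relaxed, so the only remaining conflict is the guard being chosen as the exit. The function names and the use of plain strings as relay identifiers are hypothetical.

```python
import random


def build_exit_path(guard, exit_relay, middles, rng=random):
    """Return a relay path; adds a fourth hop only when guard == exit."""
    candidates = [m for m in middles if m not in (guard, exit_relay)]
    if guard != exit_relay:
        return [guard, rng.choice(candidates), exit_relay]
    # Guard was picked as the exit: insert a bonus hop so that no
    # middle relay sees the same relay on both sides ("A - B - A").
    first, second = rng.sample(candidates, 2)
    return [guard, first, second, exit_relay]


def sees_loop(path):
    """True if some hop has the same relay as predecessor and successor."""
    return any(path[i - 1] == path[i + 1] for i in range(1, len(path) - 1))


path = build_exit_path("A", "A", ["A", "B", "C", "D"], random.Random(1))
print(len(path), sees_loop(path))
```

With the bonus hop the path has the form "A - B - C - A", so neither middle relay is asked to extend back to the relay it just received the circuit from.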

(Rejected option (D): "Just start allowing it": Relax the /16 and
family restrictions, and also relax the rule where relays refuse a
circuit that goes right back where it came from. Giving the middle node
that much information about the circuit just wigs me out.)

Also, notice that I think Mike's proposed design will turn out to be some
combination of "A" and also something like "B" or "C", because even if
you start with two guards, if you don't add a new guard right when your
first guard goes down, you might find yourself in the situation where
you have one working guard, and you pick it as your exit, and now you
need to do *something*.

Post by Mike Perry
Our path restrictions also cause normal exiting clients to use a second
guard for unmultiplexed activity, at adversary-controlled times, or just
periodically at random.

Just to make sure I understand: at least on the current network,
that's because of the /16 rule and the family rule, and not because of
the "if the exit you picked turns out to be your guard too, move to a
different guard" rule, because exits aren't normally used for guards on
our current network?

On more examination though, that's not something to rely on with our
current design, since I bet there are weird edge cases like a relay
loses its Guard flag, but it's still your Guard so you keep using it
(depending on the advice del año from #17773), but now the weightings
let you pick it for your Exit, and oops.

Another problematic example would be a relay that you picked as your
Guard, and later it opened up its exit policy and became an Exit.

So if I wanted to try to flesh out my "Then enforce that rule" approach
above, we would need to (1) Have dir auths take away the Guard flag from
relays that can be used as Exits, and (2) Make sure that clients know
that if their guards lose the Guard flag, they should treat them as being
no longer guardworthy. I think we're doing that second one right now,
based on my latest reading of #17773, so this would actually be a pretty
easy change. But still, it's not exactly elegant.
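
A minimal Python sketch of those two steps, assuming relay flags are modeled as plain sets of strings and the consensus as a dictionary from relay id to flags. Both function names are hypothetical, and this is not how the directory authorities or clients are actually implemented.

```python
def assign_flags(flags):
    """Step (1), dir auth side: a relay usable as an Exit loses Guard."""
    if "Exit" in flags:
        return set(flags) - {"Guard"}
    return set(flags)


def still_guardworthy(guard_id, consensus):
    """Step (2), client side: keep a guard only while it has the flag."""
    flags = consensus.get(guard_id)
    return flags is not None and "Guard" in flags


voted = {
    "R1": {"Guard", "Stable"},
    "R2": {"Guard", "Exit"},  # e.g. a guard that opened its exit policy
}
consensus = {relay: assign_flags(flags) for relay, flags in voted.items()}
print(still_guardworthy("R1", consensus))  # True
print(still_guardworthy("R2", consensus))  # False
```

Under these assumptions, a guard that later becomes an Exit drops out of the client's guard list at the next consensus instead of ending up in both positions.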

I would find non-onion-service examples more compelling here, since I
want to avoid falling back into the "well, onion services need special
treatment to be safe, so we have to choose between hurting normal clients
and hurting onion services" trap.

How is this for an alternative scenario to be considering: the attacking
website gives the Tor Browser user some page content that causes the
browser to initiate periodic events. Then it starts congesting guards
one at a time until the events stop arriving.

Are those two scenarios basically equivalent in terms of the confirmation
attacks you are worrying about? I hope yes, and now I can stop getting
distracted by wondering if going to this effort is worth it only to
protect onion services? :)

Post by Mike Perry
You keep focusing on the performance aspects of conflux, but that is not
the argument I am making. My arguments for conflux in Section 4 are
about resilience to congestion, downtime, circuit killing, and DoS, as
well as traffic analysis resistance. I see the performance benefits as
secondary.

I like conflux in theory, but somebody needs to do the other 90%
of the work to make it a concrete thing that we can consider.

I continue to think "Tor should switch to two guards, because one day
we should design and deploy conflux" is a terrible reason to switch to
two guards now.

So I didn't mean to mix the conflux discussion and the performance
discussion. I meant to mostly ignore the conflux discussion (because it
is a future proposal, not this one), while also making sure that we don't
forget the potential performance benefits of having two guards in general.

How's this for another option: clients have two guards, but they have
a first guard and a backup guard. They do the traffic padding to both
of them, to ensure continuous netflow sessions in their local ISP's
logs. But they try to send most of their traffic over the first guard,
thus avoiding most of the "increased surface area" concerns about using
two guards at once. And we try to reduce the frequency of situations where
they can't use their first guard. But in the "transition" situations
that we decide we need to keep, they use their backup guard, and it's
already available and ready and that netflow session is already active
in the eyes of their ISP.

This approach isn't conflux (yet), but it's not incompatible with later
changing things so we do conflux.

It also doesn't get us the lower variance of performance that having
two equally used guards would get us. But I am ok with that for now,
at least until somebody has done some performance analysis to show that
we're really suffering now and we would stop suffering then.

It adds load onto the relays, by almost doubling the number of sockets
used by guards for clients, and also by adding more bandwidth load from
the padding cells to/from the backup guard. (How much bandwidth load is
this, per client?)

And it doesn't actually provide as much "real" cover traffic onto the
backup guard in most situations, so somebody who can look more thoroughly
at the traffic flows will still be able to distinguish a transition
event from the first to the backup. Maybe that's a problem? Or maybe
the netflow level adversary that we declared in the threat model can't
do that, and a real attacker would be able to see the traffic details
anyway, so we're fine^W^Wno worse off than before?

Assuming this design meets all of our goals, let's examine two variants
of it to make sure we understand what we're actually trading off. In
particular, consider a design where we maintain (and pad) these two
connections, vs a design where we maintain a connection to our first
guard and then launch a connection to the backup guard on demand. The
downside of keeping the backup connection open is the extra network-wide
socket and bandwidth load on relays, while the downsides of launching
a connection on demand are the risk that a local netflow-level ISP can
see when we transition to using the backup guard, plus the risk that a
remote attacker who can cripple guards will be able to notice the delay
in the "launch on demand" case but could not distinguish the delay in
the "two connections" case.

That second risk doesn't seem so scary to me, since local handshakes
should be a small fraction of the overall time it takes to build and use
a new circuit. But above you say "the downtime during this transition can
never be hidden without at least some redundancy", so if you think this
risk is scary, I'd like to hear more details about why. (Maybe the design
you were concerned about was one where we just freeze in place and fail
when we don't want to use our first guard? I agree, that's a bad design,
and we can do better, for example by "be willing to use the second guard".)

Whereas that first risk does seem plausible to me -- worth trying to
reduce. I think we should start by enumerating as many scary scenarios
as we can (where scary means "currently we would shift away from our
first guard"), and then fix as many of them as we can. Then we should
look at the remaining scenarios where we would switch over to using our
backup guard (like, when our first guard isn't able to build new circuits
for us), and decide if the cost of the additional load on the network is
worth hiding that transition timing from a netflow-level client-side-ISP
adversary. I can see the answer being "yes, it's worth it", but I think it
will be useful to have a good handle on which transition scenarios remain.

--Roger

Mike Perry

2018-04-18 23:31:26 UTC

By all path restrictions I mean for the last hop of the circuit and the
first (though vanguards would be simpler if we got rid of them for other
hops, too).

Can you lay out for us the things to think about in the Vanguard design?
Last I checked there were quite a few Vanguard design variants, ranging
from "two vanguards per guard, tree style" to some sort of mesh.
In particular, it would be convenient if there is a frontrunner design
that really would benefit from relaxing many path restrictions, and a
frontrunner design that is not so tied together to the path restriction
question.

There are two frontrunner forms. One has no path restrictions, the other
would try to perform restriction checks on each layer to ensure that it
is valid and doesn't leak info about other layers or prevent circuit
creation.

They are otherwise the same. Both are mesh; both are tunable in the
number of guards and rotation times in each layer.

I am leaning towards "no restrictions" for vanguards for 0.3.4 because
it is simpler, and it did not strike me that the arguments in their
favor justified trying to implement them quickly in a way that might
cause reachability or path influence risks.

Can you give us a specific example here, for this phrase "network
path"? When you say "second network path" are you thinking in the
Vanguard world?

Second path to entry into the Tor network (and a second guard),
regardless of vanguards.

Post by Roger Dingledine
I'd like to hear more about the "cleverly crafted exit policy" attack

The three fixes that come to mind are
(A) "Have two guards": so you can pick any exit you like, and then just
use the guard that doesn't conflict with the exit you picked.
(B) "Add a bonus hop when needed": First relax the /16 and family
restrictions, so the remaining issue is reuse of your guard. Then if
you find that you just chose your guard as your exit, insert an extra
hop in the middle of that circuit.
(C) "Exits can't be Guards": First relax the /16 and family restrictions,
so the remaining issue is reuse of your guard. Then notice that due
to exit scarcity, guards aren't actually used in the exit position
anyway. Then enforce that rule (so they can't be in the future either).
All three of these choices have downsides. But all three of them look
like improvements over the current situation -- because of how crappy
the current situation is.
(Rejected option (D): "Just start allowing it": Relax the /16 and
family restrictions, and also relax the rule where relays refuse a
circuit that goes right back where it came from. Giving the middle node
that much information about the circuit just wigs me out.)
Also, notice that I think Mike's proposed design will turn out to be some
combination of "A" and also something like "B" or "C", because even if
you start with two guards, if you don't add a new guard right when your
first guard goes down, you might find yourself in the situation where
you have one working guard, and you pick it as your exit, and now you
need to do *something*.

The one-guard-down case does impact things. But even when this does
happen (which should be rare), it should only be true for a small window
of time before the consensus updates.

The "down" guard should either be temporarily overloaded, or fully down
and kicked off the consensus. I think we should only add a new guard
when one falls out of the consensus, or both are unreachable/unusable.

This is why I think it is OK to take an incremental approach and
start with A, and roll out things like B and C and other restriction
relaxations.

During these edge cases, the most important property that we should
strive to preserve is overall reachability. I don't like situations
where the adversary gains information by certain nodes being overloaded
or down. In my view, trying to make smart decisions to minimize exposure
to more nodes is secondary to overall reachability. (A loss of overall
reachability allows a *non-network* adversary to gain information about
how clients are using our network. That strikes me as a lower resource,
more dangerous attack than the unknown risk of possible partial network
observers. In other words, I believe we made the right short-term call
in #14917 in terms of preserving reachability.)

Post by Mike Perry
Our path restrictions also cause normal exiting clients to use a second
guard for unmultiplexed activity, at adversary-controlled times, or just
periodically at random.

I am in favor of preventing guards from being exits. Intuitively, it
means fewer "one stop shop" surveillance points to see both entry and
exit traffic. It also makes flag-based load balancing equations much
simpler, and makes it easier to account for padding overhead.

Post by Roger Dingledine
So if I wanted to try to flesh out my "Then enforce that rule" approach
above, we would need to (1) Have dir auths take away the Guard flag from
relays that can be used as Exits, and (2) Make sure that clients know
that if their guards lose the Guard flag, they should treat them as being
no longer guardworthy. I think we're doing that second one right now,
based on my latest reading of #17773, so this would actually be a pretty
easy change. But still, it's not exactly elegant.

In the world where we keep path restrictions, these would be my rules:
1. Two equal guards, chosen from different /16s and families.
2. Choose the members of each vanguard layer such that each layer has at
least one node from a unique /16 and family.
3. Build paths in a strict order, from last hop towards guard. If you
can't build a path with this ordering, start over with a sampled guard.
(With rule #1 and #2, this should be very rare and should mean that
a guard is marked down locally but still marked up in the consensus.)
4. No guards as exits (Not needed but do it anyway for other reasons).

Then under these rules, you decide to use a new guard when:
0. A guard leaves the consensus: replace it with a new primary guard.
1. Your two primaries are locally down or unusable (i.e. step #3 above
fails): temporarily pick a new guard.
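
The two replacement conditions can be sketched in Python as follows. This is a simplified illustration: guards are plain string ids, the /16 and family checks from rule 1 are omitted, and `update_guards` and `make_sampler` are hypothetical names, with `make_sampler` standing in for Tor's weighted guard sampling.

```python
def make_sampler(consensus):
    """Return a deterministic stand-in for weighted guard sampling."""
    def sample_guard(exclude):
        # Raises ValueError if no candidate is left; fine for a sketch.
        return min(g for g in consensus if g not in exclude)
    return sample_guard


def update_guards(primaries, consensus, locally_down, sample_guard):
    """Return (primaries, temporary_guard) after applying both conditions."""
    # Condition 0: drop primaries that left the consensus and refill to two.
    kept = [g for g in primaries if g in consensus]
    while len(kept) < 2:
        kept.append(sample_guard(exclude=set(kept)))
    # Condition 1: only when both primaries are locally down or unusable
    # is a temporary guard picked; the primaries themselves are kept.
    temporary = None
    if all(g in locally_down for g in kept):
        temporary = sample_guard(exclude=set(kept) | set(locally_down))
    return kept, temporary


consensus = {"G1", "G2", "G3", "G4"}
sampler = make_sampler(consensus)
print(update_guards(["G1", "G9"], consensus, set(), sampler))
print(update_guards(["G1", "G2"], consensus, {"G1", "G2"}, sampler))
```

A single locally down primary triggers neither condition here, which matches the idea that the client simply keeps using the other primary guard.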

I would find non-onion-service examples more compelling here, since I
want to avoid falling back into the "well, onion services need special
treatment to be safe, so we have to choose between hurting normal clients
and hurting onion services" trap.
How is this for an alternative scenario to be considering: the attacking
website gives the Tor Browser user some page content that causes the
browser to initiate periodic events. Then it starts congesting guards
one at a time until the events stop arriving.
Are those two scenarios basically equivalent in terms of the confirmation
attacks you are worrying about? I hope yes, and now I can stop getting
distracted by wondering if going to this effort is worth it only to
protect onion services? :)

Yes.