Discussion: [tor-dev] Understanding the guard/md issue (#21969)
George Kadianakis
2017-10-28 14:19:34 UTC
Hey Tim,

just wanted to ask a clarifying question wrt #21969.

First of all there are various forms of #21969 (aka the "missing
descriptors for some of our primary entry guards" issue). Sometimes it
occurs for 10 mins and then goes away, whereas for other people it
disables their service permanently (until restart). I call this the
hardcore case of #21969. It has happened to me and disabled my service
for days, and I've also seen it happen to other people (e.g. dgoulet).

So. We have found various md-related bugs and put them as children of
#21969. Do you think we have found the bugs that can cause the hardcore
case of #21969? That is, is any of these bugs (or a bug combo) capable
of permanently disabling an onion service?

It seems to me that all the bugs identified so far can only cause #21969
to occur for a few hours before it heals itself. IIUC, even the
most fundamental bugs like #23862 and #23863 are only temporary, since
eventually one of the dirguards will fetch the missing mds and give them
to the client. Do you think that's the case?

I'm asking you because I plan to spend some serious time next week on
#21969-related issues, and I'd like to prioritize between bug hunting
and bug fixing. That is, if the root cause of the hardcore case of
#21969 is still out there, I'd like to continue bug hunting until I find
it.

Let me know what you think! Perhaps you have other ideas here of how we
should approach this issue.

Cheers!! :)

PS: Sending this as an email since our timezones are making it kind of hard
to sync up on IRC.
teor
2017-10-28 21:59:02 UTC
Post by George Kadianakis
Hey Tim,
just wanted to ask a clarifying question wrt #21969.
First of all there are various forms of #21969 (aka the "missing
descriptors for some of our primary entry guards" issue). Sometimes it
occurs for 10 mins and then goes away, whereas for other people it
disables their service permanently (until restart). I call this the
hardcore case of #21969. It has happened to me and disabled my service
for days, and I've also seen it happen to other people (e.g. dgoulet).
So. We have found various md-related bugs and put them as children of
#21969. Do you think we have found the bugs that can cause the hardcore
case of #21969? That is, is any of these bugs (or a bug combo) capable
of permanently disabling an onion service?
Yes, this bug is disabling:

#23862, where we don't update guard state unless we have enough
directory info.

When tor gets in a state where it doesn't have enough directory info
due to another bug, this makes sure it will never get out of that state.
Because it will never mark its directory guards as up when it gets a
new consensus, and therefore it will never fetch microdescs, find out
it has enough directory info, and build circuits.
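
To illustrate what I mean, here's a toy sketch of the deadlock (this is
not actual tor code; all the names here are invented just to show the
shape of the logic):

    /* Hypothetical sketch of the #23862 deadlock -- not real tor code. */
    #include <stdio.h>

    static int have_enough_dir_info = 0; /* stuck at 0: a guard's md is missing */
    static int dir_guards_marked_up = 0;

    /* Called whenever a new consensus arrives. */
    static void
    on_new_consensus(void)
    {
      /* The bug: we bail out when we don't have enough directory info ... */
      if (!have_enough_dir_info)
        return;

      /* ... so we never reach the only code path that would re-mark the
       * dir guards as up, fetch the missing mds, and flip the flag. */
      dir_guards_marked_up = 1;
      have_enough_dir_info = 1;
    }

    int
    main(void)
    {
      for (int hour = 0; hour < 48; hour++)
        on_new_consensus();             /* two days of new consensuses */
      printf("guards up: %d, enough dir info: %d\n",
             dir_guards_marked_up, have_enough_dir_info); /* prints 0, 0 */
      return 0;
    }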

That's why I made sure we fixed it as soon as possible. I'm glad it's in
the latest alpha.

And this (and a few of the other #21969 children) makes it happen:

#23817, where we keep trying directory guards even though they don't
have the microdescriptors we want, on an exponential backoff.

Because it causes tor to only check for new microdescriptors after a
very long time (days or weeks), which means the microdescs can expire
before they are refreshed.
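
To give a feel for the timescale, here's a rough back-of-the-envelope
sketch (the 1-minute base delay and the doubling are just illustrative;
tor's real schedule is randomized) of how fast the retry interval blows
past the lifetime of a microdesc:

    /* Illustrative backoff arithmetic, not tor's actual schedule. */
    #include <stdio.h>

    int
    main(void)
    {
      double delay = 60.0;   /* assume a 1-minute initial retry interval */
      double elapsed = 0.0;
      for (int attempt = 1; attempt <= 15; attempt++) {
        elapsed += delay;
        printf("attempt %2d: wait %9.0f s  (%.1f days elapsed)\n",
               attempt, delay, elapsed / 86400.0);
        delay *= 2.0;        /* assume the interval roughly doubles */
      }
      return 0;
    }

By attempt 15 the wait between tries is already more than 11 days, and
more than three weeks have passed in total: squarely in the "days or
weeks" range where microdescs can expire before they are refreshed.
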
Post by George Kadianakis
It seems to me that all the bugs identified so far can only cause #21969
to occur for a few hours before it self-heals itself. IIUC, even the
most fundamental bugs like #23862 and #23863 are only temporarily, since
eventually one of the dirguards will fetch the missing mds and give them
to the client. Do you think that's the case?
No, the current set of bugs can block microdesc fetches forever.
And even if they do happen eventually, "eventually" on an exponential backoff
is indistinguishable from "forever" over short time frames. (This is by design,
it's part of the definition of an exponential backoff.)
Post by George Kadianakis
I'm asking you because I plan to spend some serious time next week on
#21969-related issues, and I'd like to prioritize between bug hunting
and bug fixing. That is, if the root cause of the hardcore case of
#21969 is still out there, I'd like to continue bug hunting until I find
it.
Let me know what you think! Perhaps you have other ideas here of how we
should approach this issue.
Fix #23817 by implementing a failure cache and going to a fallback if all
primary guards fail. I think that would be a solution for #23863 as well.

And if a few fallbacks don't have the guard's microdesc, mark the guard as
down. It's likely its microdesc is just not on the network for some reason.
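
Very roughly, I'm imagining something like this (a sketch only; the
struct, names and thresholds are all made up, this is not an actual
patch):

    /* Hypothetical failure-cache sketch for #23817/#23863. */
    #include <stdio.h>

    #define N_PRIMARY 3
    #define MAX_FALLBACK_MISSES 3

    struct md_failure_cache {
      int primary_failed[N_PRIMARY]; /* primary i failed to serve the md */
      int fallback_misses;           /* fallbacks that also lacked the md */
    };

    static int
    all_primaries_failed(const struct md_failure_cache *c)
    {
      for (int i = 0; i < N_PRIMARY; i++)
        if (!c->primary_failed[i])
          return 0;
      return 1;
    }

    /* Decide where the next fetch for this microdesc should go. */
    static const char *
    next_fetch_target(const struct md_failure_cache *c)
    {
      if (!all_primaries_failed(c))
        return "primary guard";      /* keep trying the primaries first */
      if (c->fallback_misses < MAX_FALLBACK_MISSES)
        return "fallback directory"; /* primaries exhausted: try fallbacks */
      return "mark guard down";      /* md probably isn't on the network */
    }

    int
    main(void)
    {
      /* All primaries have already failed to serve this guard's md. */
      struct md_failure_cache c = { {1, 1, 1}, 0 };
      for (int i = 0; i < 4; i++) {
        printf("fetch %d -> %s\n", i, next_fetch_target(&c));
        c.fallback_misses++;         /* pretend each fallback missed too */
      }
      return 0;
    }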

Fix #23985, the 10-minute wait when we have fewer than 15 microdescs,
by changing it to an exponential backoff. Otherwise, if we handle it
specially when it includes our primary guards, clients will leak that their
primary guards are in this small set. (And if we're using an exponential
backoff, the failure cache from #23817 will kick in, so we'll check
fallbacks, then mark the primary guard down.)
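
Something like this is what I have in mind (again just a sketch with
invented names, not a patch): the point is that the retry schedule
depends only on the attempt count, never on whether the missing mds
include our primary guards.

    /* Sketch: one retry schedule for every microdesc download, so the
     * "fewer than 15 mds missing" case isn't a special, observable case. */
    #include <stdio.h>

    static double
    next_retry_delay(int attempt)
    {
      double delay = 60.0;            /* illustrative base delay */
      for (int i = 0; i < attempt; i++)
        delay *= 2.0;                 /* illustrative doubling */
      return delay;
    }

    int
    main(void)
    {
      /* Old behaviour (roughly): < 15 missing mds => always wait 600 s.
       * Sketched behaviour: the wait depends only on the attempt count. */
      for (int attempt = 0; attempt < 5; attempt++)
        printf("attempt %d: wait %.0f s\n", attempt, next_retry_delay(attempt));
      return 0;
    }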

After that, I'd put these fixes out in an alpha, and wait and see if the issue
happens again.

T
George Kadianakis
2017-10-30 11:30:44 UTC
Post by teor
Post by George Kadianakis
Hey Tim,
just wanted to ask a clarifying question wrt #21969.
First of all there are various forms of #21969 (aka the "missing
descriptors for some of our primary entry guards" issue). Sometimes it
occurs for 10 mins and then goes away, whereas for other people it
disables their service permanently (until restart). I call this the
hardcore case of #21969. It has happened to me and disabled my service
for days, and I've also seen it happen to other people (e.g. dgoulet).
So. We have found various md-related bugs and put them as children of
#21969. Do you think we have found the bugs that can cause the hardcore
case of #21969? That is, is any of these bugs (or a bug combo) capable
of permanently disabling an onion service?
Thanks for the reply, Tim.
Post by teor
#23862, where we don't update guard state unless we have enough
directory info.
When tor gets in a state where it doesn't have enough directory info
due to another bug, this makes sure it will never get out of that state.
Because it will never mark its directory guards as up when it gets a
new consensus, and therefore it will never fetch microdescs, find out
it has enough directory info, and build circuits.
Hmm, just want to make sure I get this.

My understanding of #23862 is that Tor would never mark its directory
guards as up like you say, but it _would still_ fetch microdescs using
fallback directories because of the way
directory_pick_generic_dirserver() works. Isn't that the case?
teor
2017-10-30 12:31:41 UTC
--
Tim / teor

PGP C855 6CED 5D90 A0C5 29F6 4D43 450C BA7F 968F 094B
ricochet:ekmygaiu4rzgsk6n
Post by George Kadianakis
Post by teor
Post by George Kadianakis
Hey Tim,
just wanted to ask a clarifying question wrt #21969.
First of all there are various forms of #21969 (aka the "missing
descriptors for some of our primary entry guards" issue). Sometimes it
occurs for 10 mins and then goes away, whereas for other people it
disables their service permanently (until restart). I call this the
hardcore case of #21969. It has happened to me and disabled my service
for days, and I've also seen it happen to other people (e.g. dgoulet).
So. We have found various md-related bugs and put them as children of
#21969. Do you think we have found the bugs that can cause the hardcore
case of #21969? That is, is any of these bugs (or a bug combo) capable
of permanently disabling an onion service?
Thanks for the reply, Tim.
Post by teor
#23862, where we don't update guard state unless we have enough
directory info.
When tor gets in a state where it doesn't have enough directory info
due to another bug, this makes sure it will never get out of that state.
Because it will never mark its directory guards as up when it gets a
new consensus, and therefore it will never fetch microdescs, find out
it has enough directory info, and build circuits.
Hmm, just want to make sure I get this.
My understanding of #23862 is that Tor would never mark its directory
guards as up like you say, but it _would still_ fetch microdescs using
fallback directories because of the way
directory_pick_generic_dirserver() works. Isn't that the case?
No, because we're not actually marking those guards as down (#23863).
I think we might be putting them in a partly usable guard state instead.
(Fallbacks are only used when all directory guards or mirrors are down.)
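
As a sketch of what I mean (illustrative only, invented names, not the
real selection code):

    /* Sketch: fallbacks are a last resort, so a guard stuck in a
     * "partly usable" state still keeps us away from the fallbacks. */
    #include <stdio.h>

    enum guard_state { GUARD_DOWN, GUARD_PARTLY_USABLE, GUARD_UP };

    static const char *
    pick_dir_source(const enum guard_state *guards, int n)
    {
      for (int i = 0; i < n; i++)
        if (guards[i] != GUARD_DOWN)   /* partly usable still counts */
          return "directory guard";
      return "fallback directory";     /* only when every guard is down */
    }

    int
    main(void)
    {
      enum guard_state guards[] =
        { GUARD_PARTLY_USABLE, GUARD_PARTLY_USABLE, GUARD_PARTLY_USABLE };
      printf("fetching from: %s\n", pick_dir_source(guards, 3));
      return 0;
    }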

And so we keep trying guards until we back off for a long time (#23817).

Which means that some of our microdescs start expiring or changing in the
consensus. Which triggers #23862, which we fixed in 0.3.2.3-alpha.

And we don't reset the right download state when we get an application
request (#23620). Which makes it hard for tor to recover from this bug.

I might be wrong or missing a few of the details, but those bugs are
enough to cause the issues we're seeing. Hopefully we can find any
remaining bugs when we fix these ones.

This set of bugs is actually better than the alternative, which is clients
trying too fast and DDoSing relays. If we hadn't implemented exponential
backoff, these retries could have caused either a slow or fast DDoS.

T
George Kadianakis
2017-10-30 12:34:45 UTC
Post by teor
<snip>
Let me know what you think! Perhaps you have other ideas here of how we
should approach this issue.
Fix #23817 by implementing a failure cache and going to a fallback if all
primary guards fail. I think that would be a solution for #23863 as well.
And if a few fallbacks don't have the guard's microdesc, mark the guard as
down. It's likely its microdesc is just not on the network for some reason.
Hey Tim!

Please see
https://trac.torproject.org/projects/tor/ticket/23817#comment:6 for an
implementation plan of the failure cache concept. If it makes sense to
you, I will try to implement it this week.

Cheers!
George Kadianakis
2017-11-15 13:14:48 UTC
Post by George Kadianakis
Hey Tim,
OK updates here.

We merged #23895 and #23862 into 0.3.2 and master.

#23817 is now in needs_review and hopefully will get into the next 0.3.2 alpha.
I think this next alpha should be much better in terms of mds.

Next tickets in terms of importance should probably be #23863 and #24113.
I have questions/feedback on both of them, and I'm ready to dive in.

Cheers!
