[tor-dev] Why do DirAuths take that long to update a relay's version information?
nusenu
2018-01-10 13:43:00 UTC
Hi,

The goal of this email is to avoid a false-positive warning for relay operators
on atlas, but the root cause might be in core tor.

background:
I really liked it when irl added the big red warning to atlas for tor relays
running an outdated (i.e. not "recommended") tor version,
because it actually prompted operators to upgrade, an important step toward a healthier network.
The problem is that this big red banner on atlas has false positives, which confuse operators [0].

Originally this was an onionoo bug, which
was fixed in v1.8.0, but it is happening again, and Karsten had the feeling
that tor dir auths do not update the version information of a relay after it
has upgraded (and uploaded a new descriptor). I looked into one example and can confirm what Karsten suggested [1].

Let me show you that example of FP 1283EBDEEC2B9D745F1E7FBE83407655B984FD66.
Data has been provided by Karsten and is available here: [2].

That relay was running 0.3.0.10, upgraded to 0.3.0.13, and uploaded its first
descriptor with 0.3.0.13 on:

2018-01-09 10:14:00,server,,0.3.0.13

Except for bastet, the dir auths did not care and still said this relay runs
0.3.0.10:

2018-01-09 11:00:00,consensus,,0.3.0.10
2018-01-09 11:00:00,vote,bastet,0.3.0.13 <<<< note
2018-01-09 11:00:00,vote,dannenberg,0.3.0.10
2018-01-09 11:00:00,vote,dizum,0.3.0.10
2018-01-09 11:00:00,vote,Faravahar,0.3.0.10
2018-01-09 11:00:00,vote,gabelmoo,0.3.0.10
2018-01-09 11:00:00,vote,longclaw,0.3.0.10
2018-01-09 11:00:00,vote,maatuska,0.3.0.10
2018-01-09 11:00:00,vote,moria1,0.3.0.10
2018-01-09 11:00:00,vote,tor26,0.3.0.10
2018-01-09 12:00:00,consensus,,0.3.0.10
2018-01-09 12:00:00,vote,bastet,0.3.0.13 <<<<<<
2018-01-09 12:00:00,vote,dannenberg,0.3.0.10
2018-01-09 12:00:00,vote,dizum,0.3.0.10
2018-01-09 12:00:00,vote,Faravahar,0.3.0.10
2018-01-09 12:00:00,vote,gabelmoo,0.3.0.10
2018-01-09 12:00:00,vote,longclaw,0.3.0.10
2018-01-09 12:00:00,vote,maatuska,0.3.0.10
2018-01-09 12:00:00,vote,moria1,0.3.0.10
2018-01-09 12:00:00,vote,tor26,0.3.0.10

Even 6 hours later this was unchanged.
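
A minimal Python sketch for tallying the votes per consensus period, using the 11:00 rows above as sample input. The column meaning (timestamp, document type, authority, version) is inferred from the rows shown, not from a documented format, and the "<<<<" annotations are left out. It only counts votes; it does not reproduce how the dir auths actually compute the consensus value.

```python
import csv
from collections import Counter, defaultdict
from io import StringIO

# Sample rows copied from the 11:00 period above.
# Assumed columns: timestamp, document type, authority, version.
SAMPLE = """\
2018-01-09 11:00:00,consensus,,0.3.0.10
2018-01-09 11:00:00,vote,bastet,0.3.0.13
2018-01-09 11:00:00,vote,dannenberg,0.3.0.10
2018-01-09 11:00:00,vote,dizum,0.3.0.10
2018-01-09 11:00:00,vote,Faravahar,0.3.0.10
2018-01-09 11:00:00,vote,gabelmoo,0.3.0.10
2018-01-09 11:00:00,vote,longclaw,0.3.0.10
2018-01-09 11:00:00,vote,maatuska,0.3.0.10
2018-01-09 11:00:00,vote,moria1,0.3.0.10
2018-01-09 11:00:00,vote,tor26,0.3.0.10
"""


def tally_votes(text):
    """Return {timestamp: Counter({version: number of votes})}."""
    tallies = defaultdict(Counter)
    for timestamp, kind, _authority, version in csv.reader(StringIO(text)):
        if kind == "vote":  # skip "consensus" and "server" rows
            tallies[timestamp][version] += 1
    return dict(tallies)


tallies = tally_votes(SAMPLE)
for timestamp, counts in sorted(tallies.items()):
    print(timestamp, dict(counts))
```

For the 11:00 period this gives eight votes for 0.3.0.10 and one (bastet) for 0.3.0.13.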

Then the operator upgraded from 0.3.0.13 to 0.3.1.9
and uploaded the first descriptor with that version:

2018-01-09 16:39:01,server,,0.3.1.9

This remained "unnoticed" by all dir auths until
longclaw voted for the new version:

2018-01-09 23:00:00,consensus,,0.3.0.10
2018-01-09 23:00:00,vote,bastet,0.3.0.10
2018-01-09 23:00:00,vote,dannenberg,0.3.0.10
2018-01-09 23:00:00,vote,dizum,0.3.0.10
2018-01-09 23:00:00,vote,Faravahar,0.3.0.10
2018-01-09 23:00:00,vote,gabelmoo,0.3.0.10
2018-01-09 23:00:00,vote,longclaw,0.3.1.9 <<<<<
2018-01-09 23:00:00,vote,maatuska,0.3.0.10
2018-01-09 23:00:00,vote,moria1,0.3.0.10
2018-01-09 23:00:00,vote,tor26,0.3.0.10

On 2018-01-10 02:38:07 the relay uploaded a second descriptor with
v0.3.1.9 and almost all dir auths agreed immediately:

2018-01-10 02:38:07,server,,0.3.1.9
2018-01-10 03:00:00,consensus,,0.3.1.9
2018-01-10 03:00:00,vote,bastet,0.3.0.10
2018-01-10 03:00:00,vote,dannenberg,0.3.1.9
2018-01-10 03:00:00,vote,dizum,0.3.1.9
2018-01-10 03:00:00,vote,Faravahar,0.3.1.9
2018-01-10 03:00:00,vote,gabelmoo,0.3.1.9
2018-01-10 03:00:00,vote,longclaw,0.3.1.9
2018-01-10 03:00:00,vote,maatuska,0.3.1.9
2018-01-10 03:00:00,vote,moria1,0.3.1.9
2018-01-10 03:00:00,vote,tor26,0.3.1.9


So it took the operator about 17 hours to convince enough
dir auths that the relay had been upgraded.
I can see multiple reasons why this can make sense (as the tor version
is actually not that relevant consensus data) but maybe it was
not clear what the side effects of not updating that field are.
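
The 17-hour figure can be checked from the timestamps quoted above. This is a small sketch that measures the delay from each descriptor upload to the first consensus listing 0.3.1.9; the variable names are only for illustration.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Timestamps taken from the rows quoted in this email.
first_upgrade_upload = datetime.strptime("2018-01-09 10:14:00", FMT)   # 0.3.0.13
second_upgrade_upload = datetime.strptime("2018-01-09 16:39:01", FMT)  # 0.3.1.9
consensus_updated = datetime.strptime("2018-01-10 03:00:00", FMT)      # consensus says 0.3.1.9

since_first = consensus_updated - first_upgrade_upload
since_second = consensus_updated - second_upgrade_upload

print(since_first)   # 16:46:00
print(since_second)  # 10:20:59
```

Measured from the first upgrade, the consensus lagged by 16 hours 46 minutes; measured from the first 0.3.1.9 descriptor, by a bit over 10 hours.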

While I believe there is still another onionoo issue,
this should also be improved.

Thoughts?



[0] http://lists.nycbug.org/pipermail/tor-bsd/2018-January/000620.html
[1] https://trac.torproject.org/projects/tor/ticket/22488#comment:11
[2] https://trac.torproject.org/projects/tor/attachment/ticket/22488/task-22488-relay-versions.csv.gz
--
https://mastodon.social/@nusenu
twitter: @nusenu_
teor
2018-01-10 22:48:36 UTC
Post by nusenu
Hi,
The goal of this email is to avoid a false-positive warning for relay operators
on atlas, but the root cause might be in core tor.
I really liked it when irl added the big red warning to atlas for tor relays
running an outdated (i.e. not "recommended") tor version,
because it actually prompted operators to upgrade, an important step toward a healthier network.
The problem is that this big red banner on atlas has false positives, which confuse operators [0].
Originally this was an onionoo bug, which
was fixed in v1.8.0, but it is happening again, and Karsten had the feeling
that tor dir auths do not update the version information of a relay after it
has upgraded (and uploaded a new descriptor). I looked into one example and can confirm what Karsten suggested [1].
I have opened a feature request for consensus-health to show per-relay
versions in the details and overlap tables:

https://trac.torproject.org/projects/tor/ticket/24862

Unfortunately, consensus-health does not parse descriptors, so we will
have to rely on at least one authority picking up the new version. But
it's a start, and it will help us monitor the fix and any regressions.
I've looked at the Tor source code that handles versions. Version
parsing and voting seem to happen unconditionally.

I also checked router_differences_are_cosmetic(), and it seems to
handle platform string changes correctly.

So maybe the issue is in the descriptor fetching and updating logic?
How many authorities received the new descriptor?
Did any of the other fields in the vote change when the new descriptor
was updated?

Can we get logs from the relays that are affected by this issue, so we
can see how many authorities they uploaded to?
Can we get logs from some authorities so we see how they handled the
new descriptor?
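
Until such logs are available, the CSV data alone can at least show which authorities are still voting an older version after a new descriptor was published. The following is a rough Python sketch, assuming the rows are in chronological order and that the columns mean timestamp, document type, authority, version (inferred from the rows quoted in this thread). It cannot tell whether an authority actually received the descriptor; that still needs the logs.

```python
import csv
from io import StringIO

# Rows copied from the thread (annotations removed), in chronological order.
ROWS = """\
2018-01-09 16:39:01,server,,0.3.1.9
2018-01-09 23:00:00,vote,bastet,0.3.0.10
2018-01-09 23:00:00,vote,dannenberg,0.3.0.10
2018-01-09 23:00:00,vote,dizum,0.3.0.10
2018-01-09 23:00:00,vote,Faravahar,0.3.0.10
2018-01-09 23:00:00,vote,gabelmoo,0.3.0.10
2018-01-09 23:00:00,vote,longclaw,0.3.1.9
2018-01-09 23:00:00,vote,maatuska,0.3.0.10
2018-01-09 23:00:00,vote,moria1,0.3.0.10
2018-01-09 23:00:00,vote,tor26,0.3.0.10
2018-01-10 02:38:07,server,,0.3.1.9
2018-01-10 03:00:00,vote,bastet,0.3.0.10
2018-01-10 03:00:00,vote,dannenberg,0.3.1.9
2018-01-10 03:00:00,vote,dizum,0.3.1.9
2018-01-10 03:00:00,vote,Faravahar,0.3.1.9
2018-01-10 03:00:00,vote,gabelmoo,0.3.1.9
2018-01-10 03:00:00,vote,longclaw,0.3.1.9
2018-01-10 03:00:00,vote,maatuska,0.3.1.9
2018-01-10 03:00:00,vote,moria1,0.3.1.9
2018-01-10 03:00:00,vote,tor26,0.3.1.9
"""


def lagging_authorities(text):
    """Map each vote time to the authorities whose vote differs from
    the version in the most recently published server descriptor."""
    latest = None  # version from the latest "server" row seen so far
    lagging = {}
    for timestamp, kind, authority, version in csv.reader(StringIO(text)):
        if kind == "server":
            latest = version
        elif kind == "vote" and latest is not None:
            lagging.setdefault(timestamp, [])
            if version != latest:
                lagging[timestamp].append(authority)
    return lagging


result = lagging_authorities(ROWS)
for timestamp, authorities in result.items():
    print(timestamp, authorities)
```

On these rows, eight authorities (all except longclaw) lag at 23:00, and only bastet lags at 03:00.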

It might also help to open a core tor ticket to track this.
nusenu
2018-01-11 00:26:00 UTC
Post by teor
I've looked at the Tor source code that handles versions. Version
parsing and voting seem to happen unconditionally.
I also checked router_differences_are_cosmetic(), and it seems to
handle platform string changes correctly.
I'll need to read a spec to know what "correctly" exactly means here.
Post by teor
So maybe the issue is in the descriptor fetching and updating logic?
How many authorities received the new descriptor?
Did any of the other fields in the vote change when the new descriptor
was updated?
Can we get logs from the relays that are affected by this issue, so we
can see how many authorities they uploaded to?
Can we get logs from some authorities so we see how they handled the
new descriptor?
I'll collect affected relays on the trac ticket.
Post by teor
It might also help to open a core tor ticket to track this.
https://trac.torproject.org/projects/tor/ticket/24864
--
https://mastodon.social/@nusenu
twitter: @nusenu_