[tor-dev] onionoo.tpo stuck at 2018-01-21 22:00

Post by nusenu
Hi Karsten,

Hi nusenu,

Post by nusenu
just wanted to let you know that the delta between
relays_published and current time is unusually high.
https://onionoo.torproject.org/details?limit=0
{"version":"5.0",
"build_revision":"0bce98a",
"relays_published":"2018-01-21 22:00:00",
This is currently blocking ornetradar reports.

Looks like the primary CollecTor instance had a problem between 22:00
and 08:00 UTC. It works again now, as does Onionoo.

We didn't lose any data, because the primary CollecTor instance obtained
all descriptors it had missed earlier from the backup CollecTor instance.

Post by nusenu
thanks for having a look,

Thanks for the report!

Post by nusenu
nusenu

All the best,
Karsten

nusenu

2018-01-22 17:57:00 UTC

Post by Karsten Loesing
Looks like the primary CollecTor instance had a problem between 22:00
and 08:00 UTC. It works again now, as does Onionoo.

Karsten, thanks for the fast reaction.

Post by Karsten Loesing
We didn't lose any data, because the primary CollecTor instance obtained
all descriptors it had missed earlier from the backup CollecTor instance.

Since I'm archiving onionoo data, I'm "losing" data (causing blind spots) every time a "relays_published"
timestamp is skipped. In theory one could spin up an onionoo instance to generate data for the skipped
timestamps, but in practice this is hard (it requires lots of resources).
(I know you are probably talking about not losing any raw CollecTor data, but I wanted to mention it
nonetheless.)
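
To illustrate what I mean by blind spots, here is a rough sketch of how skipped timestamps could be listed from an archive. It assumes that "relays_published" advances in one-hour steps and that the archived values are available as plain strings; the function name and the sample values are made up for illustration and are not part of Onionoo.

```python
from datetime import datetime, timedelta

# Timestamp format used by the "relays_published" field.
FMT = "%Y-%m-%d %H:%M:%S"


def missing_timestamps(archived, step=timedelta(hours=1)):
    """Return the timestamps expected between archived entries but absent.

    Assumption: a new "relays_published" value appears once per `step`
    (hourly here). Adjust `step` if that does not hold.
    """
    seen = sorted({datetime.strptime(ts, FMT) for ts in archived})
    missing = []
    for earlier, later in zip(seen, seen[1:]):
        expected = earlier + step
        while expected < later:
            missing.append(expected.strftime(FMT))
            expected += step
    return missing


# Hypothetical archive with a two-hour blind spot.
archived = [
    "2018-01-21 21:00:00",
    "2018-01-21 22:00:00",
    "2018-01-22 01:00:00",
]
gaps = missing_timestamps(archived)
print(gaps)
```

This only reports which hours are missing; it cannot recreate the Onionoo documents for them.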

Do you monitor onionoo for such problems ("relays_published" timestamp remaining unchanged for >1-2 hours)?
Would you find something like that useful?

Thanks for keeping it running besides all the other things you do.

I'm wondering if the admin team would be available to cover such cases to reduce
the operations load for developers.

kind regards,
nusenu

--
https://mastodon.social/@nusenu
twitter: @nusenu_

Karsten Loesing

2018-01-24 14:50:33 UTC

Hi nusenu,

Post by Karsten Loesing
Looks like the primary CollecTor instance had a problem between 22:00
and 08:00 UTC. It works again now, as does Onionoo.

Karsten, thanks for the fast reaction.

Post by Karsten Loesing
We didn't lose any data, because the primary CollecTor instance obtained
all descriptors it had missed earlier from the backup CollecTor instance.

Right, I meant not losing any raw CollecTor data. Your use case of
archiving Onionoo data is special. It's okay that you do this, but it's
not what Onionoo was designed for. Most people will find Onionoo data
that is 6 or 12 hours behind still useful. But if we had lost 6 or 12
hours of CollecTor data, that would have been pretty bad.

What we can do, though, is think about providing more history in
Onionoo, so that you can give up on archiving Onionoo data. After all,
Onionoo already provides quite some history, including graph data like
in bandwidth documents and others, times when a relay last changed its
IP address or port, the time it was first seen, and so on. If you have
ideas for what else would be valuable to have history for, please open a ticket.

Post by nusenu
Do you monitor onionoo for such problems ("relays_published" timestamp remaining unchanged for >1-2 hours)?
Would you find something like that useful?

We do have such monitoring, yes. Here's the Nagios script we're using:

https://gitweb.torproject.org/admin/tor-nagios.git/tree/tor-nagios-checks/checks/tor-check-onionoo
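
For readers who want to run a similar check themselves, the following is a minimal sketch of the idea. It is not the Nagios script linked above. It assumes that "relays_published" is given in UTC and uses a two-hour threshold taken from the ">1-2 hours" in the question; the function names are hypothetical.

```python
import json
from datetime import datetime, timedelta, timezone
from urllib.request import urlopen

URL = "https://onionoo.torproject.org/details?limit=0"
FMT = "%Y-%m-%d %H:%M:%S"


def staleness(document, now=None):
    """Return how far "relays_published" lags behind `now`.

    Assumption: the timestamp in the document is in UTC.
    """
    published = datetime.strptime(document["relays_published"], FMT)
    published = published.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return now - published


def is_stale(document, max_age=timedelta(hours=2), now=None):
    """True if the document is older than `max_age` (assumed threshold)."""
    return staleness(document, now) > max_age


def fetch(url=URL):
    """Download and parse the Onionoo response (performs a network request)."""
    with urlopen(url, timeout=30) as response:
        return json.load(response)


# Offline example using the values reported at the start of this thread.
sample = {"version": "5.0", "relays_published": "2018-01-21 22:00:00"}
checked_at = datetime(2018, 1, 22, 8, 0, tzinfo=timezone.utc)
print(staleness(sample, now=checked_at), is_stale(sample, now=checked_at))

# Live check (not run here because it needs network access):
# print(is_stale(fetch()))
```

Keeping the comparison separate from the download makes it possible to test the threshold logic without contacting the server.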

Post by nusenu
Thanks for keeping it running besides all the other things you do.
I'm wondering if the admin team would be available to cover such cases to reduce
the operations load for developers.

The admin team already handles operational issues with the hosts, though
the metrics team is still in charge of running the services. I think
that's a fine separation, and it has worked quite well for the last
couple of years.

Post by nusenu
kind regards,
nusenu

All the best,
Karsten

nusenu

2018-01-26 10:34:00 UTC

Post by Karsten Loesing
What we can do, though, is think about providing more history in
Onionoo, so that you can give up on archiving Onionoo data.

It is nice of you to consider that, but it is not necessary (at least for me).
I can live with my current hacks, and others probably don't need more history,
and you already have enough stuff on your plate.

Post by nusenu
Do you monitor onionoo for such problems ("relays_published" timestamp remaining unchanged for >1-2 hours)?
Would you find something like that useful?

We do have such monitoring, yes.

So my email was redundant with your Nagios check?

Would it be possible to publish these alerts on a mailing list? :)

--
https://mastodon.social/@nusenu
twitter: @nusenu_

Karsten Loesing

2018-01-26 10:54:07 UTC

Post by Karsten Loesing
What we can do, though, is think about providing more history in
Onionoo, so that you can give up on archiving Onionoo data.

Okay.

Post by nusenu
Do you monitor onionoo for such problems ("relays_published" timestamp remaining unchanged for >1-2 hours)?
Would you find something like that useful?

We do have such monitoring, yes.

So my email was redundant with your Nagios check?
Would it be possible to publish these alerts on a mailing list? :)

Not a crazy idea! I opened a ticket to discuss this further:

https://trac.torproject.org/projects/tor/ticket/25035

All the best,
Karsten

nusenu

2018-02-03 00:32:00 UTC

thanks for looking into it

--
https://mastodon.social/@nusenu
twitter: @nusenu_

Karsten Loesing

2018-02-03 08:10:22 UTC

Post by nusenu
thanks for looking into it

Looks like the CollecTor host is down, along with several other hosts. I
sent mail to the admins.

All the best,
Karsten

nusenu

2018-02-03 11:53:00 UTC

Post by nusenu
thanks for looking into it

Looks like the CollecTor host is down, along with several other hosts. I
sent mail to the admins.

Does that imply that we are actually losing raw CollecTor data until it comes back?

--
https://mastodon.social/@nusenu
twitter: @nusenu_

Karsten Loesing

2018-02-03 12:10:59 UTC

Post by nusenu
thanks for looking into it

Looks like the CollecTor host is down, along with several other hosts. I
sent mail to the admins.

Does that imply that we are actually losing raw CollecTor data until it comes back?

No, we still have the backup CollecTor host that downloads Tor
descriptors and that the primary CollecTor host will sync from once it
comes back.

All the best,
Karsten

nusenu

2018-02-03 12:12:00 UTC

Post by nusenu
thanks for looking into it

Looks like the CollecTor host is down, along with several other hosts. I
sent mail to the admins.

Does that imply that we are actually losing raw CollecTor data until it comes back?

No, we still have the backup CollecTor host that downloads Tor
descriptors and that the primary CollecTor host will sync from once it
comes back.

great!

--
https://mastodon.social/@nusenu
twitter: @nusenu_

nusenu

2018-02-04 15:08:00 UTC