Discussion:
[tor-dev] connectivity failure for top 100 relays
dawuud
2018-03-13 02:55:12 UTC
Out of 9900 possible two-hop tor circuits among the top 100 tor relays,
only 935 circuit builds succeeded. This is way worse than the last
report I sent, six months ago during the Montreal tor dev meeting.


Here's the scanner I use:

https://github.com/david415/tor_partition_scanner

(I was planning to improve this testing methodology in collaboration with
Katharina Kohls, but was unable to travel to Bochum University because of
visa limitations. It was either the tor-dev meeting or Bochum, but not both.)

Here's the gist of my simple testing methodology:

https://gist.github.com/david415/9875821652018431dd6d6c4407bb90c0#file-detect_tor_network_partitions
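
In short: for every ordered pair of relays in the list, ask tor to build
a two-hop circuit and record whether the build succeeds, fails, or times
out. A rough sketch of the idea (illustration only: the real scanner is
built on txtorcon, and the relay-list file format here is an assumption):

# Sketch only: the real scanner uses txtorcon; stem is used here for brevity.
from itertools import permutations
from stem.control import Controller

with Controller.from_port(port=9051) as controller:
    controller.authenticate()
    # assumed file format: one fingerprint per line
    relays = [line.split()[0] for line in open("top100.relays")]
    results = {}
    for first, second in permutations(relays, 2):  # 100 * 99 = 9900 ordered pairs
        try:
            # await_build blocks until the circuit is built, raising on failure
            controller.new_circuit([first, second], await_build=True)
            results[(first, second)] = "success"
        except Exception as exc:
            results[(first, second)] = "failure: %s" % exc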

Here's exactly how I performed the scan to get those results:

wget https://collector.torproject.org/recent/relay-descriptors/consensuses/2018-03-13-01-00-00-consensus

./helpers/query_fingerprints_from_consensus_file.py 2018-03-13-01-00-00-consensus > top100.relays


detect_partitions.py --tor-control tcp:127.0.0.1:9051 --log-dir ./ --status-log ./status_log \
--relay-list top100.relays --secret secretTorEmpireOfRelays --partitions 1 --this-partition 0 \
--build-duration .25 --circuit-timeout 60 --log-chunk-size 1000 --max-concurrency 100


echo "select first_hop, second_hop from scan_log where status = 'failure';" | sqlite3 scan1.db | wc -l
8942

echo "select first_hop, second_hop from scan_log where status = 'timeout';" | sqlite3 scan1.db | wc -l
23

echo "select first_hop, second_hop from scan_log where status = 'success';" | sqlite3 scan1.db | wc -l
935
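
(For what it's worth, those three counts can also come from one query; a
minimal sketch with Python's sqlite3, assuming the scan_log schema above:)

import sqlite3

db = sqlite3.connect("scan1.db")
# one pass instead of three: count rows per status value
for status, n in db.execute("select status, count(*) from scan_log group by status"):
    print(status, n)
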
teor
2018-03-13 07:43:49 UTC
Post by dawuud
Out of 9900 possible two-hop tor circuits among the top 100 tor relays,
only 935 circuit builds succeeded. This is way worse than the last
report I sent, six months ago during the Montreal tor dev meeting.
How much worse?

And where did you scan *from*?
(It's hard to interpret the results without the latency and quality of your
client connection.)

Also, we have just deployed defences to exactly this kind of rapid circuit
or connection building by a single client. I wonder if your client triggered
those defences. The circuit defences would likely cause timeouts, and
the connection defences would likely cause failures.

I also wonder if your client triggered custom defences on some relays.
Post by dawuud
https://github.com/david415/tor_partition_scanner

https://gist.github.com/david415/9875821652018431dd6d6c4407bb90c0#file-detect_tor_network_partitions
wget https://collector.torproject.org/recent/relay-descriptors/consensuses/2018-03-13-01-00-00-consensus
./helpers/query_fingerprints_from_consensus_file.py 2018-03-13-01-00-00-consensus > top100.relays
detect_partitions.py --tor-control tcp:127.0.0.1:9051 --log-dir ./ --status-log ./status_log \
--relay-list top100.relays --secret secretTorEmpireOfRelays --partitions 1 --this-partition 0 \
--build-duration .25 --circuit-timeout 60 --log-chunk-size 1000 --max-concurrency 100
You might get better results if you scan more slowly.
Try to stay under 1 circuit every 3 seconds to each relay from
your IP address. Try to stay under 50 connections to the same
relay from your IP address.

I'm going from memory, check the Tor man page, dir-spec, and
the consensus for the latest DDoS parameter values.
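
As a sketch of what that pacing might look like in a scanner (illustrative
names only, not a tested implementation):

# Pacing sketch: at most one circuit every 3 seconds touching any given relay.
import time

MIN_GAP = 3.0   # seconds between circuits touching the same relay
last_used = {}  # fingerprint -> monotonic timestamp of last use

def wait_for_pair(first, second):
    while True:
        now = time.monotonic()
        gap = min(now - last_used.get(fp, 0.0) for fp in (first, second))
        if gap >= MIN_GAP:
            break
        time.sleep(MIN_GAP - gap)
    last_used[first] = last_used[second] = time.monotonic()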

T
dawuud
2018-03-13 13:24:03 UTC
Post by teor
How much worse?
During the Montreal tor dev meeting I counted 1947 circuit build failures.
https://lists.torproject.org/pipermail/tor-project/2017-October/001492.html
Post by teor
And where did you scan *from*?
I scanned from a server in the Netherlands.
Post by teor
(It's hard to interpret the results without the latency and quality of your
client connection.)
I can record latency. What do you mean by quality? I mean... I'm not using these
circuits to actually send and receive stuff.
Post by teor
Also, we have just deployed defences to exactly this kind of rapid circuit
or connection building by a single client. I wonder if your client triggered
those defences. The circuit defences would likely cause timeouts, and
the connection defences would likely cause failures.
Aha! That might explain the terrible results; hopefully it's not that
network health has gotten worse in the last six months.
Post by teor
I also wonder if your client triggered custom defences on some relays.
I doubt it. I am not making sequential circuits to the same relays. The
relays chosen for each circuit build are generated from a shuffle.
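Roughly like the following sketch, though the scanner's actual scheme may
differ (the file format and seeding here are assumptions):

# Illustrative only: deterministic shuffle of all ordered relay pairs,
# seeded from the shared --secret.
import hashlib
import random
from itertools import permutations

relays = [line.split()[0] for line in open("top100.relays")]  # assumed format
pairs = list(permutations(relays, 2))
random.Random(hashlib.sha256(b"secretTorEmpireOfRelays").digest()).shuffle(pairs)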
Post by teor
You might get better results if you scan more slowly.
Try to stay under 1 circuit every 3 seconds to each relay from
OK. I will try this. The scan will take longer but hopefully produce
more accurate and useful results.
Post by teor
your IP address. Try to stay under 50 connections to the same
relay from your IP address.
Hmm, OK. I can limit the number of concurrent circuits being built, but
I do not believe that txtorcon lets me control the number of
"connections" that little-t tor makes.
Post by teor
I'm going from memory, check the Tor man page, dir-spec, and
the consensus for the latest DDoS parameter values.
meejah
2018-03-13 15:01:01 UTC
Post by dawuud
Post by teor
your IP address. Try to stay under 50 connections to the same
relay from your IP address.
Hmm, OK. I can limit the number of concurrent circuits being built, but
I do not believe that txtorcon lets me control the number of
"connections" that little-t tor makes.
I *think* they should be equivalent? Controllers can't control
everything Tor does, though (for example, Tor can decide to set up
circuits to fetch things or do its own measurements).

Related to this might be my own scanner; I keep 20 circuits in-flight at
any one time and am using random guards so it's "very unlikely" I'd even
have two connections to the same first hop at the same time. However, I
don't do anything about timing -- maybe we can take up this discussion
in an IRC channel?
--
meejah
dawuud
2018-03-13 14:48:12 UTC
Post by teor
And where did you scan *from*?
(It's hard to interpret the results without the latency and quality of your
client connection.)
It turns out I am recording circuit build latency. It is unclear to
me exactly what you'd like me to do with this information; however,
here are some silly queries:

echo "select (end_time - start_time) / 1000 as duration from scan_log where duration > 50 AND duration < 60;" | sqlite3 scan1.db
55.2818120117187
51.7696379394531
59.9406301269531

echo "select (end_time - start_time) / 1000 as duration from scan_log where duration > 40 AND duration < 50;" | sqlite3 scan1.db
41.0546398925781
40.1456608886719
48.2474660644531

echo "select (end_time - start_time) / 1000 as duration from scan_log where duration > 30 AND duration < 40;" | sqlite3 scan1.db
31.6949631347656
34.8123491210938
37.0733110351563
36.2936791992188

echo "select (end_time - start_time) / 1000 as duration from scan_log where duration > 20 AND duration < 30;" | sqlite3 scan1.db
29.2628620605469
28.2720109863281

echo "select (end_time - start_time) / 1000 as duration from scan_log where duration > 10 AND duration < 20;" | sqlite3 scan1.db
13.4959392089844
14.6635520019531
19.32987109375
14.2355910644531
13.9277241210937
13.3795317382812
12.9024929199219
12.3480061035156
11.711751953125
10.2423110351563
11.0780610351562
18.3046040039062

echo "select (end_time - start_time) / 1000 as duration from scan_log where duration > 3 AND duration < 10;" | sqlite3 scan1.db
8.98835498046875
3.93438012695312
4.10946020507812
9.21181396484375
8.1195078125
6.78396508789062
5.28444775390625
3.59763989257813

echo "select (end_time - start_time) / 1000 as duration from scan_log where duration > 1 AND duration < 3;" | sqlite3 scan1.db
2.05169384765625
1.69050805664062
1.86933813476563
2.22057397460937
1.82368383789063
2.53436987304688
1.80827685546875

echo "select (end_time - start_time) / 1000 as duration from scan_log where duration < 1;" | sqlite3 scan1.db | wc -l
9837
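
(All of the bucketing above could also be done in one query; a sketch with
Python's sqlite3, same schema assumptions as before:)

import sqlite3

db = sqlite3.connect("scan1.db")
# histogram of build durations, rounded down to whole seconds
q = ("select cast((end_time - start_time) / 1000 as integer) as secs, count(*) "
     "from scan_log group by secs order by secs")
for secs, n in db.execute(q):
    print("%3d s: %d" % (secs, n))
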
meejah
2018-03-13 15:52:14 UTC
Post by teor
And where did you scan *from*?
(It's hard to interpret the results without the latency and quality of your
client connection.)
If I correctly understand what David's scanner is doing, so long as "a"
connection can make it to the first hop properly, any other failure is
"the Tor network's fault", isn't it? (I mean, unless the first-hop
connection is so crappy it sometimes just times out or drops.)

To me the important thing here would be to do the scans consistently
from the same network-vantage point and then at least subsequent scans
can be compared more consistently (right?).

For my scans (which are 3-hop) I re-try failing combinations up to 5
times before completely giving up -- but still fail to scan a bunch of
relays. These tests *do* fetch real data, though, so there's a lot more
opportunity for "bad things" to happen which aren't a problem of "the
Tor network" necessarily.
--
meejah
Roger Dingledine
2018-03-13 08:47:51 UTC
Post by dawuud
Out of 9900 possible two-hop tor circuits among the top 100 tor relays,
only 935 circuit builds succeeded. This is way worse than the last
report I sent, six months ago during the Montreal tor dev meeting.
The next step here would be to try to debug your results, to understand
if it's actually an issue with the Tor network (in which case, what
exactly is the issue), or if it's a bug in your scripts.

Teor asked some good questions.

Other questions I'd want to investigate:

(A) Are the failures consistent, or intermittent? That is, does a
failed link always fail, or only sometimes?

(B) Are you really sure that it failed? I would guess that 'failed'
is different from 'timeout' because it got an explicit destroy back?
If so, don't destroy cells have 'reason' components? Which reasons are
happening most commonly?

(C) We should find a link that is failing between two relays that we
both control, and look at each one more closely to see if there are any
hints. For example, is there anything in the logs? If we turn up the
logging, do we get any hints then?

(D) ...which leads to: we should run this same tool on the test network
that teor and dgoulet et al run, and look for failures there. Assuming we
find some, since there are no users on the test network, we can investigate
much more thoroughly.

(E) I wonder if there's a correlation between the failed links and
whether a TLS connection is already established on that link. That is,
when there is no connection already, there are many more steps that
need to be taken to extend the circuit, and those steps could lead to
increased failure rates, either due to the extra time that is needed,
or because part of tor's link handshake (NETINFO, etc) is going wrong.
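
One way to gather that signal would be to snapshot tor's open OR
connections right before each build; a sketch using stem's GETINFO
orconn-status (illustrative, not part of the existing scanner):

# Sketch: record which relays already have an established OR connection,
# so failures can later be correlated with "connection already open".
from stem.control import Controller

with Controller.from_port(port=9051) as controller:
    controller.authenticate()
    connected = set()
    for line in controller.get_info("orconn-status").splitlines():
        # lines look like "$FINGERPRINT~nickname CONNECTED"
        if line.endswith("CONNECTED"):
            connected.add(line.split()[0].lstrip("$").split("~")[0])
    print(len(connected), "relays with an established OR connection")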

And a last point: this tool, and these investigations, are exactly in
scope for the "network health" topic that the network team has been
discussing as one of the key open areas that need more attention.

--Roger
dawuud
2018-03-13 13:45:56 UTC
Post by Roger Dingledine
(A) Are the failures consistent, or intermittent? That is, does a
failed link always fail, or only sometimes?
Yes, this is what our new testing methodology should support.
My current scanner is not sufficient; we want to improve it.
Post by Roger Dingledine
(B) Are you really sure that it failed? I would guess that 'failed'
is different from 'timeout' because it got an explicit destroy back?
If so, don't destroy cells have 'reason' components? Which reasons are
happening most commonly?
Yes, I am sure it failed. It would be cool if txtorcon could expose the
'reason' but I think that it cannot. I suppose it will show up in the
tor log file if I set it to debug logging.
Post by Roger Dingledine
(C) We should find a link that is failing between two relays that we
both control, and look at each one more closely to see if there are any
hints. For example, is there anything in the logs? If we turn up the
logging, do we get any hints then?
Sounds good. I would certainly be willing to collaborate with Teor or anyone
else who might like to help with this.
Post by Roger Dingledine
(D) ...which leads to: we should run this same tool on the test network
that teor and dgoulet et al run, and look for failures there. Assuming we
find some, since there are no users on the test network, we can investigate
much more thoroughly.
Sounds good. Let me know if there is anything I can do to help with this.
Post by Roger Dingledine
(E) I wonder if there's a correlation between the failed links and
whether a TLS connection is already established on that link. That is,
when there is no connection already, there are many more steps that
need to be taken to extend the circuit, and those steps could lead to
increased failure rates, either due to the extra time that is needed,
or because part of tor's link handshake (NETINFO, etc) is going wrong.
Ah yes, this is another good question for which I currently do not have an answer.
meejah
2018-03-13 15:11:53 UTC
Post by dawuud
Yes I am sure it failed. It would be cool if txtorcon can expose the
'reason' but I think that it cannot. I suppose it will show up in the
tor log file if I set it to debug logging.
txtorcon does expose both the 'reason' and the 'remote_reason' flags
returned by the failure messages. In fact, it returns all flags that Tor
sent during stream or circuit failures.

The **kwargs in stream_closed, circuit_closed or circuit_failed
notifications should all include "REASON" and many times will also
include "REMOTE_REASON" (e.g. if the "other" relay closed the
connection). For convenience, txtorcon also includes lower-cased
versions of all the flags.
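
A minimal sketch of hooking those notifications (assuming a control port
on 9051; not a complete scanner):

# Sketch: log REASON / REMOTE_REASON whenever a circuit fails.
from twisted.internet import reactor
from twisted.internet.endpoints import TCP4ClientEndpoint
import txtorcon

class FailureLogger(txtorcon.CircuitListenerMixin):
    def circuit_failed(self, circuit, **kw):
        # Tor's flags arrive in kw, e.g. REASON and (sometimes) REMOTE_REASON
        print(circuit.id, kw.get("REASON"), kw.get("REMOTE_REASON"))

def attach(state):
    state.add_circuit_listener(FailureLogger())

d = txtorcon.build_tor_connection(TCP4ClientEndpoint(reactor, "127.0.0.1", 9051))
d.addCallback(attach)
reactor.run()
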
Post by dawuud
Post by Roger Dingledine
(C) We should find a link that is failing between two relays that we
both control, and look at each one more closely to see if there are any
hints. For example, is there anything in the logs? If we turn up the
logging, do we get any hints then?
Sounds good. I would certainly be willing to collaborate with Teor or anyone
else who might like to help with this.
I'm +1 here too. I'd like to better understand the failures I see in my
scanner as well.
Post by dawuud
Post by Roger Dingledine
(E) I wonder if there's a correlation between the failed links and
whether a TLS connection is already established on that link. That
is, when there is no connection already, there are many more steps
that need to be taken to extend the circuit, and those steps could
lead to increased failure rates, either due to the extra time that is
needed, or because part of tor's link handshake (NETINFO, etc) is
going wrong.
Ah yes, this is another good question for which I currently do not have an answer.
Would it be better, then, to pick one first hop and scan (sequentially)
every second-hop using that first hop? (And maybe have say 5 or 10 such
things going on at once?)
--
meejah
dawuud
2018-03-13 23:48:30 UTC
I did another scan, this time with 3 seconds between each circuit
build and the max connections set to 50, with results similar to
yesterday's:

9354 failure
2 timeout
544 success

Most of the circuit build failures happened in under a second:

echo "select (end_time - start_time) / 1000 as duration from scan_log where duration < 1 AND status = 'failure';" | sqlite3 scan1.db | wc -l
9344
Post by meejah
txtorcon does expose both the 'reason' and the 'remote_reason' flags
returned by the failure messages. In fact, it returns all flags that Tor
sent during stream or circuit failures.
The **kwargs in stream_closed, circuit_closed or circuit_failed
notifications should all include "REASON" and many times will also
include "REMOTE_REASON" (e.g. if the "other" relay closed the
connection). For convenience, txtorcon also includes lower-cased
versions of all the flags.
Ah, OK! I will take a look at this. I'd like to do another scan
while collecting this additional information.
Post by meejah
Would it be better, then, to pick one first hop and scan (sequentially)
every second-hop using that first hop? (And maybe have say 5 or 10 such
things going on at once?)
Maybe it's OK to make 7,000+ tor circuits sequentially through the same
first hop if it's done very slowly?
dawuud
2018-04-27 21:12:59 UTC
Greetings,

(
Meejah and I made txtorcon report the reason for circuit
build failures here: https://github.com/meejah/txtorcon/pull/299
My scanner now uses this txtorcon feature:
https://github.com/david415/tor_partition_scanner
)

I used a CollecTor consensus file: 2018-04-27-19-00-00-consensus

wget https://collector.torproject.org/recent/relay-descriptors/consensuses/2018-04-27-19-00-00-consensus

and extracted the top 100 relays with the highest consensus weights
that have both the Stable AND Fast flags.

./helpers/query_fingerprints_from_consensus_file.py 2018-04-27-19-00-00-consensus > top100.relays
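
(The helper boils down to something like the sketch below, shown here with
stem for clarity; the actual script may differ in details:)

# Sketch: the 100 highest-consensus-weight relays carrying the Fast and
# Stable flags (the helper script may do this differently).
from stem.descriptor import parse_file

routers = [r for r in parse_file("2018-04-27-19-00-00-consensus",
                                 descriptor_type="network-status-consensus-3 1.0")
           if "Fast" in r.flags and "Stable" in r.flags]
routers.sort(key=lambda r: r.bandwidth or 0, reverse=True)
for r in routers[:100]:
    print(r.fingerprint)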

and then performed the scan, building 9900 two-hop tor circuits:

detect_partitions.py --tor-control unix:/var/run/tor/control --log-dir ./ --status-log ./status_log \
--relay-list top100.relays --secret secretTorEmpireOfRelays --partitions 1 --this-partition 0 \
--build-duration .25 --circuit-timeout 60 --log-chunk-size 1000 --max-concurrency 100

This resulted in only 307 circuit build failures:

echo "select reason from scan_log where status = 'failure'
Post by dawuud
;" | sqlite3 scan1.db | wc -l
307

And for 301 of these failures, the circuit build failure REASON was reported by little-t tor as TIMEOUT:

echo "select reason from scan_log where status = 'failure';" | sqlite3 scan1.db | grep -i timeout | wc -l
301

Here are the non-timeout REASONs for these circuit build failures:

echo "select reason from scan_log where status = 'failure';" | sqlite3 scan1.db | grep -vi timeout

DESTROYED, FINISHED
DESTROYED, FINISHED
DESTROYED, CHANNEL_CLOSED
DESTROYED, CHANNEL_CLOSED
DESTROYED, CHANNEL_CLOSED
DESTROYED, CHANNEL_CLOSED
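
(Tallying failures per REASON could be a single query; same sqlite3 sketch
as before, assuming the scan_log schema above:)

import sqlite3

db = sqlite3.connect("scan1.db")
q = ("select reason, count(*) from scan_log where status = 'failure' "
     "group by reason order by count(*) desc")
for reason, n in db.execute(q):
    print(n, reason)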


I'm curious to try this scan at different times of day to see if results vary.


Cheers,

David
dawuud
2018-05-02 20:00:26 UTC
I think that many of my previous scans were not useful and
showed inaccurate results because the IP address I was scanning
from might have gotten blacklisted by dir-auths? Or perhaps blocked
by many relays by the anti-denial-of-service mechanisms in tor?
I got rid of that virtual server and lost use of its IP address... so we'll never know.

Katharina and I are interested in doing lots more thorough scans of
the Tor network rather than this limited methodology I've been using.

What are the guidelines to avoid getting blocked by the tor network?
Is it possible to check the consensus to see if a client IP has been blocked?
nusenu
2018-05-02 20:41:00 UTC
Post by dawuud
I think that many of my previous scans were not useful and
showed inaccurate
I'm glad it turned out that these previous results might have been inaccurate
(because the results would have been scary if accurate).
Post by dawuud
results because the IP address I was scanning
from might have gotten blacklisted by dir-auths?
I don't see how dir auths could blacklist specific client IP addresses
(tor clients use fallbackdirs)
Post by dawuud
Or perhaps blocked
by many relays by the anti-denial-of-service mechanisms in tor?
Can you let me know the start and end date of the scan (2018-03-12?) so I can check how many of
the relays you scanned (the top 100 relays by cw? at the time)
had a tor version with anti-DDoS features at the time?
During your first scans (2017) there were no anti-DoS features.
Post by dawuud
I got rid of that virtual server and lost use of its IP address... so we'll never know.
Katharina and I are interested in doing lots more thorough scans of
the Tor network rather than this limited methodology I've been using.
I'm excited to hear that.
Post by dawuud
What are the guidelines to avoid getting blocked by the tor network?
Stay under the public thresholds?
https://www.torproject.org/docs/tor-manual-dev.html.en#_denial_of_service_mitigation_options
Post by dawuud
Is it possible to check the consensus to see if a client IP has been blocked?
the consensus holds information about relays, not about tor client IP addresses, but
I assume you know that and I misunderstood your question?
--
https://mastodon.social/@nusenu
twitter: @nusenu_
dawuud
2018-05-02 22:35:22 UTC
Post by nusenu
can you let me know the start and end date of the scan (2018-03-12?) so I can check how many of
the relays you scanned (the top 100 relays by cw? at the time)
That scan only took an hour or so to perform, and I posted the e-mail
minutes after the scan, so you can refer to the date in the e-mail header ;-)
Post by nusenu
During your first scans (2017) there have been no anti-dos features.
Ah yeah, that's true, and I think we'll see lots of partitions in the
tor network if we continue to scan, although my latest results show
us that at least the top 100 tor relays are OK. We might find that
relays with a lower consensus measurement value are getting more
traffic than they can handle, which in turn causes those relays to drop
new circuit builds. Just a theory. The new scan was done from a server
in the US... so, I mean, we'll see what happens when we perform scans
from different locations repeatedly at different times of day.
Post by nusenu
Post by dawuud
What are the guidelines to avoid getting blocked by the tor network?
stay under the public thresholds?
https://www.torproject.org/docs/tor-manual-dev.html.en#_denial_of_service_mitigation_options
Ah, thanks!
Post by nusenu
Post by dawuud
Is it possible to check the consensus to see if a client IP has been blocked?
the consensus holds information about relays not about tor client IP addresses, but
I assume you know that and I misunderstood your question?
Hmm, I was thinking that there could be a limited blacklist of client IPs, but I guess there isn't one.
Never mind, then.
teor
2018-05-02 23:28:40 UTC
Post by nusenu
Post by dawuud
What are the guidelines to avoid getting blocked by the tor network?
stay under the public thresholds?
https://www.torproject.org/docs/tor-manual-dev.html.en#_denial_of_service_mitigation_options
Those are the defaults.

You'll need to stay under the current thresholds in the consensus:
https://consensus-health.torproject.org/#consensusparams
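
For example, with stem you can read the live DoS parameters out of a
consensus document (a sketch, parsing a previously downloaded consensus
file; the param names come from the man page section above):

# Sketch: print the current DoS-mitigation consensus params, if any are set.
from stem.descriptor import DocumentHandler, parse_file

consensus = next(parse_file("2018-04-27-19-00-00-consensus",
                            descriptor_type="network-status-consensus-3 1.0",
                            document_handler=DocumentHandler.DOCUMENT))
for name, value in sorted(consensus.params.items()):
    if name.startswith("DoS"):
        print(name, "=", value)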

T
