[tor-dev] Atlas is not that friendly to Web Archive
Leonid Evdokimov
2018-02-13 14:33:29 UTC
Hello!

I've recently found out that the new Atlas redesign is not that friendly to
web archives. http://archive.li/ can't properly detect the "page loaded"
event, so it ends up capturing the "loading" page[%]. Moreover,
https://web.archive.org/ can't capture #-based links at all, as far as I can see.

[%] https://archive.li/https://atlas.torproject.org/%23details/5C3B8FB35A13C508CF65E8499E35755DA098DC93
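
To illustrate why #-based links are hard for a server-side crawler: the fragment is never part of the HTTP request, so every Atlas details link asks the server for the same resource and the relay is only selected later by JS. A minimal Python sketch (the `request_target` helper is hypothetical, written just for this illustration):

```python
from urllib.parse import urlsplit


def request_target(url: str) -> str:
    """Return the path (plus query) a client would send in the HTTP request.

    The fragment (everything after '#') is handled client-side only,
    so it is deliberately left out.
    """
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return target


details = "https://atlas.torproject.org/#details/5C3B8FB35A13C508CF65E8499E35755DA098DC93"
front_page = "https://atlas.torproject.org/"

# Both URLs lead to the same request; the fingerprint lives only in the fragment.
print(request_target(details) == request_target(front_page))
print(urlsplit(details).fragment)
```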

The ability to archive Atlas pages is kinda nice for "citing" a relay's
status on a specific date, as Atlas has no time machine of its own
and information about a relay is purged a few days after the relay goes down.
https://archive.li/RzGpJ is better than https://archive.li/JGQRW :-)
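
As a stopgap while the relay is still listed, one could also save the Onionoo data directly instead of a rendered page. A rough Python sketch, assuming the Onionoo details document at onionoo.torproject.org and its `lookup` parameter are what is wanted here; the function names and the file naming scheme are made up for illustration:

```python
import json
import urllib.request
from datetime import datetime, timezone
from urllib.parse import urlencode

# Assumption: the Onionoo "details" document is the data worth citing.
ONIONOO_DETAILS = "https://onionoo.torproject.org/details"


def details_url(fingerprint: str) -> str:
    """Build the Onionoo details URL for a single relay fingerprint."""
    return ONIONOO_DETAILS + "?" + urlencode({"lookup": fingerprint})


def snapshot_name(fingerprint: str, when: datetime) -> str:
    """Hypothetical naming scheme: fingerprint plus a UTC timestamp."""
    return f"{fingerprint}-{when.strftime('%Y%m%dT%H%M%SZ')}.json"


def save_snapshot(fingerprint: str) -> str:
    """Fetch the details document and write it to a timestamped local file."""
    now = datetime.now(timezone.utc)
    with urllib.request.urlopen(details_url(fingerprint), timeout=30) as resp:
        document = json.load(resp)
    name = snapshot_name(fingerprint, now)
    with open(name, "w", encoding="utf-8") as out:
        json.dump(document, out, indent=2)
    return name
```

Calling `save_snapshot("5C3B8FB35A13C508CF65E8499E35755DA098DC93")` would write one JSON file per call. This is only a local copy, of course, not something a third party can verify the way an archive.li link can be.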

I'm not a skilled frontend developer, but maybe trading away some Time-to-DOM
by making the JS loading and the onionoo.tpo request synchronous would be
enough to make the website friendly to that sort of crawler... But it's
unclear to me whether T2DOM is a valuable KPI for Atlas or not :)

What do you think?
--
WBRBW, Leonid Evdokimov, xmpp:***@darkk.net.ru http://darkk.net.ru tel:+79816800702
PGP: 6691 DE6B 4CCD C1C1 76A0 0D4A E1F2 A980 7F50 FAB2
Iain Learmonth
2018-02-13 16:42:23 UTC
Hi,
Leonid Evdokimov wrote:
> I've recently found out that new Atlas re-design is not that friendly to
> web archive. http://archive.li/ can't properly detect "page loaded"
> event that leads to capturing "loading" page[%]. Moreover,
> https://web.archive.org/ can't capture #-based links at all, as far as I see.

This is an interesting point. There is not really any way currently to
link to a relay at a particular point in time. The data itself is
preserved in CollecTor, but not in an easy to consume form.
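
To give a feel for "not easy to consume": looking up one relay on one day means fetching a whole month of consensuses. A small Python sketch, assuming CollecTor publishes monthly tarballs under the path layout used below (the layout and the helper name are assumptions made for illustration):

```python
from datetime import date

# Assumption: base path of CollecTor's archived relay descriptors.
COLLECTOR_ARCHIVE = "https://collector.torproject.org/archive/relay-descriptors"


def consensus_tarball_url(day: date) -> str:
    """Return the URL of the monthly consensus tarball covering `day`."""
    month = f"{day.year:04d}-{day.month:02d}"
    return f"{COLLECTOR_ARCHIVE}/consensuses/consensuses-{month}.tar.xz"


# One relay on one day still requires the whole month's archive.
print(consensus_tarball_url(date(2018, 2, 13)))
```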

Capturing rendered pages for later viewing is probably not the most
useful thing that humanity could be doing with its disk drives. The
reason that we currently cannot have a time travel service for Relay
Search is that Onionoo would not be able to handle that amount of data
with its current architecture.

If someone produces a patch that fixes this for Relay Search, I'd be
happy to review it. I haven't yet investigated exactly what would be
required. In the long term though, I would like to fix this issue with a
service that can provide time travel information.

There is also another possible option, not quite as pretty but perhaps
good enough for this purpose, which relates to raw descriptors. #22026
would create a service for accessing raw descriptors, which we could
perhaps turn into a time-traveling service, giving you a link with which
to cite a raw descriptor.

Thanks,
Iain.
