Test Traffic Project
SW-SYNC
REQUIREMENTS FOR TTM/RIS NODES CONTROL (SW SYNCHRONIZATION & MANAGEMENT PROJECT)
Overview
The New Projects group requires an efficient solution for the
SW Synchronization & Management of machines for the TTM/RIS project.
This will replace the present mechanism for maintaining the test-boxes.
Our experience shows that, while this mechanism still works,
it is unlikely to scale when the TTM project grows
from 40 to 100 test-boxes in the course of 2000.
The new solution should have a much longer lifetime than that,
and should therefore cope with a chain at least an order of magnitude
larger than the current one of 40 test-boxes.
In parallel, the new project RIS was started.
The RIS will install 10 so-called RRCs at Internet Exchanges.
The software on these machines will also have to be maintained.
For simplicity's sake, we prefer a solution that can be replicated
for RIS.
The system will run in a distributed environment, where the speed and
availability of connections vary. The implementation should be able to cope
with this.
In an office/LAN environment, network problems are so rare that, once they
are resolved, they are not worth a second thought. In the WAN environment
of TTM & RIS, it is the exception for everything to run without
network instabilities. Coping with this condition is a strong requirement.
The sub-project that addresses file transfers will be referred to
as SW-SYNC.
The rest concerns management information and control of test-box behavior,
which currently requires human action and should be automated further.
The sub-project that addresses those issues will be referred to as SW-MGMT.
SW-SYNC
SW-SYNC should take into consideration the following:
According to past experience from the TTM project,
we have the following cases when updating SW or controlling test-boxes:
-
1) Installing a new machine, where the whole tree is distributed to a box
with a basic, temporary configuration and IP-address 193.0.0.5 (or
other).
-
2) Updating critical system software that can affect a running system (for
example /kernel).
-
3) Updating user software for the ttraffic account, and non-critical system
software (for example, a new perl5 version). Other user accounts
on the test-boxes are only for testing/development; it is up to the
users to maintain those accounts.
-
4) Checking the consistency and integrity of files at remote test-boxes.
Experience shows that cases 1 and 3 are by far the most frequent.
Case 4 now happens only at irregular intervals,
though we should do this more often. Case 2 typically happens once
a year.
Experience shows that there are 2 distinct cases
when software has to be updated:
-
A) As soon as possible.
For example, when a bug is discovered that has to be fixed immediately.
-
B) With the next update.
For example, when a small feature is added to the software.
From this requirement, it follows that we should be able to start the
update script by hand as well as run it from a cron-job (or similar).
The requirements (4) and (B) above can both be met
by running the update job automatically at regular intervals.
Furthermore, we must distribute new configuration files describing the
current chain status to the machines.
It must be possible to distribute the TTM configuration file alone,
as often as we want and without much trouble. On average
we distribute a new configuration once a week, with large variations.
(We will continue to use the present tool for the time being.)
Security & Integrity:
-
Strong security
Authentication of the RIPE NCC entity is necessary, so that we can be sure
that the test-box is indeed running the software that we expect to be
installed there.
-
There should be provision for maintaining the integrity of the test-box when
the transport mechanism fails (usually because the network is unavailable).
This should be checked when the machine comes back online,
as well as at regular intervals.
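One way to make the integrity check concrete is to compare a manifest of check-sums from the master tree with one computed on the test-box. The following is a minimal sketch only: the choice of SHA-256, the manifest layout, and the function names `file_digest`, `build_manifest` and `compare_manifests` are assumptions for illustration, not part of the requirements. It covers integrity only; authenticating the source of the manifest would need a separate mechanism.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of one file, read in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map the relative path of every regular file under root to its digest."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): file_digest(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def compare_manifests(expected, actual):
    """Report files that are missing, changed, or unexpected on the test-box."""
    common = expected.keys() & actual.keys()
    return {
        "missing": sorted(set(expected) - set(actual)),
        "changed": sorted(p for p in common if expected[p] != actual[p]),
        "unexpected": sorted(set(actual) - set(expected)),
    }
```

An empty report (three empty lists) would mean the box matches the master tree. Running the same comparison when a machine comes back online and at regular intervals would cover both checks named above.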
Structure:
The machines can belong to a "distribution tree":
-
proto_test_box
-
proto_test_box_2
-
rrc
-
...
-
The main distribution trees should be on a disk that is in the regular backup
scheme of the RIPE NCC. To guarantee that the software cannot be hacked,
the disk should be inside the NCC firewall.
-
There is a small set of files (mainly in /etc/, ~/ttraffic/config) that are
machine-specific.
Development:
One of the machines (osdorp, in the TTM case) will be used for development.
It should be possible to exclude certain areas of the system (like
/usr/home/ttraffic/src) from automatic updates, to take regular backups, etc.
The framework that defines how much and when the system is administered
is referred to as the development BCP (= Best Current Practice).
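The exclusion of certain areas from automatic updates could be expressed as a simple path filter. This is a sketch under assumptions: the names `EXCLUDED_AREAS`, `is_excluded` and `files_to_update` are hypothetical, and only /usr/home/ttraffic/src comes from the text above.

```python
from pathlib import PurePosixPath

# Hypothetical exclusion list for the development machine;
# only /usr/home/ttraffic/src is named in the requirements.
EXCLUDED_AREAS = ("/usr/home/ttraffic/src",)


def is_excluded(path, excluded_areas=EXCLUDED_AREAS):
    """True if path is an excluded directory or lies beneath one."""
    candidate = PurePosixPath(path)
    for area in excluded_areas:
        area_path = PurePosixPath(area)
        if candidate == area_path or area_path in candidate.parents:
            return True
    return False


def files_to_update(paths, excluded_areas=EXCLUDED_AREAS):
    """Keep only the paths that automatic updates are allowed to touch."""
    return [p for p in paths if not is_excluded(p, excluded_areas)]
```

Matching on whole path components, rather than on string prefixes, avoids excluding a sibling directory such as /usr/home/ttraffic/src2 by accident.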
Performance requirements:
-
The system should be able to check whether all files are still up to date
(say, by comparing check-sums) in 3 hours or less for the present network,
using parallelism as much as possible.
-
The process should run automatically, in a cron-job or similar, reporting only
when something fails. When the rare update case 2 is in progress,
this job can be switched off by hand.
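The two performance requirements above (parallel checks, and reporting only on failure) might be combined as in the following sketch. It assumes a caller-supplied function `check_box(box)` that returns a list of problems for one test-box and raises an exception when the box is unreachable; that function, the names `check_all_boxes` and `format_report`, and the worker count are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def check_all_boxes(boxes, check_box, max_workers=10):
    """Run check_box(box) for every box in parallel; return only the failures."""
    failures = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {box: pool.submit(check_box, box) for box in boxes}
        for box, future in futures.items():
            try:
                problems = future.result()
            except Exception as error:  # for example, the box is unreachable
                failures[box] = [f"check failed: {error}"]
            else:
                if problems:
                    failures[box] = problems
    return failures


def format_report(failures):
    """Return one line per problem, or an empty string when nothing failed."""
    lines = []
    for box in sorted(failures):
        for problem in failures[box]:
            lines.append(f"{box}: {problem}")
    return "\n".join(lines)
```

With 40 boxes and a 3-hour budget, a purely sequential run would leave 4.5 minutes per box; checking several boxes at once relaxes that limit. Printing the report only when it is non-empty fits a cron-job, since cron typically mails output only when the job produces any.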
Experiences and other considerations:
-
The current mechanism is based on rdist6 and is able to change/compare a
particular set of files for a given list of test-boxes. The list of files
to be updated has grown considerably over the last months, and comparing the
full file-systems takes exceptional effort and time.
-
Occasionally boxes turn out to be unreachable or, even worse, updates
are stopped halfway because of connectivity problems. The current mechanism
is becoming slow and inappropriate for TTM maintenance.
Thus, we require that the new system can run automatically and will recover
from network interruptions, even while an update is in progress.
This will enable us to do regular consistency checks as well as distribute
non-urgent updates automatically (see SW-MGMT).
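Recovery from an interruption halfway through an update could work by recording which files have already been transferred, retrying a failed transfer a few times, and resuming from the record on the next run. The sketch below illustrates that idea only; the state file format, the retry policy, and the names `load_done` and `run_update` are assumptions, and `transfer` stands for whatever transport is eventually chosen.

```python
import json
import time
from pathlib import Path


def load_done(state_file):
    """Return the set of files recorded as transferred in earlier runs."""
    state_file = Path(state_file)
    if state_file.exists():
        return set(json.loads(state_file.read_text()))
    return set()


def run_update(files, transfer, state_file, max_attempts=3, delay=1.0,
               sleep=time.sleep):
    """Transfer each pending file; return True once all files are done."""
    state_file = Path(state_file)
    done = load_done(state_file)
    for name in files:
        if name in done:
            continue  # already transferred in an earlier, interrupted run
        for attempt in range(1, max_attempts + 1):
            try:
                transfer(name)
            except OSError:
                if attempt == max_attempts:
                    return False  # give up for now; the next run resumes here
                sleep(delay * 2 ** (attempt - 1))  # exponential back-off
            else:
                done.add(name)
                state_file.write_text(json.dumps(sorted(done)))
                break
    return True
```

Because the state is written after every successful transfer, a run that is cut off by a network failure loses at most the file that was in flight, and the next automatic run picks up where the previous one stopped.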
-
Changing the current setup is a large-scale procedure and includes a non-zero
risk of some tool failing (including third parties' tools). Also, up to now
there is no complete documentation of the customized FreeBSD
distributions, and since they have been maintained by different people at
different times, there is always a risk that a small change can be fatal for
the systems (for example, by changing uid/gid of system libraries).
So, a proposal is:
-
In order to avoid similar problems in the future, the tool should log
which files are updated.
-
We'll create a proven installation mechanism that can work remotely, either
with the help of a floppy or by installing the second disk.
-
Once the above is in place, we can update our systems in a safe manner,
since if something goes wrong, we can always easily recreate any machine.
-
In the future, we might change the way test-boxes are configured.
It should be possible to distribute the software tree to a machine with a
temporary setup at an address specified by us, which can be outside the NCC.
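The first proposal above, logging which files are updated, could be as simple as appending one record per changed file to a log. The following sketch assumes a JSON-lines log file; the record fields and the names `log_update` and `updates_for_box` are hypothetical.

```python
import json
from datetime import datetime, timezone


def log_update(log_path, box, path, action, now=None):
    """Append one record describing a changed file to a JSON-lines log."""
    stamp = (now or datetime.now(timezone.utc)).isoformat()
    record = {"time": stamp, "box": box, "path": path, "action": action}
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(record) + "\n")
    return record


def updates_for_box(log_path, box):
    """Return all logged records for one box, oldest first."""
    with open(log_path, encoding="utf-8") as log:
        records = [json.loads(line) for line in log if line.strip()]
    return [record for record in records if record["box"] == box]
```

Such a log would make it possible to see afterwards which change preceded a failure on a given box, which is the kind of overview the text above says is missing today.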