Test Traffic Project

SW-SYNC

REQUIREMENTS FOR TTM/RIS NODES CONTROL (SW SYNCHRONIZATION & MANAGEMENT PROJECT)

Overview

The New Projects group requires an efficient solution for the
SW Synchronization & Management of machines for the TTM/RIS project.
This will replace the present mechanism for maintaining the test-boxes.

Our experience shows that, while this mechanism still works,
it is unlikely to scale when the TTM project grows
from 40 to 100 test-boxes in the course of 2000.
Of course, the new solution should have a lifetime much longer than that,
and thus be able to cope with a chain at least an order of magnitude
greater than the current one (we now run 40 test-boxes).

In parallel, the new RIS project has been started.
RIS will install 10 so-called RRCs (Remote Route Collectors) at Internet Exchanges.
The software on these machines will also have to be maintained.
For simplicity's sake, we prefer a solution that can also be replicated for RIS.

The system will run in a distributed environment where the speed and
availability of connections vary; the implementation should be able to cope
with this. In an office/LAN environment network problems are so rare that,
once they have been recovered from, they are not worth a second thought.
In the WAN environment of TTM & RIS, by contrast, it is the exception for
everything to keep running without network instabilities. Complying with
this condition is a strong requirement.
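
To make this concrete, here is a minimal sketch (in Python) of one way to
cope with unstable links: retry a transfer a few times with increasing
delays instead of giving up on the first error. The transfer callable, box
names and timing values are placeholders; the document does not prescribe a
transfer mechanism.

    import time

    def transfer_with_retries(transfer, box, path, attempts=5, initial_delay=60):
        """Run transfer(box, path) up to 'attempts' times, backing off
        between tries so a temporary WAN problem does not abort the run."""
        delay = initial_delay
        for attempt in range(1, attempts + 1):
            try:
                transfer(box, path)
                return True
            except OSError as err:
                print("%s: attempt %d failed (%s)" % (box, attempt, err))
                if attempt < attempts:
                    time.sleep(delay)
                    delay *= 2          # exponential backoff
        return False                    # leave the box for a later run

Returning False rather than raising lets the caller record the box for a
later run instead of aborting the whole update.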

The sub-project that addresses the file-transfer issues will be referred to
as SW-SYNC. The rest concerns management information and control of
test-box behavior, which currently requires action by a human and should be
automated further. The sub-project that addresses those issues will be
referred to as SW-MGMT.

SW-SYNC

SW-SYNC should take the following into consideration.

According to past experience from the TTM project, the following cases
occur when updating software or controlling test-boxes:

  1. Installing a new machine, where the whole tree is distributed to a box
     with a basic, temporary configuration and IP-address 193.0.0.5 (or other).
  2. Updating critical system software that can affect a running system
     (for example /kernel).
  3. Updating user software (for the ttraffic account; other user accounts
     on the test-boxes are only for testing/development, and it is up to the
     users to maintain those accounts) and non-critical system software
     (for example, a new perl5 version).
  4. Checking the consistency and integrity of files at remote test-boxes
     (a sketch follows below).

Experience shows that cases 1 and 3 are by far the most frequent; case 4
now happens only at irregular intervals, though we should do it more often,
and case 2 typically happens once a year.
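
Case 4 could, for example, be implemented by comparing a checksum manifest
generated on the central machine against checksums computed on each
test-box. The sketch below (Python) shows the test-box side of such a
check; the plain "checksum  path" manifest format and the use of MD5 are
assumptions for illustration, not something this document specifies.

    import hashlib
    import os

    def checksum(path):
        """MD5 of a file's contents, read in blocks to limit memory use."""
        h = hashlib.md5()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(65536), b""):
                h.update(block)
        return h.hexdigest()

    def verify(manifest_file):
        """Compare local files against a 'checksum  path' manifest and
        report anything that is missing or has been modified."""
        problems = []
        with open(manifest_file) as f:
            for line in f:
                expected, path = line.split(None, 1)
                path = path.strip()
                if not os.path.exists(path):
                    problems.append(("missing", path))
                elif checksum(path) != expected:
                    problems.append(("modified", path))
        return problems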

Experience shows that there are two distinct cases when software has to be
updated; both can be handled by running the update job automatically at
regular intervals.

Furthermore, we must distribute new configuration files describing the
current chain status to the machines roughly once a week on average;
"frequently" is too vague to serve as a requirement.

A requirement is that we can distribute the TTM configuration file (and
only that file) as often as we want, without too much trouble. On average
we will distribute a new configuration once a week, with huge variations.
(We will continue to use the present tool for the time being.)
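
"Without too much trouble" could, for instance, mean a single push over the
whole chain that tolerates unreachable boxes and simply remembers them for
the next run. A sketch under that assumption (push(), the box list and the
file name are placeholders):

    def distribute_config(push, boxes, config_file):
        """Push one configuration file to every test-box in the chain,
        collecting failures instead of aborting, since in the WAN
        environment some links will be down at any given time."""
        failed = []
        for box in boxes:
            try:
                push(box, config_file)
            except OSError as err:
                print("%s: could not deliver %s (%s)" % (box, config_file, err))
                failed.append(box)
        return failed       # boxes to retry on the next automatic run

The push() callable could wrap the retrying transfer sketched earlier, so
that short outages are absorbed within a single run.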
 

Security & Integrity:

Structure:
  • The machines can belong to a "distribution tree":
  • Development:
    One of the machines (in the TTM case that is osdorp) will be used for
    development. It should be possible to exclude certain areas of the
    system from automatic updates (like /usr/home/ttraffic/src, sketched
    below), to take regular backups, etc. The framework that defines how
    much and when the system is administered is referred to as the
    development BCP (= Best Current Practice).
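
One way to express "exclude certain areas from automatic updates" is a list
of path prefixes, kept per box as part of the development BCP, that the
update job must never touch. A minimal sketch; the prefix list below is
only an example:

    # Illustrative exclusion list for the development box (osdorp); the
    # real list would be defined by the development BCP.
    EXCLUDED_PREFIXES = [
        "/usr/home/ttraffic/src",       # work in progress, never overwrite
    ]

    def is_excluded(path, excluded=EXCLUDED_PREFIXES):
        """True if 'path' falls under an area exempt from automatic updates."""
        return any(path == p or path.startswith(p + "/") for p in excluded)

    # e.g. is_excluded("/usr/home/ttraffic/src/main.c") -> True
    #      is_excluded("/usr/home/ttraffic/bin/report") -> False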

Performance requirements:

Experiences and other considerations:

So, a proposal is: