Test Traffic Project
SW-MGMT
Introduction
The management actions for the TTM and RIS networks include:
- Installation
- Configuration
- Fault management
- Security checks
- SW consistency checks (local)
Management actions have to be run both on the remote machines and on
a central machine at the NCC. On all machines, we want to ensure that
the necessary processes are up and running, while from the NCC we want
to test that each box is still reachable.
The configuration files are part of the so-called "management information".
Since we don't have a secure network-wide management system,
it is desirable to use the SW-SYNC mechanism in combination
with a configuration mechanism that:
- monitors processes and maintains the correct set of them on the system
- implements a network-wide crontab solution that handles exceptions well
- can perform tasks depending on conditions such as hostname, time, etc.
- does reactive administration when spotting faults (the so-called states)
For maintainability, configuration information should not be fragmented.
Such a tool has been found: cfengine (CFE), which is openly developed.
CFE covers the most frequent administration needs; at the same time,
it can easily be extended through custom modules and through scripts
or programs written in any language.
Capabilities
CFE is capable of the following:
- Implements a network-wide, crontab-like configuration in a flexible fashion
- Can perform tasks according to conditions such as hostname, time, etc.
- Monitors processes and maintains the correct set of them on the system
- Performs filesystem maintenance: checks/sets ownerships, permissions, symlinks
- Cleans up the filesystem (core files, *old files, etc.)
- Has Tripwire-like functionality, by building a database of MD5 checksums
- Does reactive administration when spotting faults
  (the so-called states: handle the problem intelligently)
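As an illustration of the condition-based and process-monitoring capabilities listed above, here is a minimal sketch in cfengine 2-style syntax. The host names tt01 and tt02 and the path /usr/local/bin/weekly-report are hypothetical, and the keywords should be checked against the installed cfengine version.

```
control:
   actionsequence = ( processes shellcommands )

groups:
   # Hypothetical host names, used only for this illustration
   TestBoxes = ( tt01 tt02 )

processes:
   TestBoxes::
      # If no process matching "sshd" is found, start it again
      "sshd" restart "/usr/sbin/sshd"

shellcommands:
   # Only on TestBoxes, and only when cfengine runs on a Monday
   # between 02:00 and 02:59 (hypothetical script path)
   TestBoxes.Monday.Hr02::
      "/usr/local/bin/weekly-report"
```

The class expression TestBoxes.Monday.Hr02 combines a host group with built-in time classes. This is what gives the crontab-like behaviour, assuming cfengine itself is started regularly, for example once an hour from cron.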
Objectives
- Distributed, local self-monitoring
- Minimal impact on the operator
  - by not producing any output when no unusual situations occurred
- Network fault tolerance
  - define "timeouts" and, if the problem persists, contact tt-ops
- Handle or recover from failures
  - when {updating config|getting data|updating SW}
- Report unexpected changes of the filesystem (not high priority)
- Auto-restart of system processes such as named, sshd, xntpd, sendmail
- Daily/Weekly/Monthly overview mails should be improved.
  Much of the functionality in these scripts is delivered
  as standard CFE features:
  - checking available disk space
  - tidying the filesystem
  - compressing & rotating /var/log files
  - checking suid files, and diffs between important files
    (or MD5 signatures)
- Ensure remote maintenance of the machines.
  We want to ensure, at regular intervals (a couple of times a day),
  that all machines are still reachable from the NCC and can be
  connected to via ssh.
- Monitoring of TT software, ensuring that all data-taking processes
  are running correctly and still produce output; if not, restart them
- Managing the state of the boxes (SETUP, ON, OFF, WATCH ...)
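Two of the standard features mentioned in the objectives, checking disk space and rotating log files, might look roughly like this in cfengine 2-style syntax. The 10mb threshold, the log file name and the schedule are assumptions chosen for illustration; compression is not shown.

```
control:
   actionsequence = ( disks disable )

disks:
   any::
      # Warn when less than 10 MB is free on /var (assumed threshold)
      /var freespace=10mb inform=true

disable:
   # Rotate on Sundays between 03:00 and 03:59, keeping 4 old copies
   Sunday.Hr03::
      /var/log/messages rotate=4
```

Because cfengine only reports when a rule is violated or an action is taken, a fragment like this stays silent in the normal case, which fits the objective of minimal impact on the operator.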
Requirements relating to SW-MGMT
- SW versioning, rollback (undoing changes) and logging are desirable.
- Be able to manage symbolic links in a structured way,
  which is useful for SW control between machines with variable exceptions.
- Special action may be required on a box after a file/directory has been
  transferred (e.g. restart processes, run newaliases).
- The second (backup) disk on the test-box contains
  a recent copy of the contents of the system disk. The second disk should
  be updated only when there are changes on the system disk (as copying
  the disk takes time and resources); the /data area should be excluded.
  This should happen roughly once a week.
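The requirement of a special action after a transfer, and the structured handling of symbolic links, could be sketched as follows in cfengine 2-style syntax. The server name master.example.net, the source path /masterfiles/etc/aliases, the link names and the class name aliases_changed are all hypothetical. Remote copying also assumes that the cfengine file server daemon is running on the master host.

```
control:
   actionsequence = ( copy links shellcommands )
   # Declare the class that the copy rule below may define
   AddInstallable = ( aliases_changed )

copy:
   TestBoxes::
      # Fetch the file only if it differs; define a class when it changed
      /masterfiles/etc/aliases dest=/etc/aliases
                               server=master.example.net
                               mode=644
                               define=aliases_changed

links:
   TestBoxes::
      # Point a stable name at the currently selected SW version
      /usr/local/tt -> /usr/local/tt-1.0

shellcommands:
   # Runs only when the copy above actually replaced the file
   aliases_changed::
      "/usr/bin/newaliases"
```

The order in actionsequence matters here: copy has to come before shellcommands so that the class aliases_changed is already set when the follow-up command is evaluated.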
Example implementation
#!/usr/local/sbin/cfengine -f

control:
   domain         = ( ripe.net )
   sysadm         = ( tt-ops@ripe.net )
   actionsequence = ( directories tidy )

# Example of keeping the filesystem in a desired mode.
# The TestBoxes class is assumed to be defined elsewhere (not shown here).
directories:
   TestBoxes::
      /usr/sbin mode=755

tidy:
   TestBoxes::
      # Typical system maintenance: remove *.core files of any age, recursively
      /usr/home/ pat=*.core age=0 r=inf
      # The next rule replaces 35 lines of Perl code in ~ttraffic/bin/cleanup.pl (!!!)
      # It removes xntp statistics files older than 30 days
      /var/log/xntp pat=*stats* age=30