Note: The parameters are still being tuned. The numbers reflect their values at the time this page was created and may have changed since then. However, this does not affect the principles behind the alarm.
The alarm system consists of 2 programs: LTA and STA, with a few support scripts.
The Long Term Average (LTA) program divides the day into 4 equal periods of 6 hours and for each test-box that is sending data to this box, the LTA program maintains a distribution of the one-way-delays measured during this period. The 4 periods are: 0:00-6:00 GMT, 6:00-12:00 GMT, 12:00-18:00 GMT and 18:00-24:00 GMT. These intervals roughly correspond to night, morning, afternoon and evening in most of the RIPE-NCC service area (and to evening, night, morning and afternoon in the US). By selecting them this way, day-night effects are excluded as much as possible. The LTA program updates the distributions once a day using the data from the previous day. Data older than 30 days is removed from the distributions.
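As an illustration of the four-period split, the following minimal sketch maps a measurement time to one of the 6-hour GMT periods. The function name lta_period and the use of Unix timestamps as input are assumptions made for this example; they are not taken from the LTA program itself.

```python
from datetime import datetime, timezone

# Labels for the four 6-hour GMT periods described above.
PERIODS = ["0:00-6:00", "6:00-12:00", "12:00-18:00", "18:00-24:00"]


def lta_period(unix_time):
    """Return the index (0..3) of the 6-hour GMT period containing unix_time.

    Hypothetical helper: the real LTA program may bucket its data differently.
    """
    hour = datetime.fromtimestamp(unix_time, tz=timezone.utc).hour
    return hour // 6


if __name__ == "__main__":
    t = 938390715  # 27 Sep 1999, 00:05:15 GMT
    print(PERIODS[lta_period(t)])
```

A delay measured at 00:05 GMT would be added to the distribution for the first period only, so it is never compared against afternoon traffic.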
The Short Term Average (STA) program maintains the same distributions but for a much shorter period of only the last 30 minutes.
Every 15 minutes, both the short and long term distributions are parameterized by 3 percentiles: 5%, 50% and 95%. Each percentile is the delay below which 5%, 50% or 95% of the measured points fall, so if the 5%-percentile is 10.0 ms, then 5% of the delays measured between two test-boxes are less than 10 ms, and 95% are more than 10 ms. The code then compares the short term results against the long term results.
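The percentile summary can be sketched as follows. This example uses the nearest-rank method, which is an assumption: this page does not say how the alarm programs interpolate between points. The function names percentile and summarize are also made up for this illustration.

```python
def percentile(sorted_delays, p):
    """Nearest-rank percentile of an ascending list of delays (p in 1..100)."""
    n = len(sorted_delays)
    if n == 0:
        raise ValueError("no delays to summarize")
    # Integer ceiling of p * n / 100, with a minimum rank of 1.
    rank = max(1, -(-p * n // 100))
    return sorted_delays[rank - 1]


def summarize(delays):
    """Return the (5%, 50%, 95%) percentiles of a collection of delays in ms."""
    ordered = sorted(delays)
    return tuple(percentile(ordered, p) for p in (5, 50, 95))


if __name__ == "__main__":
    # 100 artificial delays of 1 ms, 2 ms, ..., 100 ms.
    print(summarize(range(1, 101)))
```

The same summary would be computed for the 30-day distribution and for the 30-minute distribution, giving two triples that can be compared directly.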
If the short term results are above what is expected from the long term average, then an alarm message will be sent to the hosts of the two (sending and receiving) test-boxes. The format of the message is explained here. The time constants of 30 and 15 minutes mean that an unusual condition has to exist for 15 to 30 minutes before an alarm is raised. A typical situation is shown in this figure:
The figure shows the delay between two test-boxes for a 24 hour period. Around -15 (or 9am) something caused the delays to go up from 30ms to about 100ms. This created an alarm message similar to the one shown here.
The programs maintain state, so when the alarm condition disappears, the hosts of the two test-boxes will receive another message. In the figure above, this happened at 11am. Other conditions where the short term average differs from the long term average are recorded but no error message will be sent.
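The effect of keeping state can be sketched as below: a message is produced only when a pair of test-boxes enters or leaves the "Alarm" state. The state names "Go Up" and "Alarm" and the two subject lines come from the example emails on this page; the function notification and the placeholder state "Other" are hypothetical and only illustrate the idea.

```python
def notification(old_state, new_state):
    """Return the email subject for a state change, or None if no mail is sent.

    Hypothetical sketch: only transitions into or out of "Alarm" notify.
    """
    if new_state == "Alarm" and old_state != "Alarm":
        return "Testbox ALARM SET"
    if old_state == "Alarm" and new_state != "Alarm":
        return "Testbox ALARM RESET"
    # Any other change is recorded by the programs but not mailed.
    return None


if __name__ == "__main__":
    print(notification("Go Up", "Alarm"))   # alarm raised
    print(notification("Alarm", "Alarm"))   # still in alarm, no new mail
    print(notification("Alarm", "Go Up"))   # alarm cleared
```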
The alarm conditions are shown in the figure above. There are 9 cases:
When the alarm on the test-box is triggered (that is, the STA program returned "Alarm" for a pair of test-boxes), a message like this will be sent to the hosts of the pair of test-boxes that generated the alarm:
Date: Mon, 27 Sep 1999 00:05:16 GMT
From: ttraffic@ripe.net
To: tt-ops@ripe.net
Subject: Testbox ALARM SET

The testbox alarm program on tt01.ripe.net found:

TB 23 at 938390715 ALARM SET old: Go Up, new: Alarm
TB 23 at 938390715 Long: 518 59.0/ 66.5/225.0 Short: 36 238.0/277.5/299.0

Satellite conditions on tt01.ripe.net:
Mon Sep 27 00:02:20 1999: Satellites seen from 19990926 225900 to 235900: 0 0 0 0 21 29 7 0 0 0

This message has been sent to: tt23 tt01

For an explanation of this email please see
http://www.ripe.net/test-traffic/Host_testbox/alarm.html
The Subject: line tells whether an alarm has been set or reset. The line "The testbox alarm..." shows which test-box generated the alarm. Note that each host will receive messages both for data sent to their box and for data originating from their box. In the latter case, the alarm will be sent by the receiving test-box. The following lines tell what happened:
In this particular case, data sent from tt23 (test-box #23) to tt01
(test-box #1) generated an alarm. The delay distribution over the last 30
days had a 5% percentile of 59.0 ms, a median of 66.5 ms and a 95%
percentile of 225.0 ms. In the last 30 minutes, these numbers climbed to
238.0, 277.5 and 299.0 ms respectively. Since this is above what was
expected, the alarm was set.
To find out where tt01 and tt23 are located, click here. (This page is only accessible to sites actually hosting a
test-box.)
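For hosts who want to process these emails automatically, the "Long: ... Short: ..." line can be parsed along the following lines. This is a sketch under several assumptions: the regular expression is derived only from the single example shown above, the counts before the percentiles (518 and 36) are taken to be the number of points in each distribution, and the number after "at" is read as seconds since the Unix epoch (for the example it converts to 27 Sep 1999 00:05:15 GMT, one second before the Date: header). The name parse_summary is hypothetical.

```python
import re
from datetime import datetime, timezone

# Pattern inferred from the one example line on this page (an assumption).
PATTERN = re.compile(
    r"TB\s+(\d+)\s+at\s+(\d+)\s+"
    r"Long:\s+(\d+)\s+([\d.]+)/\s*([\d.]+)/\s*([\d.]+)\s+"
    r"Short:\s+(\d+)\s+([\d.]+)/\s*([\d.]+)/\s*([\d.]+)"
)


def parse_summary(line):
    """Split a 'TB ... Long: ... Short: ...' line into its fields."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError("not a Long/Short summary line")
    box, ts, n_long, l5, l50, l95, n_short, s5, s50, s95 = match.groups()
    return {
        "sender": int(box),  # sending test-box number
        "time": datetime.fromtimestamp(int(ts), tz=timezone.utc),
        "long": (int(n_long), float(l5), float(l50), float(l95)),
        "short": (int(n_short), float(s5), float(s50), float(s95)),
    }


if __name__ == "__main__":
    example = ("TB 23 at 938390715 Long: 518 59.0/ 66.5/225.0 "
               "Short: 36 238.0/277.5/299.0")
    print(parse_summary(example))
```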
The line Satellite conditions shows the satellite conditions for the receiving test-box. This information is intended to verify the alarm. The next line (Mon Sep 27...) shows:
So, in the example above (0 0 0 0 21 29 7 0 0 0), there were 21 samples where the receiver saw 4 satellites, 29 where it saw 5 and 7 samples where it saw 6 satellites.
This information can be used to verify the alarm. If ALL entries are in the first bin (60 0 0 0 0 0 0 0 0) then the alarm might be a false alarm caused by a drifting clock rather than a network problem. Note that a FEW entries in the lowest bin, up to about 1 out of 3, are no reason for concern.
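The check described above can be sketched as follows. The code assumes that the counts are the whitespace-separated numbers after the last colon of the satellite line, and that the first number is the bin for samples where no satellites were seen, as the worked example suggests. The names satellite_bins and clock_suspect are hypothetical.

```python
def satellite_bins(line):
    """Return the per-bin sample counts from a 'Satellites seen' line."""
    counts = line.rsplit(":", 1)[1].split()
    return [int(c) for c in counts]


def clock_suspect(bins):
    """True only if every sample falls in the first bin.

    A few samples in the first bin are normal, so only the extreme case
    is flagged as a possible drifting clock.
    """
    total = sum(bins)
    return total > 0 and bins[0] == total


if __name__ == "__main__":
    line = ("Mon Sep 27 00:02:20 1999: Satellites seen from "
            "19990926 225900 to 235900: 0 0 0 0 21 29 7 0 0 0")
    bins = satellite_bins(line)
    print(bins, clock_suspect(bins))
```

For the example email the check is negative, which suggests the alarm reflects a real change in delay rather than a timing problem on the receiving test-box.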
If you suspect that the alarms are caused by a drifting clock, then you may want to look at the raw satellite data.
When the alarm condition disappears, the host will receive a second email. This looks like:
Date: Mon, 27 Sep 1999 00:20:18 GMT
From: ttraffic@ripe.net
To: tt-ops@ripe.net
Subject: Testbox ALARM RESET

The testbox alarm program on tt01.ripe.net found:

TB 23 at 938391617 ALARM RESET old: Alarm, new: Go Up
TB 23 at 938391617 Long: 518 59.0/ 66.5/225.0 Short: 37 216.0/272.5/299.0

Satellite conditions on tt01.ripe.net:
Mon Sep 27 00:02:20 1999: Satellites seen from 19990926 225900 to 235900: 0 0 0 0 21 29 7 0 0 0

This message has been sent to: tt23 tt01

For an explanation of this email please see
http://www.ripe.net/test-traffic/Host_testbox/alarm.html
The format is almost the same, except that the subject says that the alarm
is reset.
Who gets email from the alarm program?
By default, the alarm messages are sent to the contact person(s) for the
test-box. If you want this to be changed, please contact the test-box operators.
More information.
For more details about this program, please contact the test-box operators.