TTM data storage & processing
Design Note
This note describes the processing steps in data analysis and
the organisation of data storage for the TestTraffic project.
Data types
In the project we distinguish three types of files with
measurement data:
- raw_data
the immediate results of the measurement process are stored on the
(remote) testboxes in files starting with the tag RCDP (received data);
the format is ASCII. These files record all data received by the box,
sent by other TT machines.
A record of all packets sent by the box to other machines is kept
in files starting with SNDP (sent data). This information can be used
to determine packet loss (test traffic sent out but not received by
the other box).
In addition there are RVEC* files collecting traceroute information,
and GENE* files with some general data (mostly informational messages
concerning clock & GPS).
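The packet-loss determination mentioned above (test traffic sent out but not received by the other box) amounts to a set difference between the SNDP and RCDP records. A minimal sketch, assuming the packet ids have already been parsed out of the files; the helper name and id representation are illustrative, not part of the actual file format:

```python
# Sketch: find lost packets by comparing what a box sent (SNDP)
# with what the targets reported receiving (RCDP).
# The id values used here are illustrative only.

def lost_packets(sent_ids, received_ids):
    """Return, sorted, the ids of packets sent but never received."""
    return sorted(set(sent_ids) - set(received_ids))
```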
- processed_data
-
After raw data has been transferred to the NCC, two initial processing
steps take place:
- new traceroute vectors are merged into the existing routing vector
database (which records in two files VBYE and RBYT which routing
vectors were used at which time)
- info from the RCDP, SNDP, GENE files and the routing vector files
is merged into PCKB (test PaCKet Binary) files. In this step
lost packets are identified, and the most probable route vector
is found in the TestTraffic traceroute database.
The PCKB files contain all information for the individual TestTraffic
measurements and form the basis for the next analysis steps. Because
this is a simple, well-known data format under our own control, it is
best to keep this intermediate step and not produce ROOT files directly
from the raw data. Should problems arise, or should the format of the
ROOT files change, we will not have to redo the (time-consuming)
merging of data but can simply restart from the PCKB files.
[Supplementary note (Oct '99): the present PCKB files turn out to
be less useful than anticipated: the files are not portable
between the BSDI and Solaris operating systems!]
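The portability problem is consistent with the differing native byte order of the two platforms (little-endian x86 under BSDI, big-endian SPARC under Solaris) when binary data is written in the host's native representation. A minimal Python illustration of the pitfall; the actual PCKB writer is not shown here, and fixing it this way is an assumption:

```python
import struct

# Native byte order ('=' prefix) depends on the host CPU, which is
# why natively written binary files differ between little-endian
# (x86/BSDI) and big-endian (SPARC/Solaris) machines.
native = struct.pack("=I", 1)

# An explicit byte order ('<' little-endian, '>' big-endian) yields
# the same bytes on every platform and makes the file portable.
portable_le = struct.pack("<I", 1)
portable_be = struct.pack(">I", 1)
```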
- ROOT files
The ROOT package is the tool used
for actual interactive and batch analysis of Test Traffic data.
The TTree files created with the package prove to be
very useful: an analysis carried out on one particular ROOT file
can easily be extended over a wider selection of files.
Thus when it comes to data storage, we can let decisions be
steered by efficiency considerations of hardware (tape storage) and
operating system (size of files and filesystems); the analysis itself hardly
imposes any limits.
Organisation of disk space
From the above it is clear we need at least three different
areas for the different data types. Under a top level tree for
TestTraffic production this would result in a logical structure:
/ncc/ttpro |- raw_data
|
|- processed_data
|
|- root
Processed and raw data should be hierarchically ordered in
<year>/<month>/<day> subdirectories;
for example:
raw_data/1999/04/01/RCDP.tt01.ripe.net.19990401-010158
...
/1999/04/22/RCDP.tt23.ripe.net.19990422-090157
Being in ASCII format, raw_data can be stored compressed with gzip;
data processing jobs should check file extensions to see whether
they first need to uncompress a file.
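The naming scheme in the example above, and the extension check, could be sketched as follows; `raw_data_path` and `needs_uncompress` are hypothetical helper names, not existing tools:

```python
import os

def raw_data_path(box, date, start_time):
    """Build the year/month/day path for a raw data file.

    Follows the RCDP naming scheme shown in the example above;
    'date' is YYYYMMDD and 'start_time' is HHMMSS.
    """
    year, month, day = date[:4], date[4:6], date[6:8]
    name = "RCDP.%s.%s-%s" % (box, date, start_time)
    return os.path.join("raw_data", year, month, day, name)

def needs_uncompress(path):
    """Processing jobs check the extension to see whether the file
    was stored gzip-compressed and must be uncompressed first."""
    return path.endswith(".gz")
```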
After considering pros and cons, it was decided to create a one-to-one
mapping of PCKB files to ROOT files, i.e. one ROOT file holds all data
sent by a specific testbox on a specific date. As one of the first steps
in an analysis is the selection of source and target boxes, this splitting
at the filesystem level significantly reduces the amount of data to read.
Consequently,
the storage of ROOT files has the same year/month/day hierarchy as
the raw data and processed data storage.
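Under this one-to-one mapping, the location of a ROOT file follows directly from box and date; a sketch, with the `.root` file naming convention assumed for illustration (the note does not specify the actual ROOT file names):

```python
import os

def root_path(box, date):
    """Path of the ROOT file holding all data sent by 'box' on
    'date' (YYYYMMDD), mirroring the raw_data hierarchy.
    The file name convention here is an assumption."""
    year, month, day = date[:4], date[4:6], date[6:8]
    return os.path.join("root", year, month, day,
                        "%s.%s.root" % (box, date))
```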
Note: the above describes a logical structure; using symbolic links,
directories are in fact spread over multiple filesystems. See the
accompanying note on ginkgo's filesystem organisation.
2nd Note: since April 2001 kauri is the main analysis
machine; it is authoritative for all data. ginkgo is the development
machine and also the master for software distribution and control.
An hourly rsync process takes care of synchronizing (with some exceptions) the
/ncc/ttpro hierarchy of kauri and ginkgo.
Data processing steps
(also see the pages on the data collection & processing cron job)
- raw_data files are fetched from remote boxes daily
and stored on the analysis machine (kauri) in a dedicated filesystem,
after freeing up space (compress/delete old files). With 32 actively measuring
TT boxes, one day of raw_data (with ~2100 packets on each source-target
combination) takes about 350 MB uncompressed, 50 MB compressed.
With a full mesh of 100 machines
that amount goes up by about one order of magnitude (3.125**2 = 9.77).
By compressing the files immediately after processing, it should be
possible to keep raw data online for at least two weeks.
- new data is backed up immediately after arrival: all new files
are combined in a tar archive which is dumped to DDS3 tape.
These tapes should be archived, only one at a time needs to be in the
jukebox. The model for tape handling
is described in a separate note.
- PCKB files are generated daily for all "new" data. This step should
be able to handle the situation where data from one or more boxes arrives
"late" (machine was down, or otherwise unreachable at the time of
daily data transfer). Since these files are only an intermediate
step in the data processing chain, there is no need to archive
them immediately upon creation. Instead, a regular (bi-weekly?)
dump of the filesystem will be enough. Because the PCKB files are
no longer needed after completion of ROOT file generation, a
compression and purging scheme similar to that used for raw_data
is in place.
- ROOT files for the standard analysis are also generated daily.
Immediately following successful creation, the files are stored
on tape with the same tape handling model as used for raw_data.
Being the basis of further detailed analysis, these files will
stay online (disk or tape jukebox) for at least 6 months.
At the moment there is more than enough space to keep all data
from the past months on disk. However, once some ROOT data has to
be purged from disk, we need to implement a retrieve-file-from-tape
program that gets the desired file back from a tape in the jukebox
without much need for (human) operator intervention.
$Id: DataStorage.html,v 1.2 2001/11/23 14:44:47 wilhelm Exp $