TTM data storage & processing

Design Note

This note describes the processing steps in data analysis and the organisation of data storage for the TestTraffic project.  

Data types

In the project we distinguish three types of files with measurement data:

raw_data
the immediate results of the measurement process are stored on the (remote) testboxes in files starting with the tag RCDP (received data); the format is ASCII. These files record all data received by the box, sent by other TT machines.

A record of all packets sent by the box to other machines is kept in files starting with SNDP (send data). This information can be used to determine packet loss (test traffic sent out but not received by the other box).

In addition there are RVEC* files collecting traceroute information and GENE* files with some general data (mostly informational messages concerning clock & GPS).
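The four raw-data file types above can be told apart by their name tag alone. A minimal sketch, assuming the naming scheme shown in the examples later in this note (tag, box hostname, then a timestamp, separated by dots):

```python
# Map the raw-data name tags described above to their meaning.
# The descriptions are paraphrases of this note, not official labels.
RAW_TAGS = {
    "RCDP": "received test packets",
    "SNDP": "sent test packets",
    "RVEC": "traceroute (routing vector) info",
    "GENE": "general/informational messages (clock & GPS)",
}

def classify(filename):
    """Return the data type for a raw-data file name, or None if unknown."""
    tag = filename.split(".", 1)[0]
    return RAW_TAGS.get(tag)

print(classify("RCDP.tt01.ripe.net.19990401-010158"))
# received test packets
```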

processed_data
After raw data has been transferred to the NCC, two initial processing steps take place:
  1. new traceroute vectors are merged into the existing routing vector database (which records, in the two files VBYE and RBYT, which routing vectors were used at which time)
  2. info from the RCDP, SNDP and GENE files and the routing vector files is merged into PCKB (test PaCKet Binary) files. In this step lost packets are identified, and the most probable route vector is looked up in the TestTraffic traceroute database.
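The packet-loss identification in step 2 amounts to matching sent records (SNDP) against received records (RCDP). A sketch of the idea; the (sender, receiver, seq) record structure is an assumption for illustration only, not the real PCKB layout:

```python
# Identify lost packets: every packet the sender logged (SNDP) that has
# no matching entry in the receiver's log (RCDP) is counted as lost.
# The record fields below are hypothetical, chosen only for this sketch.

def find_lost_packets(sent, received):
    """Return the sent records that never appear in the received set."""
    received_keys = {(r["sender"], r["receiver"], r["seq"]) for r in received}
    return [s for s in sent
            if (s["sender"], s["receiver"], s["seq"]) not in received_keys]

sent = [{"sender": "tt01", "receiver": "tt23", "seq": n} for n in range(3)]
received = [{"sender": "tt01", "receiver": "tt23", "seq": 0},
            {"sender": "tt01", "receiver": "tt23", "seq": 2}]
print(find_lost_packets(sent, received))  # the record with seq == 1
```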

Containing all information for individual TestTraffic measurements, the PCKB files are the basis for the next analysis steps. Because this is a simple, well-known data format under our own control, it is best to keep this intermediate step and not produce ROOT files directly from the raw data. Should problems arise or the format of the ROOT files change, we will not have to redo the (time-consuming) merging of data but can simply restart from the PCKB files.

[Supplementary note (Oct '99): the present PCKB files turn out to be less useful than anticipated: the files are not portable between the BSDI and Solaris operating systems!]

ROOT files
The ROOT package is the tool used for the actual interactive and batch analysis of Test Traffic data. The TTree files created with the package prove to be very useful: an analysis carried out with one particular ROOT file can easily be extended over a wider selected range of files. Thus when it comes to data storage, we can let decisions be steered by efficiency considerations of hardware (tape storage) and operating system (size of files and filesystems); the analysis itself hardly imposes any limits.
 

Organisation of disk space

From the above it is clear we need at least three different areas for the different data types. Under a top level tree for TestTraffic production this would result in a logical structure:

	/ncc/ttpro |- raw_data
                   |
                   |- processed_data
                   |
                   |- root

Processed and raw data should be hierarchically ordered in <year>/<month>/<day> subdirectories;
for example:

              raw_data/1999/04/01/RCDP.tt01.ripe.net.19990401-010158
                       ...
                      /1999/04/22/RCDP.tt23.ripe.net.19990422-090157

Being in ASCII format, raw_data can be stored compressed in gzip format; data processing jobs should check file extensions to see whether a file first needs to be uncompressed.
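Building the year/month/day path and coping with optional gzip compression can be sketched as follows; the helper names are hypothetical, but the path layout matches the example above:

```python
# Sketch: build the <year>/<month>/<day> path for a raw-data file and
# open it whether or not it has been gzip-compressed on disk.
import gzip
import os

def raw_path(base, box, stamp):
    """base/YYYY/MM/DD/RCDP.<box>.<stamp>, per the layout in this note."""
    year, month, day = stamp[:4], stamp[4:6], stamp[6:8]
    return os.path.join(base, year, month, day, "RCDP.%s.%s" % (box, stamp))

def open_raw(path):
    """Open a raw-data file, transparently handling a .gz variant."""
    if os.path.exists(path + ".gz"):
        return gzip.open(path + ".gz", "rt")   # uncompress on the fly
    return open(path, "r")

print(raw_path("raw_data", "tt01.ripe.net", "19990401-010158"))
# raw_data/1999/04/01/RCDP.tt01.ripe.net.19990401-010158
```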

After considering the pros and cons, it was decided to create a one-to-one mapping of PCKB files to ROOT files, i.e. one ROOT file holds all data sent by a specific testbox on a specific date. As one of the first steps in an analysis is the selection of source and target boxes, this splitting at the filesystem level significantly reduces the amount of data to read. Consequently, the storage of ROOT files has the same year/month/day hierarchy as the raw data and processed data storage.
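With one ROOT file per (source box, date), selecting the input for an analysis reduces to enumerating file paths, with no need to scan file contents. A sketch under assumed naming conventions (the per-box ROOT file name below is hypothetical):

```python
# Sketch: enumerate the per-day ROOT files for one source box over a
# date range, following the year/month/day hierarchy described above.
import datetime
import os

def root_files(base, box, first, last):
    """Yield one assumed per-day ROOT file path per date in [first, last]."""
    day = first
    while day <= last:
        yield os.path.join(base, "%04d" % day.year, "%02d" % day.month,
                           "%02d" % day.day, "%s.root" % box)
        day += datetime.timedelta(days=1)

files = list(root_files("root", "tt01.ripe.net",
                        datetime.date(1999, 4, 1), datetime.date(1999, 4, 3)))
print(files[0])  # root/1999/04/01/tt01.ripe.net.root
```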

Note: the above describes a logical structure; using symbolic links, directories are in fact spread over multiple filesystems. See the accompanying note on ginkgo's filesystem organisation.

2nd Note: since April 2001 kauri is the main analysis machine; it is authoritative for all data. ginkgo is the development machine and also the master for software distribution and control. An hourly rsync process takes care of synchronizing (with some exceptions) the /ncc/ttpro hierarchy of kauri and ginkgo.

 

Data processing steps

(also see the pages on the data collection & processing cron job)

$Id: DataStorage.html,v 1.2 2001/11/23 14:44:47 wilhelm Exp $