20000808

Introduction
------------

The Traceroute Database is a project to store data in an RDBMS and query it,
in the context of the TestTraffic Measurements project.

The project has two main parts:

- The updating part consists of reading the RVEC files provided by the test
  boxes and storing them in the Database.
- The querying part allows people to query the database for specific routes
  at specific timestamps.

Responsible Group: softies (Manuel)

Inputs/outputs/External Progs and Files used
--------------------------------------------

Input files: files from /ncc/ttpro/raw_data/

Several types of files are used:
- RVEC files provided by the test boxes, which contain traceroute information.
- SNDP files provided by the test boxes, which contain timing information.
- VBYE and RBYT files used to fill the DB from ASCII files.

We also use the LIST_OF_TESTBOXES file, which maps a boxname to an identifier.

This has to be replaced by a procedural interface that reads the output of
a command (now "/ncc/ttpro/bin/ttconfig -v LIST_OF_TESTBOXES"), which is
defined as a C-preprocessor macro in the common.h header file.
 

Output: - the output of updatedb is sent to the MySQL database.
        - a CGI interface is provided for querying the DB

OS to run on, memory/disk space/speed limitation
------------------------------------------------

OS: Solaris 2.6

The updating program is very CPU-intensive and may also use large amounts of
memory after running for a long time.

A large amount of disk space will also be required to store the contents of
the DB.

Time required
-------------

Total time: 6 months

Details:
--------

1 - Structure of the Database:

Ranges:

Each box is mapped to an identifier taken from the LIST_OF_TESTBOXES file.
id  name             IP
1   tt01.ripe.net    193.0.0.4
2   osdorp.ripe.net  193.0.0.2
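
A minimal sketch of how the id/name/IP listing above could be loaded into a
lookup table. It assumes the whitespace-separated three-column layout of the
example; the real output of ttconfig may differ, and parse_testboxes is a
hypothetical helper, not part of the project.

```python
def parse_testboxes(text):
    """Map box name -> (id, ip) from 'id name IP' lines (assumed layout)."""
    boxes = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, the header line, and anything malformed.
        if len(fields) != 3 or not fields[0].isdigit():
            continue
        box_id, name, ip = fields
        boxes[name] = (int(box_id), ip)
    return boxes


sample = """id name IP
1 tt01.ripe.net 193.0.0.4
2 osdorp.ripe.net 193.0.0.2
"""
boxes = parse_testboxes(sample)
print(boxes["tt01.ripe.net"])
```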

Routes:
CREATE TABLE Routes (
id int (11) NOT NULL auto_increment,
len int (1) NOT NULL,
crc char (32) NOT NULL,
route TEXT,
PRIMARY KEY (id)
);

id  len  crc         route
43  12   ow4kjke/ee  195.114.80.2 193.112.33.44 ...

The Routes table contains all the routes that were seen at least once.
It is read at initialization and stored in a hash indexed by the MD5 CRC.

Len is the number of hops for a given route.

crc is a unique identifier used to compare routes rapidly.

Route is the route itself, a string composed of IPs in integer format.
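
The role of the crc as a hash key can be sketched as follows. This is an
illustration only: the 32-character hexadecimal digest is an assumption
suggested by the char (32) column, and route_crc and find_or_add_route are
hypothetical names, with a Python dict standing in for the hash built by
get_routes().

```python
import hashlib


def route_crc(route):
    """Hex MD5 digest of the route string (32 characters)."""
    return hashlib.md5(route.encode("ascii")).hexdigest()


def find_or_add_route(routes_by_crc, route):
    """Return (route id, is_new), adding the route if its crc is unknown."""
    crc = route_crc(route)
    if crc in routes_by_crc:
        return routes_by_crc[crc], False
    new_id = len(routes_by_crc) + 1  # stand-in for the auto_increment id
    routes_by_crc[crc] = new_id
    return new_id, True
```

Comparing two routes then costs one dictionary look-up on a fixed-size key
instead of a comparison of two variable-length strings.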

This is confusing. I thought that in one of our early meetings, we agreed
the IPs would be stored as integers in binary format (4 bytes per IP),
but the actual code tells me IPs are stored in decimal ASCII format.
Because the size of the Routes table is much smaller than that of the
Records table, we can live with the present situation, but it would
still be nice to see in the Specs some of the reasons for the change
and why, if you had to go for ASCII, you didn't pick hexadecimal format.

Also, please add that the routes are taken as-is from the RVEC files,
every hop corresponding to a hop reported by traceroute.
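
To make the storage question above concrete, here is a small sketch that
encodes one address in the three formats under discussion. It only
illustrates the size difference; whether the code writes exactly this decimal
form is an assumption, and encodings is a hypothetical helper.

```python
import ipaddress
import struct


def encodings(dotted):
    """Return the address as decimal ASCII, hex ASCII and 4 raw bytes."""
    value = int(ipaddress.IPv4Address(dotted))
    return str(value), format(value, "08x"), struct.pack("!I", value)


decimal, hexa, binary = encodings("195.114.80.2")
print(len(decimal), len(hexa), len(binary))
```

Decimal ASCII needs up to 10 characters per hop, hexadecimal always 8, and
the binary form 4 bytes.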

Records:
CREATE TABLE Records (
id int (11) NOT NULL auto_increment,
src int (11) NOT NULL,
dst int (11) NOT NULL,
routeid int (11) NOT NULL,
tstart int (11) NOT NULL,
tend int (11) NOT NULL,
numrec int (11) NOT NULL,
PRIMARY KEY (id)
);

id    src  dst routeid  tstart     tend       numrec
2129  4   47   149      957219451  957219451  1

The Records table is a compressed form of the raw data from RVEC files.
It contains the source and destination box ids, the id of a route,
the first and last timestamp at which a given route was seen, and the number of
occurrences of that route.

This table is filled during the execution of the updating script - all
data is kept in memory as long as the records show src, dst and routeid
to be identical. When they change, a record is written to the DB.

src is the id of the source box, taken from LIST_OF_TESTBOXES
dst is the id of the target box, taken from LIST_OF_TESTBOXES
routeid is the id of the route, from the Routes table.
tstart and tend are the timestamps recorded at the first and last occurrence
of this record.
numrec is the number of occurrences of this particular record.

2 - Components:

There are several executable programs used in the project:

* updatedb
allows the daily updating of the database from the RVEC files.

Upon initialization, updatedb reads information on ranges (src-dst) from
LIST_OF_TESTBOXES and routes from the Routes table. Two hashes are built
with this information (see get_ranges() and get_routes()).

Updatedb accepts one parameter, a filename. The named file should contain
a list of RVEC files. These files can be in plain text or compressed (.gz)
format. Listed files that do not have a name starting with RVEC are
skipped (this eases daily processing of collected TestTraffic data).
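
The file selection described above could look roughly like this. It is a
sketch under assumptions: the list file is taken to hold one path per line,
"name" is taken to mean the last path component, and rvec_files and
open_maybe_gz are hypothetical helpers rather than functions of updatedb.

```python
import gzip
import os


def rvec_files(list_path):
    """Yield the listed paths whose file name starts with RVEC."""
    with open(list_path) as listing:
        for line in listing:
            path = line.strip()
            if path and os.path.basename(path).startswith("RVEC"):
                yield path


def open_maybe_gz(path):
    """Open a plain-text or gzip-compressed (.gz) file for reading text."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")
```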

For each of these files, we do:
Read the contents of the file, and for each line, create a recstruct record
to store the contents of the line (process_file()). If we discover a new
route, we insert it in the Routes table.

Then, we take the linked list of recstruct records that we got, and we sort
them by src, dst, and timestamp (sort_records()).

Then we process that list of records (process_records()).

We keep a fullrec variable to store the 'ongoing' records.

For each record:
If the record has the same src, dst, and routeid as fullrec, just increment
the number of records in fullrec and set t_end to the timestamp of the new
record.

If not, write fullrec to the Records table and store the new record in fullrec.

If fullrec is empty, try to see if the last written record in the Records table
with the same src and dst has the same routeid. If so, delete that record,
store the data in fullrec, and update fullrec with the new record.
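
The sorting and merging steps above can be sketched as follows. This is not
the project's process_records(): a plain list of dicts stands in for the
Records table, the input tuples are an assumed layout, and the final flush of
fullrec at the end of the input is an assumption the text does not spell out.

```python
def last_index_for_pair(table, src, dst):
    """Index of the most recently written record for src/dst, or None."""
    for i in range(len(table) - 1, -1, -1):
        if table[i]["src"] == src and table[i]["dst"] == dst:
            return i
    return None


def process_records(records, table):
    """Merge (src, dst, timestamp, routeid) tuples into run-length records."""
    fullrec = None
    # Tuple order makes sorted() order by src, dst, then timestamp.
    for src, dst, ts, routeid in sorted(records):
        if fullrec is None:
            # Nothing in progress: try to resume the last stored run.
            i = last_index_for_pair(table, src, dst)
            if i is not None and table[i]["routeid"] == routeid:
                fullrec = table.pop(i)
        if fullrec is not None and (
                fullrec["src"], fullrec["dst"], fullrec["routeid"]
        ) == (src, dst, routeid):
            fullrec["numrec"] += 1
            fullrec["tend"] = ts
            continue
        if fullrec is not None:
            table.append(fullrec)  # the run ended, write it out
        fullrec = {"src": src, "dst": dst, "routeid": routeid,
                   "tstart": ts, "tend": ts, "numrec": 1}
    if fullrec is not None:
        table.append(fullrec)  # final flush, an assumption of this sketch
    return table
```

Resuming the last stored run keeps a route that persists across two daily
updates in a single record instead of splitting it at the day boundary.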

* fillroutes and fillrecords
These two applications are used to fill the Routes and Records tables with data
taken from ASCII files - VBYE for routes, RBYT for records.

The two applications read the file name from the command-line, and process one
line at a time, filling the DB with a record for each line.

They need only to be executed once, in order to put the DB in an up-to-date
state.

The route ids assigned to the VBYE/RBYT records are preserved in the SQL DB
with the exception of the records with routeid=0 (MySQL limitation).
This is not seen as a problem by TTM since that particular route is historic
and has not been used for a long time.
 

* routequery

A CGI interface is also provided for querying the database from the web.
It uses a two-tier system: the back-end of the CGI stays on one machine,
while the front-end is placed on the http server. To get the results, the
front-end CGI calls a well-known port on the machine hosting the back-end,
passes a line containing src, dst, t_start and t_end, and reads all the
lines returned until it finds a line containing a single '.'.

The results are then reformatted by the front-end CGI and displayed on the
web browser.
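
A minimal sketch of the front-end side of this exchange. The space-separated
request line, the host and port arguments, and the names read_reply and
query_backend are assumptions made for illustration; the text only fixes the
four request fields and the terminating '.' line.

```python
import socket


def read_reply(stream):
    """Collect lines from a text stream until a line holding only '.'."""
    lines = []
    for line in stream:
        line = line.rstrip("\r\n")
        if line == ".":
            break
        lines.append(line)
    return lines


def query_backend(host, port, src, dst, t_start, t_end):
    """Send one request line and return the reply lines (terminator excluded)."""
    with socket.create_connection((host, port)) as conn:
        with conn.makefile("rw", encoding="ascii", newline="\n") as stream:
            stream.write(f"{src} {dst} {t_start} {t_end}\n")
            stream.flush()
            return read_reply(stream)
```

Keeping read_reply separate from the socket handling lets the terminator
logic be exercised without a running back-end.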

The back-end (routequery) performs the following operations:
Retrieve all routes between t_start and t_end, plus the last route starting
before t_start, if it falls within a given time limit.
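
The selection rule above might translate into two queries like the ones
below. This is a sketch, not the routequery code: SQLite (Python's sqlite3
module) stands in for MySQL so that the example is self-contained, the time
limit is assumed to apply to tstart, and routes_in_window is a hypothetical
name.

```python
import sqlite3

SCHEMA = """
CREATE TABLE Records (id INTEGER PRIMARY KEY, src INTEGER, dst INTEGER,
                      routeid INTEGER, tstart INTEGER, tend INTEGER,
                      numrec INTEGER);
"""


def routes_in_window(db, src, dst, t_start, t_end, limit):
    """Records starting in [t_start, t_end], preceded by the latest earlier
    record if it started no more than `limit` seconds before t_start."""
    earlier = db.execute(
        "SELECT routeid, tstart, tend FROM Records "
        "WHERE src = ? AND dst = ? AND tstart < ? AND tstart >= ? "
        "ORDER BY tstart DESC LIMIT 1",
        (src, dst, t_start, t_start - limit)).fetchall()
    inside = db.execute(
        "SELECT routeid, tstart, tend FROM Records "
        "WHERE src = ? AND dst = ? AND tstart BETWEEN ? AND ? "
        "ORDER BY tstart",
        (src, dst, t_start, t_end)).fetchall()
    return earlier + inside
```

The earlier record matters because the route in use at t_start usually began
before the requested window.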

* transferdb

This application is used to move the old records and routes from the database
to alternate tables.

First, we look for every record older than a provided time range. These
records are deleted from the Records table and added to Old_Records. All
routes that do not appear any more in Records are removed from Routes and
added to Old_routes (remove_routes()).

Finally, all records in Old_Records older than a certain time range are
removed from Old_Records.
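
The three steps above could be expressed in SQL roughly as follows. This is
a sketch under assumptions: SQLite (Python's sqlite3 module) stands in for
MySQL, the Old_ tables are assumed to have the same columns as the live ones,
"older than" is taken to mean tend below a cutoff timestamp, and transfer_old
is a hypothetical name.

```python
import sqlite3

SCHEMA = """
CREATE TABLE Routes  (id INTEGER PRIMARY KEY, len INTEGER, crc TEXT, route TEXT);
CREATE TABLE Records (id INTEGER PRIMARY KEY, src INTEGER, dst INTEGER,
                      routeid INTEGER, tstart INTEGER, tend INTEGER,
                      numrec INTEGER);
CREATE TABLE Old_routes  AS SELECT * FROM Routes  WHERE 0;
CREATE TABLE Old_Records AS SELECT * FROM Records WHERE 0;
"""


def transfer_old(db, cutoff, purge_before):
    """Archive records older than `cutoff`, archive routes that are no longer
    referenced, then drop archived records older than `purge_before`."""
    unused = "id NOT IN (SELECT routeid FROM Records)"
    with db:  # run all steps in one transaction
        db.execute("INSERT INTO Old_Records SELECT * FROM Records WHERE tend < ?",
                   (cutoff,))
        db.execute("DELETE FROM Records WHERE tend < ?", (cutoff,))
        db.execute("INSERT INTO Old_routes SELECT * FROM Routes WHERE " + unused)
        db.execute("DELETE FROM Routes WHERE " + unused)
        db.execute("DELETE FROM Old_Records WHERE tend < ?", (purge_before,))
```

Doing the copy and the delete inside one transaction avoids losing rows if
the program stops between the two statements.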

* testquery

This is the actual querying part. It is used to query which routeid was
used between two boxes at a given timestamp.

Querydb.c contains the actual query function, match_vector(), which was
specified by the TT group.

testquery takes a filename as a command-line parameter. As for updatedb, the
named file should contain a list of plain text or compressed SNDP files.

First, all data for a given day, with a given source box, is fetched from the
Records table and stored in a hash (see build_data()).

The queries are read from the SNDP files, which contain queries for one day,
from a given box. Every line of the file is read and processed by
match_vector(), using the hash as data source.

After the whole file has been processed, the hash is cleared using
reset_data().
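
The look-up pattern described above can be sketched as follows. The real
match_vector() was specified by the TT group and may apply different matching
rules; match_route below is a hypothetical stand-in that simply returns the
route whose [tstart, tend] interval covers the timestamp, and the row layout
passed to build_data is an assumption.

```python
import bisect


def build_data(rows):
    """Hash keyed by dst, holding (tstart, tend, routeid) sorted by tstart."""
    data = {}
    for dst, tstart, tend, routeid in rows:
        data.setdefault(dst, []).append((tstart, tend, routeid))
    for runs in data.values():
        runs.sort()
    return data


def match_route(data, dst, timestamp):
    """Return the routeid whose interval covers timestamp, else None."""
    runs = data.get(dst, [])
    # Index of the last run that starts at or before the timestamp.
    probe = (timestamp, float("inf"), float("inf"))
    i = bisect.bisect_right(runs, probe) - 1
    if i >= 0 and runs[i][1] >= timestamp:
        return runs[i][2]
    return None
```

Fetching one day of records for one source box up front means each line of
an SNDP file costs a binary search in memory rather than a database query.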

3 - Comments about performance:

Currently it takes about 20 minutes to update the DB with one day of RVEC files
(From tt-ops ticket 23709: "532 files (248407 lines) processed in 1325 seconds").
This is acceptable.

match_vector() performance has been found to be very good. The testquery program
takes between 20 and 30 seconds to assign routeids to 70000 packets sent by
one box in a day.

Performance of the CGI script is influenced by several factors: the load and
speed of the webserver machine, the network, the tracerouteDB machine and the
performance of the routequery program itself. Most important is to have
minimal delay between submitting a request and obtaining the first results in
the browser. In general this performance is acceptable.
 

4 - Other Remarks:

* CRC Calculation:

From the MD5 documentation (RFC 1321):
----
The algorithm takes as input a message of arbitrary length and produces
as output a 128-bit "fingerprint" or "message digest" of the input.
It is conjectured that it is computationally infeasible to produce
two messages having the same message digest, or to produce any
message having a given prespecified target message digest.
----

A 128-bit string can have 2^128 possibilities.

Since the project generates 2.4 million routes per day, assuming that each
route is unique (to make things harder), we may have a maximum of
2400000 * 365 = 876000000 unique routes in one year.

So, the probability that a given route has the same crc as one of these
routes is at most:
876000000 / 2^128 = .00000000000000000000000000000257

Or one chance in 3.88E29.

Since most routes appear several times, that probability is actually much
lower.
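
The arithmetic above can be checked with exact fractions. The second figure
is an addition to the estimate in the text: a birthday-style upper bound on
the chance that any two of the routes share a crc. It is larger than the
single-route figure, roughly 1E-21, but still negligible.

```python
from fractions import Fraction

ROUTES_PER_DAY = 2_400_000
N = ROUTES_PER_DAY * 365   # 876,000,000 routes in one year
SPACE = 2 ** 128           # number of possible MD5 digests

# Chance that one given route hits the digest of any of the N stored routes.
single = Fraction(N, SPACE)

# Upper bound on the chance that any two of the N routes share a digest:
# the number of pairs, N * (N - 1) / 2, divided by the digest space.
any_pair = Fraction(N * (N - 1), 2 * SPACE)

print(f"single route: {float(single):.3g}")
print(f"any pair:     {float(any_pair):.3g}")
print(f"one chance in {float(1 / single):.3g}")
```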

* Dumping/Backup of the database:

The mysqldump tool should be used to dump the database - the syntax is
described in the documentation of MySQL, but a simple command line can be:

mysqldump [OPTIONS] database [tables] > Textfile
 

* AS path info collected by RIS

Please add a few lines here on how in future we could link the traceroutes
collected by TTM to the AS paths collected by RIS. Since the RIS database
has to 'expire' data much faster than TTM, perhaps we should copy the relevant
records from RIS to TTM database? What tables would be needed?
And most important, would anything need to change in the current TTM
Routes & Records tables?
 


$Id: annotated-specs.html,v 1.2 2006/11/04 14:40:49 ruben Exp $