Introduction
------------
The Traceroute Database is a project to store traceroute data in an RDBMS and
to query it, in the context of the TestTraffic Measurements project.
The project has two main parts:
- The updating part consists of reading the RVEC files provided by the test
  boxes and storing them in the Database.
- The querying part allows people to query the database for specific routes
  at specific timestamps.
Responsible Group: softies (Manuel)
Inputs/outputs/External Progs and Files used
--------------------------------------------
Input files: files from /ncc/ttpro/raw_data/
Several types of files are used:
- RVEC files provided by the test boxes, which contain traceroute information.
- SNDP files provided by the test boxes, which contain timing information.
- VBYE and RBYT files, which are ASCII files used to fill the DB.
We also use the LIST_OF_TESTBOXES file, which maps a box name to an identifier.
This has to be replaced by a procedural interface that reads the output of a
command (now "/ncc/ttpro/bin/ttconfig -v LIST_OF_TESTBOXES"), which is defined
as a C-preprocessor macro in the common.h header file.
Output: - the output of updatedb is sent to the MySQL database.
        - a CGI interface is provided for querying the DB.
OS to run on, memory/disk space/speed limitation
------------------------------------------------
OS: Solaris 2.6
The updating program is very CPU-intensive and may use large amounts of memory
after running for a long time.
A large amount of disk space will also be required to store the contents of
the DB.
Time required
-------------
Total time: 6 months
Details:
--------
1 - Structure of the Database:
Ranges:
Each box is mapped to an identifier taken from the LIST_OF_TESTBOXES file.
id  name             IP
1   tt01.ripe.net    193.0.0.4
2   osdorp.ripe.net  193.0.0.2
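
The mapping above could be read through a small procedural interface. The
following is a minimal sketch only: it assumes each line has the form
"id name IP" as in the sample, and struct testbox, parse_testbox_line() and
read_testboxes() are hypothetical names, not taken from the project sources.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical record for one line of the test-box list. */
struct testbox {
    int  id;        /* identifier used as src/dst in the Records table */
    char name[64];  /* host name, e.g. tt01.ripe.net */
    char ip[16];    /* dotted-quad IPv4 address */
};

/* Parse one "id name IP" line. Returns 1 on success, 0 otherwise. */
int parse_testbox_line(const char *line, struct testbox *box)
{
    assert(line != NULL && box != NULL);
    return sscanf(line, "%d %63s %15s", &box->id, box->name, box->ip) == 3;
}

/*
 * Read every parsable line from `in` into `boxes` (at most `max` entries).
 * Lines that do not match, such as a header line, are skipped.
 * Returns the number of boxes stored.
 */
int read_testboxes(FILE *in, struct testbox *boxes, int max)
{
    char line[256];
    int n = 0;

    assert(in != NULL && boxes != NULL);
    while (n < max && fgets(line, sizeof line, in) != NULL) {
        if (parse_testbox_line(line, &boxes[n]))
            n++;
    }
    return n;
}
```

A caller could pass any open stream, for instance one obtained with POSIX
popen() on the ttconfig command; that wiring is not shown here.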
Routes:
CREATE TABLE Routes (
id int (11) NOT NULL auto_increment,
len int (1) NOT NULL,
crc char (32) NOT NULL,
route TEXT,
PRIMARY KEY (id)
);
id  len  crc         route
43  12   ow4kjke/ee  195.114.80.2 193.112.33.44 ...
The Routes table contains all the routes that were seen at least once.
It is read at initialization and stored in a hash indexed by the MD5 CRC.
len is the number of hops for a given route.
crc is a unique identifier used to compare routes rapidly.
route is the route itself, a string composed of IPs in integer format.
This is confusing. I thought that in one of our early meetings, we agreed the
IPs would be stored as integers in binary format (4 bytes per IP), but the
actual code tells me IPs are stored in decimal ASCII format. Because the size
of the Routes table is much smaller than that of the Records table, we can
live with the present situation, but it would still be nice to see in the
Specs some of the reasons for the change and why, if you had to go for ASCII,
you didn't pick hexadecimal format.
Also, please add that the routes are taken as-is from the RVEC files, every
hop corresponding to a hop reported by traceroute.
Records:
CREATE TABLE Records (
id int (11) NOT NULL auto_increment,
src int (11) NOT NULL,
dst int (11) NOT NULL,
routeid int (11) NOT NULL,
tstart int (11) NOT NULL,
tend int (11) NOT NULL,
numrec int (11) NOT NULL,
PRIMARY KEY (id)
);
id    src  dst  routeid  tstart     tend       numrec
2129  4    47   149      957219451  957219451  1
The Records table is a compressed form of the raw data from RVEC files.
It contains the source and destination box ids, the id of a route, the first
and last timestamp at which a given route was seen, and the number of
occurrences of that route.
This table is filled during the execution of the updating script: all data is
kept in memory as long as the records show src, dst and routeid to be
identical. When they change, a record is written to the DB.
src is the id of the source box, taken from LIST_OF_TESTBOXES.
dst is the id of the target box, taken from LIST_OF_TESTBOXES.
routeid is the id of the route, from the Routes table.
tstart and tend are the timestamps recorded at the first and last occurrence
of this record.
numrec is the number of occurrences of this particular record.
2 - Components:
There are several executable programs used in the project:
* updatedb
allows the daily updating of the database from the RVEC files.
Upon initialization, updatedb reads information on ranges (src-dst) from
LIST_OF_TESTBOXES and routes from the Routes table. Two hashes are built with
this information (see get_ranges() and get_routes()).
updatedb accepts one parameter, a filename. The named file should contain a
list of RVEC files. These files can be in plain text or compressed (.gz)
format. Listed files whose names do not start with RVEC are skipped (this
eases daily processing of collected TestTraffic data).
For each of these files, we do the following:
Read the contents of the file and, for each line, create a recstruct record to
store the contents of the line (process_file()). If we discover a new route,
we insert it in the Routes table.
Then, we take the linked list of recstruct records that we got, and we sort
them by src, dst, and timestamp (sort_records()).
Then we process that list of records (process_records()).
We keep a fullrec variable to store the 'ongoing' record.
For each record:
- If the record has the same src, dst, and routeid as fullrec, just increment
  the number of records in fullrec and set its t_end to the timestamp of the
  new record.
- If not, write fullrec to the Records table and store the new record in
  fullrec.
- If fullrec is empty, check whether the last record written to the Records
  table with the same src and dst has the same routeid. If so, delete that
  record, store its data in fullrec, and update fullrec with the new record.
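
The run-merging step of process_records() can be sketched as follows. This is
an illustration under assumptions, not the project code: struct rec, struct
fullrec and compress_records() are hypothetical names, the output array stands
in for the writes to the Records table, and the special case of re-opening the
last record already stored in the DB is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical in-memory form of one RVEC line after route lookup. */
struct rec {
    int  src, dst, routeid;
    long ts;                    /* timestamp of the measurement */
};

/* One row of the Records table. */
struct fullrec {
    int  src, dst, routeid;
    long tstart, tend;
    int  numrec;
};

/*
 * Collapse records, already sorted by src, dst and timestamp, into runs
 * with identical (src, dst, routeid). At most `max` rows are written to
 * `out`; the return value is the number of rows written.
 */
size_t compress_records(const struct rec *in, size_t n,
                        struct fullrec *out, size_t max)
{
    struct fullrec cur = {0, 0, 0, 0, 0, 0};
    size_t written = 0;
    size_t i;
    int have_cur = 0;

    assert(in != NULL || n == 0);
    assert(out != NULL);

    for (i = 0; i < n; i++) {
        if (have_cur && cur.src == in[i].src && cur.dst == in[i].dst &&
            cur.routeid == in[i].routeid) {
            /* Same route as the ongoing record: extend it. */
            cur.numrec++;
            cur.tend = in[i].ts;
            continue;
        }
        /* Route changed: flush the ongoing record, then start a new one. */
        if (have_cur && written < max)
            out[written++] = cur;
        cur.src = in[i].src;
        cur.dst = in[i].dst;
        cur.routeid = in[i].routeid;
        cur.tstart = in[i].ts;
        cur.tend = in[i].ts;
        cur.numrec = 1;
        have_cur = 1;
    }
    if (have_cur && written < max)
        out[written++] = cur;   /* flush the last ongoing record */
    return written;
}
```

Note that a route which reappears after a different route starts a new row;
only consecutive occurrences are merged.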
* fillroutes and fillrecords
These two applications are used to fill the Routes and Records tables with
data taken from ASCII files: VBYE for routes, RBYT for records.
The two applications read the file name from the command line and process one
line at a time, filling the DB with a record for each line.
They need to be executed only once, in order to put the DB in an up-to-date
state.
The route ids assigned to the VBYE/RBYT records are preserved in the SQL DB,
with the exception of the records with routeid=0 (MySQL limitation).
This is not seen as a problem by TTM since that particular route is historic
and has not been used for a very long time.
* routequery
A CGI interface must also be provided for querying the database from the web.
In order to do that, we use a two-tier system: the back-end of the CGI remains
on one machine, while the front-end is put on the http server. To get the
results, the front-end CGI calls a well-known port on the machine hosting the
back-end, passes a line containing src, dst, t_start and t_end, and reads all
the lines returned until it finds a line containing a single '.'.
The results are then reformatted by the front-end CGI and displayed in the
web browser.
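
The way the front-end reads a reply up to the terminating '.' line can be
sketched as below. This is a hedged illustration: is_end_of_reply() and
read_reply() are hypothetical names, and `in` is assumed to be a stdio stream
already connected to the back-end (the socket setup is not shown).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Return 1 if `line` (without its line ending) is the terminator ".". */
int is_end_of_reply(const char *line)
{
    assert(line != NULL);
    return strcmp(line, ".") == 0;
}

/*
 * Copy reply lines from `in` to `out` until the terminator line is seen.
 * Returns the number of lines copied, or -1 if the stream ended before
 * the terminator arrived. Lines longer than the buffer are read in
 * pieces, so this sketch assumes reply lines shorter than 1024 bytes.
 */
int read_reply(FILE *in, FILE *out)
{
    char line[1024];
    int count = 0;

    assert(in != NULL && out != NULL);
    while (fgets(line, sizeof line, in) != NULL) {
        line[strcspn(line, "\r\n")] = '\0';   /* strip the line ending */
        if (is_end_of_reply(line))
            return count;
        fprintf(out, "%s\n", line);
        count++;
    }
    return -1;
}
```

Returning -1 on a missing terminator lets the front-end tell a truncated
reply apart from an empty result, which is returned as 0.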
The back-end (routequery) performs the following operations:
Retrieve all routes between t_start and t_end, plus the last route starting
before t_start, if it falls within a given time limit.
* transferdb
This application is used to move the old records and routes from the database
to alternate tables.
First, we look for every record older than a provided time range. These
records are deleted from the Records table and added to Old_Records. All
routes that no longer appear in Records are removed from Routes and added to
Old_routes (remove_routes()).
Finally, all records in Old_Records older than a certain time range are
removed from Old_records.
* testquery
This is the actual querying part. It is used to query which routeid was used
between two boxes at a given timestamp.
Querydb.c contains the actual query function, match_vector(), which was
specified by the TT group.
testquery takes a filename as a command-line parameter. The named file, as for
updatedb, should contain a list of plain text or compressed SNDP files.
First, all data for a given day, with a given source box, is fetched from the
Records table and stored in a hash (see build_data()).
The queries are read from the SNDP files, which contain queries for one day,
from a given box. Every line of the file is read and processed by
match_vector(), using the hash as data source.
After the whole file has been processed, the hash is cleared using
reset_data().
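
The kind of lookup described here (which routeid was in use at a given
timestamp) can be sketched as a binary search over the intervals of one
src-dst pair. This is not match_vector() as specified by the TT group; struct
interval and route_at() are hypothetical, and the intervals are assumed to be
sorted by tstart and non-overlapping.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical view of one Records row for a fixed (src, dst) pair. */
struct interval {
    long tstart, tend;   /* first and last time the route was seen */
    int  routeid;
};

/*
 * Return the routeid of the interval containing `ts`, or -1 if `ts`
 * falls outside every interval. `recs` must be sorted by tstart.
 */
int route_at(const struct interval *recs, size_t n, long ts)
{
    size_t lo = 0, hi = n;

    assert(recs != NULL || n == 0);

    /* Find the first interval whose tstart is greater than ts. */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (recs[mid].tstart <= ts)
            lo = mid + 1;
        else
            hi = mid;
    }
    if (lo == 0)
        return -1;                       /* ts precedes all intervals */
    /* The candidate is the last interval starting at or before ts. */
    return ts <= recs[lo - 1].tend ? recs[lo - 1].routeid : -1;
}
```

With one such sorted array per destination box, each packet read from an SNDP
file costs a logarithmic number of comparisons.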
3 - Comments about performance:
Currently it takes about 20 minutes to update the DB with one day of RVEC
files (from tt-ops ticket 23709: "532 files (248407 lines) processed in 1325
seconds"). This is acceptable.
match_vector() performance has been found to be very good. The testquery
program takes between 20 and 30 seconds to assign routeids to 70000 packets
sent by one box in a day.
Performance of the CGI script is influenced by several factors: the load and
speed of the webserver machine, the network, the tracerouteDB machine, and the
performance of the routequery program itself.
Most important is to have minimal delay between submitting a request and
obtaining the first results in the browser. In general this performance is
acceptable.
4 - Other Remarks:
* CRC Calculation:
From the MD5 documentation (RFC 1321):
----
The algorithm takes as input a message of arbitrary length and produces as
output a 128-bit "fingerprint" or "message digest" of the input. It is
conjectured that it is computationally infeasible to produce two messages
having the same message digest, or to produce any message having a given
prespecified target message digest.
----
A 128-bit string can have 2^128 possible values.
Since the project generates 2.4 million routes per day, assuming that each
route is unique (to make things harder), we may have a maximum of
2400000 * 365 = 876000000 unique routes in one year.
So, the probability that one new route has the same crc as any of these
routes is at most:
876000000 / 2^128 = .00000000000000000000000000000257
Or one chance in 3.88E29.
The probability that any two of the 876000000 routes share a crc (the
birthday bound) is about 876000000^2 / 2^129 = 1.1E-21, which is still
negligible.
Since most routes appear several times, the actual probability is much lower.
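
The arithmetic above can be checked with a few lines of C. This is a sketch
for verification only; power_of_two(), single_collision() and any_collision()
are hypothetical helper names. The second function uses the standard birthday
approximation n(n-1)/2^(bits+1), which is valid while the result is much
smaller than 1.

```c
#include <assert.h>

/* 2^bits as a double, built by repeated doubling (exact for bits < 1024). */
double power_of_two(int bits)
{
    double v = 1.0;

    assert(bits >= 0 && bits < 1024);
    while (bits-- > 0)
        v *= 2.0;
    return v;
}

/* Chance that one new digest equals any of n stored digests (upper bound). */
double single_collision(double n, int bits)
{
    return n / power_of_two(bits);
}

/* Birthday approximation: chance that any two of n digests coincide. */
double any_collision(double n, int bits)
{
    return n * (n - 1.0) / (2.0 * power_of_two(bits));
}
```

Calling single_collision(876000000.0, 128) gives roughly 2.57E-30, and
any_collision(876000000.0, 128) gives roughly 1.1E-21.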
* Dumping/Backup of the database:
The mysqldump tool should be used to dump the database. The syntax is
described in the documentation of MySQL, but a simple command line can be:
mysqldump [OPTIONS] database [tables] > Textfile
* AS path info collected by RIS
Please add a few lines here on how in future we could link the traceroutes
collected by TTM to the AS paths collected by RIS. Since the RIS database has
to 'expire' data much faster than TTM, perhaps we should copy the relevant
records from RIS to the TTM database? What tables would be needed?
And most important, would anything need to change in the current TTM
Routes & Records tables?