TTM daily processing

cleanup

An essential ingredient of daily processing is ensuring enough free disk space is available to store newly collected and processed data. This is the task of the cleanup script. The requirements for this are:
1. make sure enough free space exists to store all new data.
Although it is difficult to predict the amount of new data that will arrive (or be created), a good approximation is to take the maximum amount of space occupied by a single day's data in the last week and multiply it by two. The targeted amount of free space to be obtained by cleanup is this number or 5% of the total available space, whichever is larger.
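The free-space target described above can be sketched as follows. This is an illustration only: the function name and its arguments are assumptions, not part of the actual cleanup script.

```python
def freespace_target(daily_usage_last_week, total_space):
    """Compute the cleanup free-space target in bytes.

    daily_usage_last_week: bytes occupied per day for each of the last days;
    total_space: total size of the data partition in bytes.
    """
    # Expected new data: twice the largest single-day volume seen last week.
    expected = 2 * max(daily_usage_last_week)
    # Never target less than 5% of the partition.
    return max(expected, 0.05 * total_space)
```

For example, with daily volumes peaking at 30 units on a 1000-unit partition, the target is max(60, 50) = 60 units.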

2. keep data on disk as long as possible.
Compression can help reduce the occupied space, thus allowing data for more days to stay on-line.

specification

The requirements are met by progressively compressing and removing old files (where old refers to the time of data-taking, not the time a file arrived at or was created on the central machine). The main logic of the cleanup script is:

	compress files older than Z days
	purge files older than P days

	while  (freespace < target) AND (Z > minimal-uncompressed-days)
	do
		Z = Z - 1
		compress files older than Z days
		determine new freespace 
	done

	while  (freespace < target) AND (P > minimal-keep-days)
	do
		P = P - 1
		purge files older than P days
		determine new freespace 
	done
where minimal-uncompressed-days is the number of days for which files should remain stored in uncompressed format and minimal-keep-days is the number of days the files must be kept on disk.

These two parameters form hard boundary conditions for cleanup. If the targeted amount of free space cannot be reached within these conditions, the script returns error code 2.
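The loop above, including the error-code-2 boundary condition, can be sketched in Python. The file operations are passed in as callables so the control flow can be exercised on its own; their names and the function signature are assumptions, not the real implementation.

```python
def cleanup(target, z, p, min_uncompressed_days, min_keep_days,
            compress_older_than, purge_older_than, freespace):
    """Sketch of the cleanup control flow; returns 0 on success, 2 on failure."""
    compress_older_than(z)
    purge_older_than(p)

    # First try to win space by compressing ever-younger files ...
    while freespace() < target and z > min_uncompressed_days:
        z -= 1
        compress_older_than(z)

    # ... then, if still short, by purging ever-younger files.
    while freespace() < target and p > min_keep_days:
        p -= 1
        purge_older_than(p)

    # Hard boundaries reached without meeting the target: error code 2.
    return 0 if freespace() >= target else 2
```

This can be tested with stub operations that simply credit some free space per call, without touching any real files.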

output and return codes

While traversing the (raw or processed) data hierarchy, cleanup collects a list of files and directories whose names do not match the expected patterns for TTM production files, i.e. they do not start with any of the keywords GENE, PCKB, RCDP, RVEC or SNDP. Most often these unexpected files are leftovers from a previous run of the collect_data program. On Feb 2, 2000, for example, cleanup reported:

Files/directories left after cleanup; check out manually:
 
/ncc/ttpro/raw_data/2000/01/28/old.GENE.tt37.ripe.net.20000128
/ncc/ttpro/raw_data/2000/01/28/old.RCDP.tt37.ripe.net.20000128-000005
/ncc/ttpro/raw_data/2000/01/28/old.SNDP.tt36.ripe.net.20000128
/ncc/ttpro/raw_data/2000/01/29/old.GENE.tt37.ripe.net.20000129
/ncc/ttpro/raw_data/2000/01/29/old.RCDP.tt37.ripe.net.20000129-000005
/ncc/ttpro/raw_data/2000/01/29/old.RCDP.tt37.ripe.net.20000129-010154
/ncc/ttpro/raw_data/2000/01/29/old.RCDP.tt37.ripe.net.20000129-030155
/ncc/ttpro/raw_data/2000/01/29/old.RCDP.tt37.ripe.net.20000129-050153
/ncc/ttpro/raw_data/2000/01/29/old.SNDP.tt36.ripe.net.20000129
/ncc/ttpro/raw_data/2000/01/30/old.GENE.tt37.ripe.net.20000130
/ncc/ttpro/raw_data/2000/01/30/old.RCDP.tt37.ripe.net.20000130-000005
/ncc/ttpro/raw_data/2000/01/30/old.RCDP.tt37.ripe.net.20000130-010154
/ncc/ttpro/raw_data/2000/01/30/old.RCDP.tt37.ripe.net.20000130-030155
/ncc/ttpro/raw_data/2000/01/30/old.RCDP.tt37.ripe.net.20000130-050154
/ncc/ttpro/raw_data/2000/01/30/old.SNDP.tt36.ripe.net.20000130
The manual check involves:
  1. for each file/directory mentioned in the list consider whether it needs to be preserved or not
  2. remove those files/directories which are not needed
  3. properly rename (or move) files that must be preserved
Usually, the old.* test-traffic files can all be purged, but in some problematic cases (e.g. bad connectivity) the old version might be the better one. It could also happen that other important files or directories are accidentally created within the TTM data hierarchy. Therefore, instead of blindly removing every file that does not match an expected pattern, cleanup leaves the judgement to a human operator.
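The pattern check itself amounts to comparing each basename against the five production keywords. A minimal sketch, assuming the keyword list from the text (the function name and report handling are simplified assumptions):

```python
import os

# Expected prefixes of TTM production files, as listed above.
KEYWORDS = ("GENE", "PCKB", "RCDP", "RVEC", "SNDP")

def unexpected_files(paths):
    """Return the paths whose basenames do not start with a TTM keyword."""
    return [p for p in paths
            if not os.path.basename(p).startswith(KEYWORDS)]
```

Note that a file such as old.GENE.tt37.ripe.net.20000128 is flagged because its name starts with "old.", not with one of the keywords themselves.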

Note: if it is decided that all files can be removed, one can get this done quickly by cutting the reported list from the output and pasting it into an xargs rm command.