<condense/>
The condense section controls how and when the GBUdb database is condensed. When the database is condensed, all of the records in the dataset are reviewed and the good and bad counts for each record are reduced by half. Records that end up with zeros in both counts are removed from the dataset, making it smaller.
The condensation process has the effect of reducing the confidence figure for each IP record without significantly altering the probability figure. For example, if an IP has 20 bad events and 10 good events recorded, then after that record is condensed the counts will be 10 bad and 5 good. The probability figure remains the same because the ratio has not changed: there are twice as many bad events as good events both before and after condensation.
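To make the halving step concrete, here is a minimal sketch in Python. GBUdb itself is not written in Python; the record layout and names below are illustrative assumptions, not the engine's actual code.

    # Illustrative only: the real GBUdb record layout and formulas are internal
    # to the engine. Each record here is just a (good, bad) pair of counts.
    def condense(records):
        """Halve every record's counts; drop records that reach zero in both."""
        condensed = {}
        for ip, (good, bad) in records.items():
            good, bad = good // 2, bad // 2
            if good or bad:                    # 0/0 records are removed entirely
                condensed[ip] = (good, bad)
        return condensed

    records = {"192.0.2.10": (10, 20)}         # 10 good, 20 bad
    print(condense(records))                   # {'192.0.2.10': (5, 10)}
    # The bad:good ratio is 2:1 before and after, so the probability figure
    # derived from it is unchanged, while the total evidence (confidence) drops.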
There are two goals for the condensation process:
- Over time the GBUdb system forgets about IPs that it doesn't see in its email traffic.
- The size of the GBUdb dataset stays manageable without losing the data that matters most.
Condensation is normally triggered once per day so that all IP statistics condense at a steady, predictable rate. However, it is possible to use other triggers to handle special situations. The possible triggers are:
- The time-trigger - active by default, once per day.
- The posts-trigger - causes condensation after a fixed number of posts.
- The records-trigger - causes condensation when a record-count* limit is reached.
- The size-trigger - causes condensation when a size* limit is reached.
* The size in bytes required for the GBUdb dataset is not directly related to the number of IP records stored. This is because the structure of the GBUdb database is a fixed hierarchical tree - a structure that guarantees a small, fixed maximum cost per query or update. The GBUdb database technology is extremely fast and predictable. In the worst case, only a few dozen lines of code (only a few hundred machine instructions) must be executed to complete a query or an update.
To put this in perspective, one of the steps in our regression test bed for GBUdb creates, updates, and destroys several million random IPs. On a generic laptop computer that step typically requires fewer than 10 seconds. In contrast, a very busy mail server in an enterprise grade data center might make a few thousand queries per minute. If that rate were 10,000 queries per minute (about twice as high as we've seen so far), that is roughly 167 queries per second, so in that same 10 seconds only about 1700 queries would be required.
If the IP records that are stored are largely from contiguous network blocks then the amount of memory required is very small. If the IP records are essentially random then the amount of memory required is larger. In practice, the IPs stored in a GBUdb node are a mix, but they frequently fall into contiguous network blocks due to the way bot-nets are herded. For example, "Class C" blocks allocated to subscriber networks and dial-up networks are likely to be the source of multiple infected PCs.
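To illustrate why a fixed-depth tree gives a bounded query cost, and why contiguous blocks are cheap to store, here is a simplified sketch keyed by the four IPv4 octets. This is an illustration of the concept only; it is not GBUdb's actual internal structure.

    # Simplified illustration only -- not GBUdb's actual internal structure.
    # A lookup or update walks at most four fixed levels (one per octet), so
    # the cost per query or update is small and bounded.
    class OctetTree:
        def __init__(self):
            self.root = {}

        def update(self, ip, good=0, bad=0):
            o1, o2, o3, o4 = (int(p) for p in ip.split("."))
            leaf = self.root.setdefault(o1, {}).setdefault(o2, {}).setdefault(o3, {})
            g, b = leaf.get(o4, (0, 0))
            leaf[o4] = (g + good, b + bad)

        def query(self, ip):
            o1, o2, o3, o4 = (int(p) for p in ip.split("."))
            return self.root.get(o1, {}).get(o2, {}).get(o3, {}).get(o4)

    t = OctetTree()
    t.update("203.0.113.7", bad=1)
    t.update("203.0.113.8", bad=1)   # same "Class C" block: reuses the 203.0.113 path
    print(t.query("203.0.113.7"))    # (0, 1)

IPs from the same block share the upper levels of the tree, which is why contiguous bot-net sources add comparatively little memory per record.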
Both the size-trigger and records-trigger are somewhat arbitrary, but due to the way condensation works they don't usually cause any harm to the way GBUdb functions. In general, IPs that are seen more frequently still have higher confidence figures and IPs that are no longer active eventually disappear. All along the way, an IP's probability of sending spam is reflected with reasonable accuracy.
As a result, it is perfectly safe to use size-trigger and/or records-trigger to limit the amount of RAM that might be required by GBUdb. This can be important in an OEM appliance application, for example. See the size-trigger section for some examples of real-world telemetry that will help you estimate your GBUdb RAM requirements. In general, we get very few questions about this -- most systems never notice the RAM used by GBUdb.
Another important point about size-trigger and records-trigger condensation is that one condensation cycle may not satisfy the constraint, and that either of these triggers might fire immediately after one of the others. Since the condensation process is relatively expensive and takes a lot of attention from the CPU, we have included a guard time to ensure that condensation events don't pile up and cause too much CPU utilization.
<condense minimum-seconds-between='600'>
By default, the minimum-seconds-between='600' attribute ensures that there will be at least 10 minutes (600 seconds) between condensation cycles, no matter which triggers are in effect.
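As a hypothetical sketch of how the triggers and the guard time interact, the following outlines the decision logic described above. The class, parameter names, and structure are illustrative assumptions; only the 600 second guard time and the once-per-day time-trigger come from this document.

    # Hypothetical sketch only -- names and structure are illustrative,
    # not the engine's actual configuration or API.
    import time

    MINIMUM_SECONDS_BETWEEN = 600                 # default guard time: 10 minutes

    class CondenseScheduler:
        def __init__(self, time_seconds=86400, posts_limit=None,
                     records_limit=None, size_limit_bytes=None):
            self.time_seconds = time_seconds          # time-trigger: daily by default
            self.posts_limit = posts_limit            # posts-trigger, if set
            self.records_limit = records_limit        # records-trigger, if set
            self.size_limit_bytes = size_limit_bytes  # size-trigger, if set
            self.last_condense = time.time()
            self.posts_since_condense = 0

        def should_condense(self, record_count, size_bytes):
            now = time.time()
            # Guard time: whatever the triggers say, never condense again
            # within MINIMUM_SECONDS_BETWEEN of the last condensation.
            if now - self.last_condense < MINIMUM_SECONDS_BETWEEN:
                return False
            if now - self.last_condense >= self.time_seconds:
                return True                           # time-trigger
            if self.posts_limit and self.posts_since_condense >= self.posts_limit:
                return True                           # posts-trigger
            if self.records_limit and record_count >= self.records_limit:
                return True                           # records-trigger
            if self.size_limit_bytes and size_bytes >= self.size_limit_bytes:
                return True                           # size-trigger
            return False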
Please email [email protected] with any questions.