Genezzo is an open source SQL database which is written in Perl and runs on a variety of platforms. Genezzo is distinguished from other databases by its uniquely modular and extensible design, which is motivated by a series of operational and architectural goals.
The initial release of Genezzo is a very simple database microkernel. While the Genezzo prototype does not satisfy all of the operational goals, the design was guided by the architectural goals of flexibility and extensibility. Because of this emphasis, new capabilities can be added to the prototype to extend the basic features to fulfill the operational goals. Developers may also write packages that extend Genezzo to suit their particular needs.
The Havok subsystem is used to extend Genezzo. Havok extensions are organized into modules (similar to CPAN modules or Debian Apt-get packages) which can replace existing Genezzo functions or add new capabilities. Each Havok module has a metadata file which describes the module and its dependencies. Genezzo has a SQL function called HavokUse which uses these metadata files to load Havok modules. Currently, the database must be restarted after the module is loaded, but Havok will be extended to support dynamic load and unload for most modules.
Many SQL implementations allow user-defined functions, a way for users to add their own functions which can be evaluated as part of a SQL statement. In the Genezzo implementation of SQL, the only "built-in" SQL function is HavokUse. The rest of the standard SQL functions are defined in a Havok module which is loaded as part of database initialization.
The most ambitious Havok module to date is Eric Rollins' Genezzo::Contrib::Clustered, which converts an existing single-user Genezzo database into a clustered, multi-process, multi-server database with locking and transaction support. This module is layered on top of the base code using only half a dozen "hooks" into the Genezzo buffer cache.
All space management in Genezzo is done using fixed-size data blocks. The allocator usually returns a set of multiple, contiguous blocks which is referred to as an extent. In each file, all the extents for a particular database object, such as a table, are collectively called a segment. The base Genezzo space management is a simple, serial extent allocator which tracks a single highwater mark per segment. An extent is used until it runs out of space, at which point a new extent is allocated. Rather than replace the existing code, a new Havok module is under construction that adds more sophisticated allocation algorithms, like maintaining multiple lists of open extents for parallel updates, and more block usage statistics stored as metadata so empty extents can be re-used. This approach has several advantages:
Havok modules can take advantage of special features of Genezzo that let them modify persistent data structures, as well as dynamic data structures used by the running program.
Because Genezzo is a database, it is only natural that some Havok modules make changes to persistent state. Genezzo has a variety of extensible persistent storage mechanisms:
When a Genezzo database is initialized, it creates about a dozen tables that form the data dictionary. These tables define and describe the database layout, the table definitions, etc. For example, the _pref1 table holds key/value pairs that are used for database initialization. Havok modules may add new parameters to the _pref1 table, query or modify other existing dictionary tables, or create, update, and query their own tables.
Every Genezzo database file starts with a fileheader which contains key/value pairs that list the database version, the blocksize, and other essential information. The test Prefs1.t shows some of the APIs which are used to query and update the fileheader. The fileheader parameters are useful because they can be queried and modified before the database is started up -- otherwise, _pref1 parameters are a better choice.
The basic unit of storage in Genezzo is a fixed-size database block which is defined by the RDBlock class. RDBlock is implemented as a tied hash, so the contents of the database block are manipulated using the familiar Perl hash interface. In addition to the standard interface, RDBlock supports special metadata entries. While these entries consume space in the block, they are neither visible nor accessible through the standard hash interface.
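The idea of a tied hash with hidden metadata entries can be sketched in a few lines of Perl. This is only an illustration under stated assumptions: the package Sketch::Block and its set_meta and get_meta methods are hypothetical and are not the RDBlock API, and the sketch keeps everything in memory instead of packing it into a fixed-size block.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a block class; NOT the real RDBlock API.
package Sketch::Block;

sub TIEHASH { my ($class) = @_; return bless { rows => {}, meta => {} }, $class }

# The standard tied-hash interface only ever touches the data rows.
sub STORE    { my ($self, $k, $v) = @_; $self->{rows}{$k} = $v }
sub FETCH    { my ($self, $k) = @_; return $self->{rows}{$k} }
sub EXISTS   { my ($self, $k) = @_; return exists $self->{rows}{$k} }
sub DELETE   { my ($self, $k) = @_; return delete $self->{rows}{$k} }
sub FIRSTKEY {
    my ($self) = @_;
    keys %{ $self->{rows} };                  # reset the iterator
    return scalar each %{ $self->{rows} };
}
sub NEXTKEY  { my ($self) = @_; return scalar each %{ $self->{rows} } }

# Metadata is reachable only through the object, never through the hash.
sub set_meta { my ($self, $k, $v) = @_; $self->{meta}{$k} = $v }
sub get_meta { my ($self, $k) = @_; return $self->{meta}{$k} }

package main;

my $obj = tie my %block, 'Sketch::Block';
$block{1} = 'first row';
$block{2} = 'second row';
$obj->set_meta(txn_status => 'committed');    # hypothetical metadata entry

my @visible = sort keys %block;               # metadata key does not appear
my $meta    = $obj->get_meta('txn_status');
print "visible rows: @visible\n";
print "hidden metadata: $meta\n";
```

Code that only knows the hash interface sees two rows, while a module holding the tied object can still read and write the metadata entry.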
Metadata rows should only be used to store information that is directly related to the set of data rows in the current block. A dictionary table is a more appropriate location for "general" metadata about the contents of a table. For example, the cluster code uses metadata rows to track the transaction status of database blocks.
B-tree indexes use the block metadata to define a block traversal order -- each block has metadata rows that point to its children and siblings in the tree.
The Genezzo utilities module contains several functions for packing and unpacking SQL rows or arrays of data to and from a byte string storage format. The PackRow2 function has the provision to add a single metadata column to a row. Currently, this column is used to support rows that span multiple blocks -- the extra column is a chaining pointer that identifies the next piece of the row.
Genezzo supports a number of extensions that allow for the execution of additional code at certain points in the program and for changes to run-time data structures:
The line-mode tool gendba.pl supports a -define parameter which takes a key=value pair as an argument. Some valid keys are dbsize, blocksize, and force_init_db. If the user supplies parameters with unknown keys during database initialization, Genezzo will automatically add these values to the _pref1 dictionary table. If the database is already initialized, the _pref1 table is not updated, but Havok modules can use a dictionary API to view the current command-line definitions.
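The split between recognized keys and unknown keys can be illustrated with a short Perl sketch. It is not the gendba.pl implementation: the function split_defines, the example key my_module_flag, and the value 4096 are all hypothetical, and the list of known keys contains only the three named above.

```perl
use strict;
use warnings;

# Keys named in the text; the real tool may recognize others as well.
my %known = map { $_ => 1 } qw(dbsize blocksize force_init_db);

# Split "key=value" definitions into recognized settings and extras
# (the ones that would be candidates for the _pref1 table at init time).
sub split_defines {
    my @defs = @_;
    my (%settings, %extras);
    for my $def (@defs) {
        my ($key, $value) = split /=/, $def, 2;
        die "malformed definition '$def'\n" unless defined $value;
        if ($known{$key}) { $settings{$key} = $value }
        else              { $extras{$key}   = $value }
    }
    return (\%settings, \%extras);
}

my ($settings, $extras) = split_defines('blocksize=4096', 'my_module_flag=on');
print "settings: ", join(', ', map { "$_=$settings->{$_}" } sort keys %$settings), "\n";
print "extras:   ", join(', ', map { "$_=$extras->{$_}" }   sort keys %$extras),   "\n";
```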
The Havok subsystem loads additional modules that change and extend the behavior of Genezzo. Genezzo supplies two "top-level" Havok modules, UserFunctions.pm and SysHook.pm, which are designed to load specific subclasses of Havok modules. Currently, changes to Havok modules require a restart to take effect, but future versions of Havok will support dynamic loading when possible, so Havok capabilities can be enabled and disabled while the database is running.
The UserFunctions module lets developers add new SQL functions to Genezzo. It provides a basic mechanism to import Perl functions from other packages into the Genezzo namespace. Future versions of UserFunctions will support parse-time type-checking of function arguments. Genezzo uses this package to load all of the standard SQL functions.
The SysHook module lets developers replace existing system code or add new functions at well-defined locations in the base code. This functionality is similar to Emacs hooks or an aspect supplying advice at join points, though there are some crucial differences.
In the case of Emacs, "normal" hooks do not take arguments, they typically ignore the return status of other hooks if multiple hooks are chained, and they are indifferent to their position in the chain. Most Genezzo hooks do take arguments, and they should note the error status of other hooks in the chain to avoid propagating errors and corrupting data. Also, the execution order of hooks is likely to be very important. For example, a chain of hooks that is activated after a buffer read might decompress the buffer and then decrypt its contents. The corresponding buffer write must be preceded by a set of hooks performing complementary operations: an encryption followed by compression.
For the case of aspects, the typical notions of "cross-cutting concerns" are features like error handling or logging, which are common to multiple modules. For Genezzo SysHooks, however, a developer can construct a new module that binds hooks to several disparate, unrelated methods in multiple locations to define new functionality. The hooked routines in the base code become "friends" of a new SysHook class.
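The points about hook arguments, error status, and ordering can be made concrete with a small sketch. It does not use the SysHook API. The names run_chain, wrapper, and unwrapper are hypothetical, and the "encrypt" and "compress" steps are toy stand-ins that only wrap the buffer in a tagged envelope.

```perl
use strict;
use warnings;

# Run hooks in order; each hook takes the buffer and returns (ok, new_buffer).
# The chain stops at the first failure so later hooks never see bad data.
sub run_chain {
    my ($buf, @hooks) = @_;
    for my $hook (@hooks) {
        my ($ok, $out) = $hook->($buf);
        return (0, $buf) unless $ok;
        $buf = $out;
    }
    return (1, $buf);
}

# Toy transform: wrap the buffer in "tag(...)".
sub wrapper {
    my ($tag) = @_;
    return sub { my ($buf) = @_; return (1, "${tag}(${buf})") };
}

# Toy inverse: strip "tag(...)", failing if the envelope is not outermost.
sub unwrapper {
    my ($tag) = @_;
    return sub {
        my ($buf) = @_;
        my ($inner) = $buf =~ /^\Q$tag\E\((.*)\)$/s;
        return defined $inner ? (1, $inner) : (0, undef);
    };
}

my @write_hooks = (wrapper('enc'),   wrapper('zip'));     # encrypt, then compress
my @read_hooks  = (unwrapper('zip'), unwrapper('enc'));   # decompress, then decrypt

my ($w_ok, $stored)   = run_chain('row data', @write_hooks);
my ($r_ok, $restored) = run_chain($stored, @read_hooks);
my ($bad_ok)          = run_chain($stored, reverse @read_hooks);   # wrong order

print "stored:   $stored\n";
print "restored: $restored\n";
print "wrong order ", ($bad_ok ? "succeeded" : "was rejected"), "\n";
```

Running the read hooks in the wrong order fails at the first hook, and the chain reports the failure instead of handing a damaged buffer to the next hook.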
The current SysHook implementation has some deficiencies compared to aspect-oriented languages like AspectJ. In AspectJ, a developer can declare pre or post hooks on any function, but SysHook requires a code stub of the form "if exists(hook) then &hook()" in the function. However, Damian Conway's Hook::LexWrap module does provide pre/post hook functionality on arbitrary Perl functions, so SysHook may be adapted to use this technique. Also, an aspect declaration can use a regexp to match a set of functions, while in Genezzo, each hook must explicitly declare the function name. It is feasible to extend Havok so it can examine the symbol table and use a regexp and/or SQL query-type mechanism to associate multiple functions with a hook.
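As a rough illustration of the symbol-table idea, the sketch below walks a package's symbol table and attaches a pre-hook to every function whose name matches a regexp. It uses only core Perl. It is neither the SysHook implementation nor Hook::LexWrap, and the package Sketch::Cache and the function add_pre_hook are hypothetical.

```perl
use strict;
use warnings;

package Sketch::Cache;    # hypothetical package standing in for base code
sub read_block  { return "read $_[0]" }
sub write_block { return "write $_[0]" }
sub stats       { return "stats" }

package main;

my @trace;    # records which wrapped functions were called

# Wrap every sub in $package whose name matches $pattern with a pre-hook.
sub add_pre_hook {
    my ($package, $pattern, $pre) = @_;
    no strict 'refs';
    no warnings 'redefine';
    my @wrapped;
    for my $name (sort keys %{"${package}::"}) {
        next unless $name =~ $pattern;
        my $orig = *{"${package}::$name"}{CODE} or next;   # skip non-subs
        *{"${package}::$name"} = sub {
            $pre->($name, @_);       # pre-hook sees the name and arguments
            return $orig->(@_);      # then the original function runs
        };
        push @wrapped, $name;
    }
    return @wrapped;
}

my @wrapped = add_pre_hook('Sketch::Cache', qr/_block$/, sub { push @trace, $_[0] });

my $result = Sketch::Cache::read_block(7);   # goes through the pre-hook
Sketch::Cache::stats();                      # name does not match, so no hook

print "wrapped: @wrapped\n";
print "result:  $result\n";
print "trace:   @trace\n";
```

With this approach the hooked functions need no stub of their own, at the cost of replacing symbol-table entries at run time.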
In a similar vein to AspectJ inter-type declarations, Genezzo developers can add new members and methods to existing classes. Adding a new method is simple -- programmers can use the standard Perl eval to create a new function in the namespace of a specified package. Since Genezzo classes are constructed from Perl hashes, adding new members dynamically is trivial, so the only challenge is to use a consistent naming scheme. Genezzo already reserves a Contrib subdirectory for community contributions, e.g., the cluster code is stored in CPAN at Genezzo::Contrib::Clustered. Similarly, a Genezzo class will maintain a Contrib entry like $foo->{Contrib}. New elements should be named according to the CPAN package, e.g. $foo->{Contrib}->{Clustered}. An alternate style is to use the full package name as the key, $foo->{Contrib}->{Genezzo::Contrib::Clustered}, which is mainly intended for packages that are not under Contrib. With this approach, a Havok module can store private state which is associated with a particular instantiation of a class in the Genezzo base code. This state can serve as a communication channel, a method known as stigmergy.
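A minimal sketch of both techniques follows: a method is added to an existing class with a string eval, and per-object extension state is kept under the Contrib key. The packages Sketch::Base and Sketch::Contrib::Clustered, the method lock_count, and the locks field are hypothetical and are not taken from the Genezzo source.

```perl
use strict;
use warnings;

package Sketch::Base;    # hypothetical base class built on a Perl hash
sub new { my ($class) = @_; return bless { Contrib => {} }, $class }

package Sketch::Contrib::Clustered;    # hypothetical extension package

# Add a new method to the base class at run time with a string eval.
sub install {
    my $code = q{
        package Sketch::Base;
        sub lock_count {
            my ($self) = @_;
            return $self->{Contrib}{Clustered}{locks} || 0;
        }
        1;
    };
    eval $code or die "could not install method: $@";
}

# Keep the extension's private state under its own Contrib key.
sub take_lock {
    my ($obj) = @_;
    $obj->{Contrib}{Clustered}{locks}++;
    return;
}

package main;

Sketch::Contrib::Clustered::install();

my $foo = Sketch::Base->new;
Sketch::Contrib::Clustered::take_lock($foo) for 1 .. 2;

my $locks = $foo->lock_count;    # the new method reads the private state
print "locks held: $locks\n";
```

Because the state lives under a key named after the extension, two extensions can annotate the same object without colliding.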
While techniques like SysHook and Open Classes allow developers to change the overall behavior of a class, Genezzo also supports a special MailBag argument which lets developers craft changes which specifically target the class constructors in a particular control flow or dynamic scope.
Perl function calls take a general list of arguments, similar to C varargs. With the exception of a limited function prototype syntax provided by the compiler, functions must perform their own checks on the validity of their function arguments. Most functions in Genezzo follow a standard Perl design pattern which uses the convention that the argument list is a set of named, position-independent values, e.g. the code return table_func(tablename=>"foo", column=>5); describes a call to the function table_func with the named arguments tablename and column taking the values "foo" and "5", respectively. Typically, some values are mandatory and some are optional. When it is desirable to pass values to a function which is deeply nested beneath the caller, the function may follow the convention that "extra" arguments (that is, arguments that are not from the set of known mandatory and optional arguments) are passed along to any functions beneath the current routine. This practice is quite flexible, but the resulting code is less comprehensible, since the knowledge of what the arguments are and where they are defined becomes obscured, and there is the potential for collision and ambiguity in the argument names. To mitigate these problems, Genezzo introduces the convention of a MailBag argument, an argument that contains multiple parameters for different recipients, which is specifically intended to be passed down the function call chain. Note that the Perl Aspect module has a Wormhole package which provides somewhat equivalent functionality.
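The named-argument convention, including the practice of passing "extra" arguments down the call chain, can be sketched as follows. The function name table_func comes from the example above, but its body, the inner function column_func, and the verbose option are hypothetical.

```perl
use strict;
use warnings;

# Inner function: 'column' is mandatory, 'verbose' is optional (default 0).
sub column_func {
    my %args = (verbose => 0, @_);
    die "column is required\n" unless defined $args{column};
    return "column $args{column}, verbose=$args{verbose}";
}

# Outer function: consumes 'tablename' and forwards every other argument.
sub table_func {
    my %args = @_;
    my $tablename = delete $args{tablename};
    die "tablename is required\n" unless defined $tablename;
    return "table $tablename: " . column_func(%args);
}

# 'verbose' is unknown to table_func, so it travels on to column_func.
my $summary = table_func(tablename => "foo", column => 5, verbose => 1);
print "$summary\n";
```

Nothing in table_func documents that verbose exists or who consumes it, which is the loss of comprehensibility described above.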
In its current usage, the MailBag is utilized by class constructors and/or initializers, so it is only propagated along function call chains where new classes are instantiated. The MailBag is "loaded" with messages which contain a named sender and intended recipient (using the Perl package names). A function can call the CheckMail method on the MailBag to see if it has any messages which match its address. The actual contents of the message and the associated recipient behaviors are open-ended, but some standards will probably evolve. The MailBag argument lets classes communicate with other classes that fall within their dynamic scope (as opposed to the more conventional notion of lexical scope).
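The flow described above might look roughly like the sketch below. Only the names MailBag and CheckMail come from the text. The class Sketch::MailBag, the SendMail method, the argument names From, To, and Msg, and the CheckMail signature are all assumptions made for illustration and may differ from the real Genezzo code.

```perl
use strict;
use warnings;

package Sketch::MailBag;    # hypothetical; not the real Genezzo MailBag class

sub new { my ($class) = @_; return bless { mail => [] }, $class }

# Load a message with a named sender and recipient (package names).
sub SendMail {
    my ($self, %args) = @_;
    push @{ $self->{mail} }, { From => $args{From}, To => $args{To}, Msg => $args{Msg} };
    return $self;
}

# Return the messages addressed to the given package name.
sub CheckMail {
    my ($self, %args) = @_;
    return grep { $_->{To} eq $args{To} } @{ $self->{mail} };
}

package Sketch::Block;      # innermost constructor: the intended recipient
sub new {
    my ($class, %args) = @_;
    my $self = bless {}, $class;
    if (my $bag = $args{MailBag}) {
        my ($first) = $bag->CheckMail(To => $class);
        $self->{owner} = $first->{From} if $first;
    }
    return $self;
}

package Sketch::Table;      # intermediate constructor: forwards the bag unread
sub new {
    my ($class, %args) = @_;
    return bless { block => Sketch::Block->new(MailBag => $args{MailBag}) }, $class;
}

package main;

my $bag = Sketch::MailBag->new;
$bag->SendMail(From => 'Sketch::Cache', To => 'Sketch::Block', Msg => { note => 'hello' });

my $table = Sketch::Table->new(MailBag => $bag);
my $owner = $table->{block}{owner};
print "block was addressed by: $owner\n";
```

The intermediate constructor passes the bag along without knowing what it contains, so only the addressed class needs to understand the message.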
The primary motivation for MailBag was the need to co-ordinate the actions of space management with block metadata. The buffer cache treats database blocks as raw byte buffers, but certain Havok modules, like Genezzo::Contrib::Clustered, need to update block metadata when blocks are read, written, or updated. The MailBag argument lets the buffer cache establish an association with the instance of an RDBlock class that is created for each raw byte buffer.
Genezzo supports a variety of mechanisms which let developers construct novel extensions to the base functionality. The data formats are designed for long-term compatibility, while allowing the easy addition of future technologies and techniques as yet undeveloped. Its highly-adaptable design means that the basic architecture is suitable for simple, single-user installations or large clusters.