Advanced Features of C++

Code Review 7: Serialization and Deserialization

The Calculator Object

Look at main: There are too many objects there. The symbol table, the function table and the store. All three objects have the same lifespan--the duration of the program execution. They have to be initialized in particular order and all three of them are passed to the constructor of the parser. They just scream to be combined into a single object called--you guessed it--the Calculator. Embedding them in the right order inside this class will take care of the correct order of initialization.

class Calculator { friend class Parser; public: Calculator () : _funTab (_symTab), _store (_symTab) {} private: Store & GetStore () { return _store; } PFun GetFun (int id) const { return _funTab.GetFun (id); } bool IsFunction (int id) const { return id < _funTab.Size (); } int AddSymbol (std::string const & str) { return _symTab.ForceAdd (str); } int FindSymbol (std::string const & str) const { return _symTab.Find (str); } SymbolTable _symTab; Function::Table _funTab; Store _store; };

Of course, now we have to make appropriate changes (read: simplifications) in main and in the parser. Here are just a few examples--in the declaration of the parser:

class Parser { public: Parser (Scanner & scanner, Calculator & calc); ... private: ... Scanner & _scanner; auto_ptr<Node> _pTree; Status _status; Calculator & _calc; };

and in its implementation.

// Factor := Ident if (id == SymbolTable::idNotFound) { id = _calc.AddSymbol (strSymbol); } pNode = auto_ptr<Node> (new VarNode (id, _calc.GetStore ()));

Have you noticed something? We just went ahead and made another major top-level change in our project, just like this! In fact it was almost trivial to do, with just a little help from the compiler. Here's the prescription.

Start in the spot in main where the symbol table, function table and store are defined (constructed). Replace them with the new object, calculator. Declare the class for Calculator and write a constructor for it. Now, if you are really lazy and tired of thinking, fire off the compiler. It will immediately tell you what to do next: You have to modify the constructor of the parser. You have to pass it the calculator rather than its three separate parts. At this point you might notice that it will be necessary to change the class declaration of the Parser to let it store a reference to the Calculator. Or, you could run the compiler again and let it remind you of it. Next, you will notice all the compilation errors in the implementation of Parser. You can fix them one-by-one, adding new methods to the Calculator as the need arises. The whole procedure is so simple that you might ask an intern who has just started working on the project to do it with minimal supervision.

The moral of this story is that it's never too late to work on the improvement of the high level structure of the project. The truth is that you rarely get it right the first time. And, by the way, you have just seen the method of top-down program modification. You start from the top and let the compiler lead you all the way down to the nitty-gritty details of the implementation. That's the third part of the top-down methodology which consists of:

Top-down design
Top-down implementation and
Top-down modification.

I can't stress enough the importance of the top-down methodology. I have yet to see a clean, well written piece of code that was created bottom-up. You'll hear people saying that some things are better done top-down, others bottom-up. Some people will say that starting from the middle and expanding in both directions is the best way to go. Take all such statements with a very big grain of salt.

It is a fact that bottom-up development is more natural when you have no idea what you're doing-- when your goal is not to write a specific program, but rather to play around with some "neat stuff." It's an easy way, for instance, to learn the interface to some obscure subsystem that you might want to use. Bottom-up development is also preferable if you're not very good at design or if you dislike just sitting there and thinking instead of coding. It is a plus if you enjoy long hours of debugging or have somebody else (hopefully not the end user!) to debug your code.

Finally, if you embrace the bottom-up philosophy, you'll have to resign yourself to never being able to write a professionally looking piece of code. Your programs will always look to the trained eye like those electronics projects created with Radio Shack parts, on breadboards, with bent wires sticking out in all directions and batteries held together with rubber bands.

The real reason I decided to finally get rid of the top level mess and introduce the Calculator object was to simplify the job of adding a new piece of functionality. Every time the management asks you to add new features, take the opportunity to sneak in a little rewrite of the existing code. The code isn't good enough if it hasn't been rewritten at least three times. I'm serious!

By rewriting I don't mean throwing it away and starting from scratch. Just take your time every now and then to improve the structure of each part of the project. It will pay off tremendously. It will actually shorten the development cycle. Of course, if you have stress-puppy managers, you'll have a hard time convincing them about it. They will keep running around shouting nonsense like "if it ain't broken, don't fix it" or "if we don't ship it tomorrow, we are all dead." The moment you buy into that, you're doomed! You'll never be able to do anything right and you'll be spending more and more time fixing the scaffolding and chasing bugs in some low quality temporary code pronounced to be of the "ain't broken" quality. Welcome to the maintenance nightmare!

So here we are, almost at the end of our project, when we are told that if we don't provide a command to save and restore the state of the calculator from a file, we're dead. Fortunately, we can add this feature to the program without much trouble and, as a bonus, do some more cleanup.

Command Parser

We'll go about adding new functionality in an orderly fashion. We have to provide the user with a way to input commands. So far we've had a hack for inputting the quit command--an empty line was interpreted as "quit." Now that we want to add two more commands, save and restore, we can as well find a more general solution. I probably don't have to tell you that, but...

Whenever there are more than two special cases, you should generalize them.

The calculator expects expressions from the user. Let's distinguish commands from expressions by prefixing them with an exclamation sign. Exclamation has the natural connotation of commanding somebody to do something. We'll use a prefix rather than a suffix to simplify our parsing. We'll also make quit a regular command; to be input as "!q". We'll even remind the user of this command when the calculator starts.

cerr << "\n!q to quit\n";

The new Scanner method IsCommand simply checks for the leading exclamation sign. Once we have established that a line of text is a command, we create a simple CommandParser to parse and execute it.

if (!scanner.IsEmpty ()) { if (scanner.IsCommand()) { CommandParser parser (scanner, calc); status = parser.Execute (); } else { Parser parser (scanner, calc); status = parser.Parse (); if (status == stOk) { double result = parser.Calculate (); cout << result << endl; } else { cerr << "Syntax error\n"; } } }

Here's the new class, CommandParser,

class CommandParser { enum ECommand { comSave, comLoad, comQuit, comError }; public: CommandParser (Scanner & scanner, Calculator & calc); Status Execute (); private: Status Load (std::string const & nameFile); Status Save (std::string const & nameFile); Scanner & _scanner; Calculator & _calc; ECommand _command; };

and this is how it parses a command.

CommandParser::CommandParser (Scanner & scanner, Calculator & calc) : _scanner (scanner), _calc (calc) { assert (_scanner.IsCommand()); _scanner.Accept (); std::string name = _scanner.GetSymbolName (); switch (name [0]) { case 'q': case 'Q': _command = comQuit; break; case 's': case 'S': _command = comSave; break; case 'l': case 'L': _command = comLoad; break; default: _command = comError; break; } }

Notice that we use the Scanner method GetSymbolName to retrieve the command string.

The load and save commands require an argument, the file name. We retrieve it from the scanner using, again, the method SymbolName.

Status CommandParser::Execute () { scanner.AcceptCommand (); std::string nameFile; switch (_command) { case comSave: nameFile = _scanner.GetSymbolName (); return Save (nameFile); case comLoad: nameFile = _scanner.GetSymbolName (); return Load (nameFile); case comQuit: cerr << "Good Bye!" << endl; return stQuit; case comError: cerr << "Error" << endl; return stError; } return stOk; }

We use the new method, AcceptCommand, to accept the command and read the following string. The string, presumably a file name, must be terminated by a whitespace. Notice that we can't use the regular Accept method of the Scanner, because it will only read strings that have the form of C++ identifiers. It would stop, for instance, after reading a dot, which is considered a perfectly valid part of a file name. (If we were stricter, we would even make provisions for file names with embedded spaces. We'd just require them to be enclosed in quotation marks.)

void Scanner::AcceptCommand () { ReadChar (); _symbol.erase (); while (!isspace (_look)) { _symbol += _look; _look = _in.get (); } }

As usual, we should provide simple stubs for the Load and Save methods and test our program before proceeding any further.

Serialization and Deserialization

We often imagine data structures as two- or even three-dimensional creatures (just think of a parsing tree, a hash table, or a multi-dimensional array). A disk file, on the other hand, has a one-dimensional structure--it's linear. When you write to a file, you write one thing after another--serially. Hence the name serialization. Saving a data structure means transforming a multi-dimensional idea into its one-dimensional representation. Of course, in reality computer memory is also one-dimensional. Our data structures are already, in some manner, serialized in memory. Some of them, like multi-dimensional arrays, are serialized by the compiler, others are fit into linear memory with the use of pointers. Unfortunately, pointers have no meaning outside the context of the currently running instance of the program. You can't save pointers to a file, close the program, start it again, read the file and expect the newly read pointers to point to the same data structures as before.

In order to serialize a data structure, you have to come up with a well-defined procedure for walking it, i.e., visiting every single element of it, one after another. For instance, you can walk a simple linked list by following the next pointers until you hit the end of the list. If the list is circular, you have to remember the initial pointer and, with every step, compare it with the next pointer. A binary tree can be walked by walking the left child first and the right child next (notice that it's a recursive prescription). For every data structure there is at least one deterministic procedure for walking it, but the procedure might be arbitrarily complicated.

Once you know how to walk a data structure, you know how to serialize it. You have a prescription for how to visit every element of the structure, one after another--a serial way of scanning it. At the bottom level of every data structure you find simple, built-in types, like int, char, long, etc. They can be written to a file following a set of simple rules--we'll come back to this point in a moment. If you know how to serialize each basic element, you're done.

Serializing a data structure makes sense only if we know how to restore it--deserialize it from file to memory. Knowing the original serialization procedure helps--we can follow the same steps when when we deserialize it; only now we'll read from file and write to memory, rather than the other way around. We have to make sure, however, that the procedure is unambiguous. For instance, we have to know when to stop reading elements of a given data structure. We must know where the end of a data structure is. The clues that were present during serialization might not be present on disk. For instance, a linked list had a null pointer as next in its last element. But if we decide not to store pointers, how are we to know when we have reached the end of the list? Of course, we may decide to store the pointers anyway, just to have a clue when to stop. Or, even better, we could store the count of elements in front of the list.

The need to know sizes of data structures before we can deserialize them imposes additional constraints on the order of serialization. When we serialize one part of the program's data, all other parts are present in memory. We can often infer the size of a given data structure by looking into other data structures. When deserializing, we don't have this comfort. We either have to make sure that these other data structures are deserialized first, or add some redundancy to the serialized image, e.g., store the counts multiple times. A good example is a class that contains a pointer to a dynamically allocated array and the current size of the array. It really doesn't matter which member comes first, the pointer or the count. However, when serializing an object we must store the count first and the contents of the array next. Otherwise we won't be able to allocate the appropriate amount of memory or read the correct number of entries.

Another kind of ambiguity might arise when storing polymorphic data structures. For instance, a binary node contains two pointers to Node. That's not a problem when we serialize it--we can tell the two children to serialize themselves by calling the appropriate virtual functions. But when the time comes to deserialize the node, how do we know what the real type of each child was? We have to know that before we can even start deserializing them. That's why the serialized image of any polymorphic data structure has to start with some kind of code that identifies the class of the data structure. Based on this code, the deserializer will be able to call the appropriate constructor.

Let's now go back to our project and implement the (de-) serialization of the Calculator's data structures. First we have to create an output file. This file will be encapsulated inside a serial stream. The stream can accept a number of basic data types, long, double; as well as some simple aggregates, like strings; and write them to the file.

Notice that I didn't mention the most common type--the integer. That's because the size of the integer is system dependent. Suppose that you serialize a data structure that contains integers and send it on a diskette or through e-mail to somebody who has a version of the same program running on a different processor. Your program might write an integer as two bytes and their program might expect a four-byte or even eight-byte integer. That's why, when serializing, we convert the system-dependent types, like integers, to system-independent types like longs. In fact, it's not only the size that matters--the order of bytes is important as well.

There are essentially two kinds of processors, the ones that use the Big Endian and the ones that use the Little Endian order (some can use either). For instance, a short or a long can be stored most-significant-byte-first or least-significant-byte-first. The Intel(tm) family of processor stores the least significant byte first--the Little Endian style--whereas the Motorola(tm) family does the opposite. So if you want your program to inter-operate between Wintel and Macintosh (tm), you'll have to take the order of bytes into account when you serialize. Of course, if you're not planning on porting your program between the two camps, you may safely ignore one of them.
Anyway, in most cases you should take precautions against variable size types and convert integers or enumerations to fixed-size types.

It would be great to be able to assume that once you come up with the on-disk format for your program, it will never change. In real life it would be very na�ve. Formats change and the least you can do to acknowledge it is to refuse to load a format you don't understand.

Always store a version number in your on-disk data structures.

In order to implement serialization, all we have to do is to create a stream, write the version number into it and tell the calculator to serialize itself. By the way, we are now reaping the benefits of our earlier combining several objects into the Calculator object.

const long Version = 1; Status CommandParser::Save (std::string const & nameFile) { cerr << "Save to: \"" << nameFile << "\"\n"; Status status = stOk; try { Serializer out (nameFile); out.PutLong ( Version ); _calc.Serialize (out); } catch (char* msg) { cerr << "Error: Save failed: " << msg << endl; status = stError; } catch (...) { cerr << "Error: Save failed\n"; status = stError; } return status; }

When deserializing, we follow exactly the same steps, except that now we read instead of writing and deserialize instead of serializing. And, if the version number doesn't match, we refuse to load.

Status CommandParser::Load (std::string const & nameFile) { cerr << "Load from: \"" << nameFile << "\"\n"; Status status = stOk; try { DeSerializer in (nameFile); long ver = in.GetLong (); if (ver != Version) throw "Version number mismatch"; _calc.DeSerialize (in); } catch (char* msg) { cerr << "Error: Load failed: " << msg << endl; status = stError; } catch (...) { cerr << "Error: Load failed\n"; // data structures may be corrupt throw; } return status; }

There are two objects inside the Calculator that we'd like to save to the disk--the symbol table and the store--the names of the variables and their values. So that's what we'll do.

void Calculator::Serialize (Serializer & out) { _symTab.Serialize (out); _store.Serialize (out); }

void Calculator::DeSerialize (DeSerializer & in) { _symTab.DeSerialize (in); _store.DeSerialize (in); }

The symbol table consists of a dictionary that maps strings to integers plus a variable that contains the current id. And the simplest way to walk the symbol table is indeed in this order. To walk the standard map we will use its iterator. First we have to store the count of elements, so that we know how many to read during deserialization. Then we will iterate over the whole map and store pairs: string, id. Notice that the iterator for std::map points to a std::pair which has first and second data members. According to our previous discussion, we store the integer id as a long.

void SymbolTable::Serialize (Serializer & out) const { out.PutLong (_dictionary.size ()); std::map<std::string, int>::const_iterator it; for (it = _dictionary.begin (); it != _dictionary.end (); ++it) { out.PutString (it->first); out.PutLong (it->second); } out.PutLong (_id); }

The deserializer must read the data in the same order as they were serialized: first the dictionary, then the current id. When deserializing the map, we first read its size. Then we simply read pairs of strings and longs and add them to the map. Here we treat the map as an associative array. Notice that we first clear the existing dictionary. We have to do it, otherwise we could get into conflicts, with the same id corresponding to different strings.

void SymbolTable::DeSerialize (DeSerializer & in) { _dictionary.clear (); int len = in.GetLong (); for (int i = 0; i < len; ++i) { std::string str = in.GetString (); int id = in.GetLong (); _dictionary [str] = id; } _id = in.GetLong (); }

Notice that for every serialization procedure we immediately write its counterpart--the deserialization procedure. This way we make sure that the two match.

The serialization of the store is also very simple. First the size and then a series of pairs (double, bool).

void Store::Serialize (Serializer & out) const { int len = _aCell.size (); out.PutLong (len); for (int i = 0; i < len; ++i) { out.PutDouble (_aCell [i]); out.PutBool (_aIsInit [i]); } }

When deserializing the store, we first clear the previous values, read the size and then read the pairs (double, bool) one by one. We have a few options when filling the two vectors with new values. One is be to push them back, one by one. Since we know the number of entries up front, we could reserve space in the vectors up front, by calling the method reserve. Here I decided to resize the vectors instead and then treat them as arrays. The resizing fills the vector of doubles with zeroes and the vector of bool with false (these are the default values for these types).

There is an important difference between reserve and resize. Most standard containers have either one or both of these methods. Reserve makes sure that there will be no re-allocation when elements are added, e.g., using push_back, up to the reserved capacity. This is a good optimization, in case we know the required capacity up front. In the case of a vector, the absence of re-allocation also means that iterators, pointers or references to the elements of the vector won't be suddenly invalidated by internal reallocation.
Reserve, however, does not change the size of the container. Resize does. When you resize a container new elements are added to it. (Consequently, you can't resize containers that store objects with no default constructors or default values.)

reserve--changes capacity but not size
resize--changes size

You can enquire about the current capacity of the container by calling its capacity method. And, of course, you get its size by calling size.

void Store::DeSerialize (DeSerializer & in) { _aCell.clear (); _aIsInit.clear (); int len = in.GetLong (); _aCell.resize (len); _aIsInit.resize (len); for (int i = 0; i < len; ++i) { _aCell [i] = in.GetDouble (); _aIsInit [i] = in.GetBool (); } }

Finally, let's have a look at the implementation of the deserializer stream. It is a pretty thin layer on top of the output stream.

#include <fstream> using std::ios_base; const long TruePattern = 0xfab1fab2; const long FalsePattern = 0xbad1bad2; class DeSerializer { public: DeSerializer (std::string const & nameFile) : _stream (nameFile.c_str (), ios_base::in | ios_base::binary) { if (!_stream.is_open ()) throw "couldn't open file"; } long GetLong () { if (_stream.eof()) throw "unexpected end of file"; long l; _stream.read (reinterpret_cast<char *> (&l), sizeof (long)); if (_stream.bad()) throw "file read failed"; return l; } double GetDouble () { double d; if (_stream.eof()) throw "unexpected end of file"; _stream.read (reinterpret_cast<char *> (&d), sizeof (double)); if (_stream.bad()) throw "file read failed"; return d; } std::string GetString () { long len = GetLong (); std::string str; str.resize (len); _stream.read (&str [0], len); if (_stream.bad()) throw "file read failed"; return str; } bool GetBool () { long b = GetLong (); if (_stream.bad()) throw "file read failed"; if (b == TruePattern) return true; else if (b == FalsePattern) return false; else throw "data corruption"; } private: std::ifstream _stream; };

Several interesting things happen here. First of all: What are these strange flags that we pass to ifstream::open ()? The first one, ios_base::in, means that we are opening the file for input. The second one, ios_base::binary, tells the operating system that we don't want any carriage return-linefeed translations.
What is this carriage return-linefeed nonsense? It's one the biggest blunders of the DOS file system, that was unfortunately inherited by all flavors of Windows. The creators of DOS decided that the system should convert single character '\n' into a pair '\r', '\n'. The reasoning was that, when you print a file, the printer interprets carriage return, '\r', as the command to go back to the beginning of the current line, and line feed, '\n', as the command to move down to the next line (not necessarily to its beginning). So, to go to the beginning of the next line, a printer requires two characters. Nowadays, when we use laser printers that understand Postscript and print wysywig documents, this whole idea seems rather odd. Even more so if you consider that an older operating system, Unix, found a way of dealing with this problem without involving low level file system services.
Anyway, if all you want is to store bytes of data in a file, you have to remember to open it in the "binary" mode, otherwise you might get unexpected results. By the way, the default mode is ios_base::text which does the unfortunate character translation.

Another interesting point is that the method ifstream::read reads data to a character buffer--it expects char * as its first argument. When we want to read a long, we can't just pass the address of a long to it--the compiler doesn't know how to convert a long * to a char *. This is one of these cases when we have to force the compiler to trust us. We want to split the long ito its constituent bytes (we're ignoring here the big endian/little endian problem). A reasonably clean way to do it is to use the reinterpret_cast. We are essentially telling the compiler to "reinterpret" a chunk of memory occupied by the long as a series of chars. We can tell how many chars a long contains by applying to it the operator sizeof.

This is a good place to explain the various types of casts. You use

const_cast--to remove the const attribute
static_cast--to convert related types
reinterpret_cast--to convert unrelated types

(There is also a dynamic_cast, which we won't discuss here.)

Here's an example of const_cast:

char const * str = "No modify!"; char * tmp = const_cast<char *> (str); tmp [0] = 'D';

To understand static_cast, think of it as the inverse of implicit conversion. Whenever type T can be implicitly converted to type U (in other words, T is-a U), you can use static_cast to perform the conversion the other way. For instance, a char can be implicitly converted to an int:

char c = '\n'; int i = c; // implicit conversion

Therefore, when you need to convert an int into a char, use static_cast:

int i = 0x0d; char c = static_cast<char> (i);

Or, if you have two classes, Base and Derived: public Base, you can implicitly convert pointer to Derived to a pointer to Base (Derived is-a Base). Therefore, you can use static_cast to go the other way:

Base * bp = new Derived; // implicit conversion Derived * = static_cast<Base *> (bp);

You should realize that casts are dangerous and should be used very judiciously. Try to avoid casting at all costs. Serialization and deserialization are special in this respect, since they require low level manipulation of types.

Finally, notice the strange way we store Boolean values. A Boolean value really requires only one bit for its storage. But, since we don't want to split bytes (or even longs, for that matter), we'll use some redundancy here. We could, in principle store the value true as one and false as zero. However, it will cost us the same to write a zero as to write an arbitrary value. The difference is that zeros are much more common in files than, say, 0xbad1bad2. So when I read back the value 0xbad1bad2 and I expect a Boolean, I feel reassured that I'm reading sensible data and not some random garbage. This is only one of the ways of using redundancy for consistency checking.

The output serializing stream is the mirror image of DeSerializer.

class Serializer { public: Serializer (std::string const & nameFile) : _stream (nameFile.c_str (), ios_base::out | ios_base::binary) { if (!_stream.is_open ()) throw "couldn't open file"; } void PutLong (long l) { _stream.write (reinterpret_cast<char *> (&l), sizeof (long)); if (_stream.bad()) throw "file write failed"; } void PutDouble (double d) { _stream.write (reinterpret_cast<char *> (&d), sizeof (double)); if (_stream.bad()) throw "file write failed"; } void PutString (std::string const & str) { int len = str.length (); PutLong (len); _stream.write (str.data (), len); if (_stream.bad()) throw "file write failed"; } void PutBool (bool b) { long l = b? TruePattern: FalsePattern; PutLong (l); if (_stream.bad ()) throw "file write failed"; } private: std::ofstream _stream; };

There is a shortcut notation combining assignment with a conditional. The following code:

long l = b? TruePattern: FalsePattern;

is equivalent to:

long l; if (b) l = TruePattern; else l = FalsePattern;

The ternary (meaning, three-argument) operator A? B: C first evaluates A. If A is true, it evaluates and returns B, otherwise it evaluates and returns C. A piece of trivia: unlike in C, in C++ the ternary operator returns an l-value, so it can be used on the left-hand-side of the assignment. Not that I would recommend this style!
There is an even more obscure operator in C++, the comma sequencing operator. The expression A, B first evaluates A, then evaluates and returns B. The evaluation of A is therefore a side-effect of the whole operation. Most often the comma operator is used to combine two expressions where one is expected, like in this double loop:

for (int i = 0, j = 0; i < maxI && j < maxJ; ++i, ++j)

By the way, the first comma separates the declarations (complete with initialization) of two variables of the same type. It's the second comma, between ++i and ++j, that is the sequencing operator.

Notice how protective we are when reading from or writing to a file. That's because our program doesn't have full control of the disk. A write can fail because we run out of disk space. This can happen at any time, because we are not the only client of the file system--there are other applications and system services that keep allocating (and presumably freeing) disk space. Reading is worse, because we're not even sure what to expect in the file. Not only may a read fail because of a hardware problem (unreadable disk sector), but we must be prepared for all kinds of sabotage. Other applications could have gotten hold of our precious file and truncated, edited or written all over it. We can't even be sure that the file we are trying to parse has been created by our program. The user could have mistakenly or maliciously pass to our program the name of some executable, a spreadsheet or autoexec.bat.

We already have the first line of defense against such cases of mistaken identity or downright corruption--the version number. The first four bytes we read from the file must match our current version number or we refuse to load it. The error message we display in such a case is a bit misleading. A much better solution would be to spare a few additional bytes and stamp all our files with a magic number. Many people use their initials for the magic number in the hope that one day they'll be able to say to their children or grandchildren, "You see these bytes at the beginning of each file of this type? These are your mom's (dad's, gramma's, grampa's) initials." Provided the application or the system survives that long and is not widely considered an example of bad software engineering.

In-Memory (De-) Serialization

Serialization of data structures is not necessarily related to their storage in files. Sometimes you just want to store some data structure in a chunk of memory, especially if you want to pass it to another application. Programs can talk to each other and pass data through shared memory or other channels (Windows clipboard comes to mind). You might also want to send data in packets across the network. These are all situations in which you can't simply pass pointers embedded in your data. You have to change the format of data.

The serialization procedure is the same, whether the output goes to a file or to memory. In fact, if your data structure is serializable (it has the Serialize and DeSerialize methods), all you might need to do in order to serialize it to memory is to change the implementation of Serializer and DeSerializer. Even better, you might make these classes abstract--turn methods PutLong, PutDouble, PutBool and PutString to pure virtual--and provide two different implementations, one writing to a file and one writing to memory. You can do the same with the deserializer.

There is one big difference between a file and a chunk of memory--the file grows as you write to it, a chunk of memory has fixed size. You have two choices--you can either grow your memory buffer as needed, or you can calculate the required amount of memory up front and pre-allocate the whole buffer. As it turns out, calculating the size of a serializable data structure is surprisingly easy. All you need is yet another implementation of the Serializer interface called the counting serializer. The counting serializer doesn't write anything, it just adds up the sizes of various data types it is asked to write.

class CountingSerializer: public Serializer { public: CountingSerializer () : _size (0) {} int GetSize () const { return _size; } void PutLong (long l) { _size += sizeof (long); } void PutDouble (double d) { _size += sizeof (double); } void PutString (std::string const & str) { _size += sizeof (long); // count _size += str.length (); } void PutBool (bool b) { _size += sizeof (long); } private: int _size; };

For instance, if you wanted to calculate the size of the file or memory buffer required for the serialization of a calculator, you'd call its Serialize method with a counting serializer.

CountingSerializer counter; _calc.Serialize (counter); int size = counter.GetSize ();

Remember that, in order for this to work, all methods of Serializer must be virtual.

Multiple Inheritance

In order to make a class serializable, you have to add to it two methods, Serialize and DeSerialize, and implement them. It makes sense, then, to create a separate abstract class--a pure interface--to abstract this behavior.

class Serializable { public: virtual void Serialize (Serializer & out) const = 0; virtual void DeSerialize (DeSerializer & in) = 0; };

All classes that are serializable, should inherit from the Serializable interface.

class Calculator: public Serializable class SymbolTable: public Serializable class Store: public Serializable

What's the advantage of doing that? After all, even when you inherit from Serializable, you still have to add the declaration of the two methods to you class and you have to provide their implementation. Suppose that a new programmer joins your group and he (or she) has to add a new class to the project. One day he sends you email asking, "How do I make this class serializable?" If this functionality is abstracted into a class, your answer could simply be, "Derive your class from Serializable." That's it! No further explanation is necessary.

There is however a catch. What if your class is already derived from some other class? Now it will have to inherit from that class and from Serializable. This is exactly the case in which multiple inheritance can be put to work. In C++ a class may have more than one base class. The syntax for multiple inheritance is pretty straightforward:

class MultiDerived: public Base1, public Base2

Suppose, for instance, that you were not satisfied with treating std::string as a simple type, known to the Serializer. Instead, you'd like to create a separate type, a serializable string. Here's how you could do it, using multiple inheritance:

using std::string; class SerialString: public string, public Serializable { public: SerialString (std::string const & str): string (str) {} void Serialize (Serializer & out) const; void DeSerialize (DeSerializer & in); };

Multiple inheritance is particularly useful when deriving from abstract classes. This kind of inheritance deals with interface rather than implementation. In fact, this is exactly the restriction on multiple inheritance that's built into Java. In Java you can inherit only from one full-blown class, but you can add to it multiple inheritance from any number of interfaces (the equivalent of C++ abstract classes). In most cases this is indeed a very reasonable restriction.

Next: transactions.