« back published by Martin Joo on February 26, 2025
Implementing MySQL's Write-Ahead Log (WAL)
In the last 13 years, I've used many programming languages, frameworks, tools, etc. One thing became clear: relational databases outlive every tech stack.
Let's understand them!
Building a database engine
Even better. What about building a database engine?
In fact, I've already done that. And wrote a book called Building a database engine.
You can already pre-order it if you want to support me 🙏
It's going to be available on the 8th of April. It discusses topics like:
- Storage layer
- Insert/Update/Delete/Select
- Write-Ahead Log (WAL) and fault tolerance
- B-Tree indexes
- Hash-based full-text indexes
- Page-level caching with LRU
- ...and more
Check it out and pre-order to get the best price:

The following article comes from the book.
Introduction
What happens if the server crashes while inserting a record into a table? What happens if you run 10 inserts but the 4th one fails?
The database is probably left in an inconsistent state. Or the table file is corrupted. Meaning, only part of the data was written successfully and we left the file in an invalid state.
These are extremely hard problems to solve. Database developers have spent years on these consistency- and fault-tolerance-related problems. They came up with ACID, which consists of lots of different techniques.
Our database is not gonna be a very reliable one. Which is okay. It's just a toy project. But we can add a simplified version of WAL.
One of the most important fault tolerance mechanisms is WAL or write-ahead log. This is how it works:
- When an operation is executed, it is first appended to an append-only log. This is the WAL.
- Each log entry has attributes like:
- Unique ID
- Operation type (insert, delete, update)
- Data (the newly inserted row, for example)
- When the operation finishes the WAL is written to disk. But not to the table file. There's a separate file for the WAL itself.
- Once the data is safe in the log file it is updated in the table files.
Real database engines use WAL together with transactions. So the real flow looks like this:
- The transaction starts
- Each operation is written to an in-memory WAL
- The transaction finishes
- The in-memory WAL is written to a log file
- Changes are recorded in the table files
This means that the WAL file contains the most updated state at any moment. If, for some reason, writing to the table file fails, the WAL still contains the lost data.
What happens if the server crashes while writing to the table file? At the next startup, data can be restored from the WAL file.
Write-ahead logs help us develop more fault tolerant systems. It is used in many other cases:
- Virtually every database engine
- Distributed systems such as Kafka or Kubernetes
- Journaling file systems
- Message queues
- Redis and other cache systems
- git
- and the list goes on
But what happens if something goes wrong while writing to the WAL file? That's beyond the scope of this article.
Simplified version
We'll implement a simplified version of WAL:
- Only inserts will be recorded
- There's not going to be an in-memory WAL, only the file
- No transactions
We'll have a dedicated WAL for every table. So there are two files per table:
users.bin (the actual table file)
users_wal.bin (the write-ahead log for users)
This is much easier than putting everything into the users.bin file.
This is how the process works:
- Before writing to the table file, the engine will append the new record to the log file
- This new log entry has a unique ID
- It writes the data to the table file (this is the insert we've already discussed)
- Then it "commits" the ID
The last step is very important and it has nothing to do with transactions.
There are two branches the execution can take:
- The insert is executed successfully
- It fails
Successful insert
Let's imagine this scenario with pseudo-code:
insert(record) {
    uniqueID = wal.append(record)
    file.write(record)
    wal.commit(uniqueID)
}
append writes the record and some metadata to the write-ahead log:
unique ID | operation | data |
---|---|---|
wal-1 | insert | {"username": "user1", ...} |
wal-2 | insert | {"username": "user2", ...} |
wal-3 | insert | {"username": "user3", ...} |
Let's say user3 is the record we want to insert. It is now appended to the log.
Next, the function writes user3 into the table file.
As the last step, we need to "commit" the ID wal-3 because we know it is successfully written to the table file. So at the next startup, wal-3 doesn't need to be restored. It is safe.
For now, you can imagine commit as a flag in the log file:
unique ID | operation | data | committed |
---|---|---|---|
wal-1 | insert | {"username": "user1", ...} | true |
wal-2 | insert | {"username": "user2", ...} | true |
wal-3 | insert | {"username": "user3", ...} | true |
In other words "committed" means "is it written to the table file as well?"
When wal.append first appends the row, committed is false. Then wal.commit sets it to true.
Failure
Now let's see what happens when things don't work out as expected:
insert(record) {
    uniqueID = wal.append(record)

    err = file.write(record)
    if err != nil {
        return err
    }

    wal.commit(uniqueID)
}
The record is appended to the log with committed=false:
unique ID | operation | data | committed |
---|---|---|---|
wal-1 | insert | {"username": "user1", ...} | true |
wal-2 | insert | {"username": "user2", ...} | true |
wal-3 | insert | {"username": "user3", ...} | false |
Then file.write fails, so the record is not written to disk and the log entry is not committed.
At the next startup, the engine checks uncommitted entries and tries to persist them into the table file:
startup() {
    wal.restore()
}

// in the WAL package
restore() {
    // returns wal-3
    entries := getUncommittedEntries()

    for _, e := range entries {
        insert(e.data)
    }
}
It reads uncommitted entries from the WAL file and inserts them into the table.
This is an oversimplified version of WAL. For example, partial write errors can still happen. Meaning, file.write writes some bytes and then fails. In this case, the table file is left in a bad state.
But this simplified version provides a minimum of fault tolerance.
Append
All WAL-related code is located in internal/table/wal:
This is the WAL object:
const (
    FilenameTmpl       = "%s_wal.bin"
    LastIDFilenameTmpl = "%s_wal_last_commit.bin"
)

type (
    WAL struct {
        file           *os.File
        lastCommitFile *os.File
    }
)
It has a file and a lastCommitFile. Earlier, I introduced the committed flag. I lied to you. In the real implementation, we're going to store the last committed ID in a separate file. Later, I'll explain why.
This means each table has three files:
users.bin (the actual table file)
users_wal.bin (the write-ahead log for users)
users_wal_last_commit.bin (the ID of the last commit)
Once again, I'm keeping three files to simplify our lives. In reality, it should be one users.bin file.
There's an exported (public) Append function:
func (w *WAL) Append(op, table string, data []byte) (*Entry, error) {
    id, err := generateID()
    if err != nil {
        return nil, fmt.Errorf("WAL.Append: %w", err)
    }

    if _, err := w.file.Seek(0, io.SeekEnd); err != nil {
        return nil, fmt.Errorf("WAL.Append: %w", err)
    }

    // ...
}
It takes three arguments:
- op is the operation, such as "insert"
- table holds the table name in which the change happened. This is not strictly necessary, because we have a dedicated WAL file for each table, but each WAL entry will store the table name as well.
- data is the actual record that is being inserted. It's the TLV-encoded byte representation of the record. Since we only handle inserts, a byte array can be used to represent the data. Update or delete would require a more complicated object with the old data and new data, for example.
generateID returns a random, unique, 32-character string, such as:
6d60a1d09898e9057e41f950447f5f90
This is going to be the unique ID for each log entry.
Next, the function seeks the file to the end. This is important. WAL is an append-only log file so each new entry is appended to the end.
The next step is marshaling the data into a TLV-encoded byte sequence. Even though it's "just" a log, we still use TLV:
marshaler := walencoding.NewWALMarshaler(id, op, table, data)

buf, err := marshaler.MarshalBinary()
if err != nil {
    return nil, fmt.Errorf("WAL.Append: %w", err)
}
There's a WALMarshaler object that does the job. It works like any marshaler we've seen so far:
type WALMarshaler struct {
    ID    string
    Table string
    Op    string
    Data  []byte
}

func (m *WALMarshaler) MarshalBinary() ([]byte, error) {
    // ...
}
It accepts the to-be-encoded data as its fields and implements the MarshalBinary function.
At this point, you have seen enough marshaling logic, so I'll skip the boring part. It uses the same objects and functions from earlier chapters. For example:
func (m *WALMarshaler) MarshalBinary() ([]byte, error) {
    buf := bytes.Buffer{}

    typeMarshaler := encoding.NewValueMarshaler(types.TypeWALItem)
    typeBuf, err := typeMarshaler.MarshalBinary()
    if err != nil {
        return nil, fmt.Errorf("WALMarshaler.MarshalBinary: %w", err)
    }
    buf.Write(typeBuf)

    // ...

    return buf.Bytes(), nil
}
It encodes the data and returns a TLV byte sequence:
20 120 0 0 0 -> WAL type, 120 bytes long
2 32 0 0 0 53 54 99 54 54 100 99 49 99 101 50 100 54 100 50 49 57 55 56 55 52 49 54 55 97 57 100 55 98 49 102 101 -> string, 32-char long, the unique ID
2 6 0 0 0 105 110 115 101 114 116 -> string, 6-char long, op type, "insert"
2 5 0 0 0 117 115 101 114 115 -> string, 5-char long, table name, "users"
100 57 0 0 0 1 8 0 0 0 1 0 0 0 0 0 0 0 2 5 0 0 0 117 115 101 114 49 3 1 0 0 0 20 2 17 0 0 0 115 111 102 116 119 97 114 101 32 101 110 103 105 110 101 101 114 4 1 0 0 0 1
In more human-friendly form:
Type | ID | Op | Table | Data |
---|---|---|---|---|
WAL | 6d60a1d09898e9057e41f950447f5f90 | insert | users | {"username": "user1", ...} |
The last part, starting with 100 57 0 0 0, is the actual users record: 100 is the record type, and it's 57 bytes long.
After data is encoded, it needs to be written into the file:
if err = w.write(buf); err != nil {
    return nil, fmt.Errorf("WAL.Append: %w", err)
}

return newEntry(id, buf), nil
write is a very simple helper function:
func (w *WAL) write(buf []byte) error {
    n, err := w.file.Write(buf)
    if err != nil {
        return fmt.Errorf("WAL.write: %w", err)
    }

    if n != len(buf) {
        return fmt.Errorf("WAL.write: incomplete write. expected: %d, actual: %d", len(buf), n)
    }

    return nil
}
It writes the buffer to file.
And finally, the function returns an instance of Entry:
Entry struct {
    ID  string
    Len uint32
}
It holds the generated random ID and the length of the entire log entry. The length will be important later.
As you might have expected, Append isn't complicated at all:
func (w *WAL) Append(op, table string, data []byte) (*Entry, error) {
    id, err := generateID()
    if err != nil {
        return nil, fmt.Errorf("WAL.Append: %w", err)
    }

    if _, err := w.file.Seek(0, io.SeekEnd); err != nil {
        return nil, fmt.Errorf("WAL.Append: %w", err)
    }

    marshaler := walencoding.NewWALMarshaler(id, op, table, data)

    buf, err := marshaler.MarshalBinary()
    if err != nil {
        return nil, fmt.Errorf("WAL.Append: %w", err)
    }

    if err = w.write(buf); err != nil {
        return nil, fmt.Errorf("WAL.Append: %w", err)
    }

    return newEntry(id, buf), nil
}
Append is called in Table.Insert, before writing to the table file:
func (t *Table) Insert(record map[string]interface{}) (int, error) {
    if _, err := t.file.Seek(0, io.SeekEnd); err != nil {
        return 0, fmt.Errorf("Table.Insert: %w", err)
    }

    if err := t.validateColumns(record); err != nil {
        return 0, fmt.Errorf("Table.Insert: %w", err)
    }

    buf := bytes.Buffer{}
    // ...

    walEntry, err := t.wal.Append(walencoding.OpInsert, t.Name, buf.Bytes())
    if err != nil {
        return 0, fmt.Errorf("Table.Insert: %w", err)
    }
    _ = walEntry // used later, in the Commit section

    if _, err := t.file.Write(buf.Bytes()); err != nil {
        return 0, fmt.Errorf("Table.Insert: %w", err)
    }

    return 1, nil
}
It calls Append before writing to the file.
Now, every time an Insert happens, a new log entry is created in <table>_wal.bin.
The WAL file is created or loaded in WAL's constructor. An instance of WAL is created for each table in Database.readTables:
func (db *Database) readTables() (Tables, error) {
    // ...

    r, err := parserio.NewReader(f)

    columnDefReader := columnio.NewColumnDefinitionReader(r)

    tableName, err := table.GetTableName(f)
    if err != nil {
        return nil, fmt.Errorf("Database.readTables: %w", err)
    }

    writeAheadLog, err := wal.NewWAL(db.path, tableName)
    if err != nil {
        return nil, fmt.Errorf("Database.readTables: %w", err)
    }

    t, err := table.NewTable(f, r, columnDefReader, writeAheadLog)
    if err != nil {
        return nil, fmt.Errorf("Database.readTables: %w", err)
    }

    // ...
}
And of course, Table has a field of type WAL:
type Table struct {
    Name            string
    file            *os.File
    columnNames     []string
    columns         Columns
    reader          *platformio.Reader
    recordParser    *parser.RecordParser
    columnDefReader *columnio.ColumnDefinitionReader
    wal             *wal.WAL
}
Commit
Commit takes an Entry (created by Append), encodes it to TLV, and writes it to <table>_wal_last_commit.bin:
func (w *WAL) Commit(entry *Entry) error {
marshaler := walencoding.NewLastCommitMarshaler(entry.ID, entry.Len)
data, err := marshaler.MarshalBinary()
if err != nil {
return fmt.Errorf("WAL.Commit: %w", err)
}
if err := os.WriteFile(w.lastCommitFile.Name(), data, 0644); err != nil {
return fmt.Errorf("WAL.Commit: %w", err)
}
return nil
}
It couldn't be simpler than this.
The only unusual thing is this:
if err := os.WriteFile(w.lastCommitFile.Name(), data, 0644); err != nil {
    return fmt.Errorf("WAL.Commit: %w", err)
}
Instead of:
w.lastCommitFile.Write(buf)
os.WriteFile overwrites the file's content. This is exactly what we need. The last commit file always contains one row: the last commit.
In fact, it contains the last commit ID and the length of the last entry. For example:
21 46 0 0 0 -> "last commit" type, 46 bytes long
2 32 0 0 0 53 57 100 98 101 52 57 52 53 51 102 97 51 54 49 101 56 97 102 57 53 99 51 98 102 53 56 56 49 100 57 98 -> string, the 32-char unique ID
5 4 0 0 0 125 0 0 0 -> uint32, the length of the last log entry, 125 bytes
In more human-friendly form:
Type | ID | Length |
---|---|---|
last commit | 6d60a1d09898e9057e41f950447f5f90 | 125 |
There's a dedicated marshaler object for the last commit data:
func (m *LastCommitMarshaler) MarshalBinary() ([]byte, error) {
    buf := bytes.Buffer{}

    typeMarshaler := encoding.NewValueMarshaler(types.TypeWALLastIDItem)
    typeBuf, err := typeMarshaler.MarshalBinary()
    if err != nil {
        return nil, fmt.Errorf("LastCommitMarshaler.MarshalBinary: %w", err)
    }
    buf.Write(typeBuf)

    // ...
}
It works as every other marshaler in the system.
The last step is actually calling Commit in Table.Insert:
func (t *Table) Insert(record map[string]interface{}) (int, error) {
    if _, err := t.file.Seek(0, io.SeekEnd); err != nil {
        return 0, fmt.Errorf("Table.Insert: %w", err)
    }

    if err := t.validateColumns(record); err != nil {
        return 0, fmt.Errorf("Table.Insert: %w", err)
    }

    buf := bytes.Buffer{}
    // ...

    walEntry, err := t.wal.Append(walencoding.OpInsert, t.Name, buf.Bytes())
    if err != nil {
        return 0, fmt.Errorf("Table.Insert: %w", err)
    }

    if _, err := t.file.Write(buf.Bytes()); err != nil {
        return 0, fmt.Errorf("Table.Insert: %w", err)
    }

    if err := t.wal.Commit(walEntry); err != nil {
        return 0, fmt.Errorf("Table.Insert: %w", err)
    }

    return 1, nil
}
This is the exact flow we discussed in the introduction:
- Appending to the log
- Writing into the table file
- Committing
If t.file.Write() fails, the given ID is never committed. In the log file, we have data like this:
Type | ID | Op | Table | Data |
---|---|---|---|---|
WAL | 6d60a1d09898e9057e41f950447f5f90 | insert | users | {"username": "user1", ...} |
WAL | 95ba9b6ec64090a0668616d985118e45 | insert | users | {"username": "user2", ...} |
But the last commit file contains a row like this:
Type | ID | Length |
---|---|---|
last commit | 6d60a1d09898e9057e41f950447f5f90 | 125 |
The newly inserted 95ba9b6ec64090a0668616d985118e45 entry is not committed. When the database server starts, it knows it needs to restore ID 95ba9b6ec64090a0668616d985118e45 and write it to the table file.
Restore
Collecting restorable data
The first step in restore is to get the data that needs to be restored. Here's the process:
- Read the last committed ID from <table>_wal_last_commit.bin
- Read the last entry from the WAL
- Compare the IDs
If the two IDs are identical there's nothing to do. Everything is up-to-date.
This is actually very important.
Restore will run each time the server is started. The log file can have millions of records. 99% of the time there's not going to be anything to restore. We need to optimize for this use case because it happens a lot.
This is why we store the length of the last entry in the last commit file. By knowing the ID and the length, we can check the last entry in the WAL file very quickly.
In pseudo code:
lastCommitID = "6d60a1d09898e9057e41f950447f5f90"
lastEntryLength = 125

// The file pointer points at the beginning of the last entry
walFile.Seek(-1*lastEntryLength, io.SeekEnd)
lastEntry = walFile.Read()

// Everything's fine
if lastEntry.ID == lastCommitID {
    return
}
With only one read operation we can handle the most common use case.
Storing the last entry's length is important because Seek can be used like this:
file.Seek(-1*length, io.SeekEnd)
It seeks length number of bytes backward from the end of the file, so we can position the file pointer at the beginning of the last entry.
After that, if we read length number of bytes, we get the last entry:
buf := make([]byte, length)
n, err := file.Read(buf)
That's the happy path. But what happens if the two IDs are different?
- Seek to the beginning of the file
- Read every entry one by one
- Check if the current entry is the last committed one
- Every entry after the last committed ID needs to be collected
- Write the collected entries into the table file
Remember, the WAL is append-only. Entries are stored in chronological order. This means that if the function sees the last committed ID, it is guaranteed that every entry after it needs to be collected and written into the table file.
As you can see, it's a linear O(n) search, which is slow, especially if the table is large. A great optimization would be to use auto-increment IDs instead of random strings. Since the log file is ordered, binary search could be used, which means O(log n) time complexity. But our engine doesn't support auto-increments and I wanted to keep it simple.
The first thing to do is to read the last committed ID:
func (w *WAL) GetRestorableData() (*RestorableData, error) {
    if _, err := w.lastCommitFile.Seek(0, io.SeekStart); err != nil {
        return nil, fmt.Errorf("WAL.GetRestorableData: seek: %w", err)
    }

    data := make([]byte, 1024)
    n, err := w.lastCommitFile.Read(data)
    if err != nil {
        return nil, fmt.Errorf("WAL.GetRestorableData: read: %w", err)
    }
    data = data[:n]

    unmarshaler := walencoding.NewLastCommitUnmarshaler()
    if err = unmarshaler.UnmarshalBinary(data); err != nil {
        return nil, fmt.Errorf("WAL.GetRestorableData: unmarshal: %w", err)
    }

    lastCommittedID := unmarshaler.ID

    // ...
}
It reads up to 1024 bytes from the beginning of the last commit file and then decodes them. The last commit file is tiny: it always contains one fixed-size row, so 1024 bytes is way more than its actual size.
The next step is reading the last entry from the WAL and comparing the two IDs:
func (w *WAL) GetRestorableData() (*RestorableData, error) {
    // ...

    lastEntry, err := w.readLastEntry(unmarshaler.Len)
    if err != nil {
        return nil, fmt.Errorf("WAL.GetRestorableData: %w", err)
    }

    if lastEntry.ID == lastCommittedID {
        return nil, nil
    }

    // ...
}
Remember, the last commit file contains the record's length, which can be used to easily read the last row.
If the two IDs are equal, it returns. In this case, there's nothing to restore.
readLastEntry looks like this:
func (w *WAL) readLastEntry(length uint32) (*Entry, error) {
    if _, err := w.file.Seek(-1*int64(length), io.SeekEnd); err != nil {
        return nil, fmt.Errorf("WAL.readLastEntry: %w", err)
    }

    buf := make([]byte, length)
    n, err := w.file.Read(buf)
    if err != nil {
        return nil, fmt.Errorf("WAL.readLastEntry: %w", err)
    }

    if n != int(length) {
        return nil, fmt.Errorf("WAL.readLastEntry: incomplete read")
    }

    // ...
}
This is the important part. After that, it just decodes the byte sequence (using existing objects we've already discussed) and returns an Entry.
If the two IDs are not identical, we need to do a full log scan:
func (w *WAL) GetRestorableData() (*RestorableData, error) {
    // ...

    if lastEntry.ID == lastCommittedID {
        return nil, nil
    }

    buf, err := w.getRestorableData(lastCommittedID)
    if err != nil {
        return nil, fmt.Errorf("WAL.GetRestorableData: %w", err)
    }

    return &RestorableData{
        LastEntry: lastEntry,
        Data:      buf,
    }, nil
}
getRestorableData runs the loop, and it's a bit longer:
func (w *WAL) getRestorableData(commitID string) ([]byte, error) {
    if _, err := w.file.Seek(0, io.SeekStart); err != nil {
        return nil, fmt.Errorf("WAL.getRestorableData: %w", err)
    }

    r, err := platformio.NewReader(w.file)
    if err != nil {
        return nil, fmt.Errorf("WAL.getRestorableData: %w", err)
    }

    // ...
}
It seeks back to the beginning of the file and creates a Reader, which we built earlier.
Then it starts the for loop:
func (w *WAL) getRestorableData(commitID string) ([]byte, error) {
    // ...

    commitIDFound := false
    buf := bytes.Buffer{}

    for {
        t, err := r.ReadByte()
        if err != nil {
            if err == io.EOF {
                return buf.Bytes(), nil
            }

            return nil, fmt.Errorf("WAL.getRestorableData: %w", err)
        }

        if t != types.TypeWALEntry {
            return nil, fmt.Errorf("WAL.getRestorableData: invalid type")
        }

        // ...
    }
}
commitIDFound will be set to true once we find the ID. Every entry after that needs to be returned.
It starts a loop that reads the type flag as a byte. There are two special cases here:
- If it hits EOF, then we processed the whole file. In this case, buf can be returned. The function returns the byte sequence that can directly be inserted into the table file later.
- If the byte is not TypeWALEntry, then something went wrong. The file is TLV encoded, so each log entry must start with a TypeWALEntry byte.
If you remember, the WAL file looks like this:
Type | ID | Op | Table | Data |
---|---|---|---|---|
WAL | 6d60a1d09898e9057e41f950447f5f90 | insert | users | {"username": "user1", ...} |
WAL | 95ba9b6ec64090a0668616d985118e45 | insert | users | {"username": "user2", ...} |
After reading the type and length bytes, we can just parse the TLV values (such as ID, op, table) with the existing TLVParser object:
func (w *WAL) getRestorableData(commitID string) ([]byte, error) {
    // ...

    length, err := r.ReadUint32()
    if err != nil {
        return nil, fmt.Errorf("WAL.getRestorableData: %w", err)
    }

    tlvParser := parser.NewTLVParser(r)

    val, err := tlvParser.Parse()
    if err != nil {
        return nil, fmt.Errorf("WAL.getRestorableData: %w", err)
    }
    id := val.(string)

    // ...
}
Once the ID is parsed, it needs to be checked against the commitID argument:
func (w *WAL) getRestorableData(commitID string) ([]byte, error) {
    // ...

    if id == commitID {
        commitIDFound = true

        if err = w.skipEntry(id, length); err != nil {
            return nil, fmt.Errorf("WAL.getRestorableData: %w", err)
        }

        continue
    }

    // ...
}
commitIDFound is set to true and the current log entry is skipped by skipEntry (which just contains two file.Seek calls).
What should happen if the current ID is not equal to commitID and commitIDFound is false? Just skip the current entry:
func (w *WAL) getRestorableData(commitID string) ([]byte, error) {
    // ...

    if id == commitID {
        commitIDFound = true

        if err = w.skipEntry(id, length); err != nil {
            return nil, fmt.Errorf("WAL.getRestorableData: %w", err)
        }

        continue
    }

    // We are before the commit ID so this entry can be skipped entirely
    if !commitIDFound {
        if err = w.skipEntry(id, length); err != nil {
            return nil, fmt.Errorf("WAL.getRestorableData: %w", err)
        }

        continue
    }
}
What should happen if the current ID is not equal to commitID but commitIDFound is true? In other words, we are processing entries after the commit ID. We need to collect them:
func (w *WAL) getRestorableData(commitID string) ([]byte, error) {
    if id == commitID {
        // ...
        continue
    }

    if !commitIDFound {
        // ...
        continue
    }

    // If we are here, the current entry needs to be processed because
    // commitIDFound is true at this point, so we're after the commit ID
}
At this point, there are three remaining fields in the log entry that need to be processed:
- Op (such as "insert")
- Table (such as "users")
- Data (which is an entire table record)
Let's grab the first two:
func (w *WAL) getRestorableData(commitID string) ([]byte, error) {
    // ...

    // op
    val, err = tlvParser.Parse()
    if err != nil {
        return nil, fmt.Errorf("WAL.getRestorableData: %w", err)
    }
    op := val.(string)

    if op != walencoding.OpInsert {
        return nil, fmt.Errorf("WAL.getRestorableData: unsupported operation: %s", op)
    }

    // table
    val, err = tlvParser.Parse()
}
As I said, this WAL implementation only supports inserts, so it returns an error if op is something else.
After that, the actual record data can be processed. It's an entire TLV record on its own:
100 57 0 0 0 1 8 0 0 0 19 0 0 0 0 0 0 0 ...
The first step is reading the type and length flags and writing them into a byte buffer:
func (w *WAL) getRestorableData(commitID string) ([]byte, error) {
    // ...

    // data
    t, err = r.ReadByte()
    if err != nil {
        return nil, fmt.Errorf("WAL.getRestorableData: %w", err)
    }

    if t != types.TypeRecord {
        return nil, fmt.Errorf("WAL.getRestorableData: invalid type: %d, %d was expected", t, types.TypeRecord)
    }

    length, err = r.ReadUint32()
    if err != nil {
        return nil, fmt.Errorf("WAL.getRestorableData: %w", err)
    }

    buf.WriteByte(t)

    if err = binary.Write(&buf, binary.LittleEndian, length); err != nil {
        return nil, fmt.Errorf("WAL.getRestorableData: %w", err)
    }
}
Since the type is a single byte, buf.WriteByte can write it into the buffer. But length is a 32-bit integer, so we use binary.Write to encode it correctly.
Finally, value can be processed:
func (w *WAL) getRestorableData(commitID string) ([]byte, error) {
    // ...

    recordBuf := make([]byte, length)
    if _, err = r.Read(recordBuf); err != nil {
        return nil, fmt.Errorf("WAL.getRestorableData: %w", err)
    }

    buf.Write(recordBuf)
}
A new buffer is created with the given length, so r.Read reads the right number of bytes.
At the end, buf contains a byte sequence like this:
100 57 0 0 0 1 8 0 0 0 19 0 0 0 0 0 0 0 ...
This represents one record in a table.
If there are multiple restorable records, buf contains all of them:
100 57 0 0 0 1 8 0 0 0 19 0 0 0 0 0 0 0 ...
100 57 0 0 0 1 8 0 0 0 22 0 0 0 0 0 0 0 ...
100 57 0 0 0 1 8 0 0 0 48 0 0 0 0 0 0 0 ...
It's a simple array of bytes.
That's the entire function. There's no return at the end because it's a loop that checks for EOF at the beginning.
Here's a simplified version of the function so you can read it in one go:
func (w *WAL) getRestorableData(commitID string) ([]byte, error) {
    if _, err := w.file.Seek(0, io.SeekStart); err != nil {
        return nil, fmt.Errorf("WAL.getRestorableData: %w", err)
    }

    r, err := platformio.NewReader(w.file)

    commitIDFound := false
    buf := bytes.Buffer{}

    for {
        t, err := r.ReadByte()
        if err != nil {
            if err == io.EOF {
                return buf.Bytes(), nil
            }

            return nil, fmt.Errorf("WAL.getRestorableData: %w", err)
        }

        if t != types.TypeWALEntry {
            return nil, fmt.Errorf("WAL.getRestorableData: invalid type")
        }

        length, err := r.ReadUint32()

        tlvParser := parser.NewTLVParser(r)

        val, err := tlvParser.Parse()
        id := val.(string)

        if id == commitID {
            commitIDFound = true

            if err = w.skipEntry(id, length); err != nil {
                return nil, fmt.Errorf("WAL.getRestorableData: %w", err)
            }

            continue
        }

        // We are before the commit ID so the entry can be skipped entirely
        if !commitIDFound {
            if err = w.skipEntry(id, length); err != nil {
                return nil, fmt.Errorf("WAL.getRestorableData: %w", err)
            }

            continue
        }

        // We are after the commit so this entry needs to be restored
        // op
        val, err = tlvParser.Parse()
        op := val.(string)

        if op != walencoding.OpInsert {
            return nil, fmt.Errorf("WAL.getRestorableData: unsupported operation: %s", op)
        }

        // table
        val, err = tlvParser.Parse()

        // data
        t, err = r.ReadByte()

        if t != types.TypeRecord {
            return nil, fmt.Errorf("WAL.getRestorableData: invalid type: %d, %d was expected", t, types.TypeRecord)
        }

        length, err = r.ReadUint32()

        buf.WriteByte(t)

        if err = binary.Write(&buf, binary.LittleEndian, length); err != nil {
            return nil, fmt.Errorf("WAL.getRestorableData: %w", err)
        }

        record := make([]byte, length)
        if _, err = r.Read(record); err != nil {
            return nil, fmt.Errorf("WAL.getRestorableData: %w", err)
        }

        buf.Write(record)
    }
}
The function that calls this one returns an instance of RestorableData:
RestorableData struct {
    LastEntry *Entry
    Data      []byte
}
It needs to store the last entry in the log file, because once we write Data to the table file, we need to commit the last entry.
Restoring the table
Once we have the to-be-restored data, we can add a function to the Table struct that writes the data into the table file and commits the last entry.
This function will be called RestoreWAL. It is executed when the database starts: the database reads its tables and calls RestoreWAL for each of them.
func (t *Table) RestoreWAL() error {
    if _, err := t.file.Seek(0, io.SeekEnd); err != nil {
        return fmt.Errorf("Table.RestoreWAL: %w", err)
    }

    restorableData, err := t.wal.GetRestorableData()
    if err != nil {
        return fmt.Errorf("Table.RestoreWAL: %w", err)
    }

    // Nothing to restore
    if restorableData == nil {
        fmt.Printf("RestoreWAL skipped\n")

        return nil
    }

    n, err := t.file.Write(restorableData.Data)
    if err != nil {
        return fmt.Errorf("Table.RestoreWAL: %w", err)
    }

    if n != len(restorableData.Data) {
        return fmt.Errorf("Table.RestoreWAL: %w", columnio.NewIncompleteWriteError(len(restorableData.Data), n))
    }

    fmt.Printf("RestoreWAL wrote %d bytes\n", n)

    if err = t.wal.Commit(restorableData.LastEntry); err != nil {
        return fmt.Errorf("Table.RestoreWAL: %w", err)
    }

    return nil
}
It couldn't be simpler than this:
- It seeks the file to the end because it might append records to it
- It gets the restorable data from the WAL
- If the return value is nil, it returns early
- Otherwise, it writes the byte sequence to the table file
- And finally, it commits the last entry
RestoreWAL is used in Database.NewDatabase, which runs whenever you instantiate a new database:
func NewDatabase(name string) (*Database, error) {
    // ...

    tables, err := db.readTables()
    if err != nil {
        return nil, fmt.Errorf("NewDatabase: %w", err)
    }
    db.Tables = tables

    for _, t := range db.Tables {
        if err := t.RestoreWAL(); err != nil {
            return nil, fmt.Errorf("NewDatabase: %w", err)
        }
    }

    return db, nil
}