« back published by Martin Joo on February 26, 2025
Implementing MySQL's Write-Ahead Log (WAL)
In the last 13 years, I've used many programming languages, frameworks, tools, etc. One thing became clear: relational databases outlive every tech stack.
Let's understand them!
Building a database engine
Even better. What about building a database engine?
In fact, I've already done that. And wrote a book called Building a database engine.
You can already pre-order it if you want to support me 🙏
It's going to be available on the 8th of April. It discusses topics like:
- Storage layer
- Insert/Update/Delete/Select
- Write-Ahead Log (WAL) and fault tolerance
- B-Tree indexes
- Hash-based full-text indexes
- Page-level caching with LRU
- ...and more
Check it out and pre-order to get the best price:

The following article comes from the book.
Introduction
What happens if the server crashes while inserting a record into a table? What happens if you run 10 inserts but the 4th one fails?
The database is probably left in an inconsistent state. Or the table file is corrupted. Meaning, only part of the data was written successfully and we left the file in an invalid state.
These are extremely hard problems to solve. Database developers have spent years on these consistency- and fault-tolerance-related problems. They came up with ACID, which consists of lots of different techniques.
Our database is not gonna be a very reliable one. Which is okay. It's just a toy project. But we can add a simplified version of WAL.
One of the most important fault tolerance mechanisms is WAL or write-ahead log. This is how it works:
- When an operation is executed, it is first appended to an append-only log. This is the WAL.
- Each log entry has attributes like:
- Unique ID
- Operation type (insert, delete, update)
- Data (the newly inserted row, for example)
- When the operation finishes the WAL is written to disk. But not to the table file. There's a separate file for the WAL itself.
- Once the data is safe in the log file it is updated in the table files.
Real database engines use WAL together with transactions. So the real flow looks like this:
- The transaction starts
- Each operation is written to an in-memory WAL
- The transaction finishes
- The in-memory WAL is written to a log file
- Changes are recorded in the table files
This means that the WAL file contains the most updated state at any moment. If, for some reason, writing to the table file fails, the WAL still contains the lost data.
What happens if the server crashes while writing to the table file? At the next startup, data can be restored from the WAL file.
Write-ahead logs help us develop more fault tolerant systems. It is used in many other cases:
- Virtually every database engine
- Distributed systems such as Kafka or Kubernetes
- Journaling file systems
- Message queues
- Redis and other cache systems
- git
- and the list goes on
But what happens if something goes wrong while writing to the WAL file? That's beyond the scope of this article.
Simplified version
We'll implement a simplified version of WAL:
- Only inserts will be recorded
- There's not going to be an in-memory WAL, only the file
- No transactions
We'll have a dedicated WAL for every table. So there are two files per table:
users.bin (the actual table file)
users_wal.bin (the write-ahead log for users)
This is much easier than putting everything into the users.bin file.
This is how the process works:
- Before writing to the table file, the engine will append the new record to the log file
- This new log entry has a unique ID
- It writes the data to the table file (this is the insert we've already discussed)
- Then it "commits" the ID
The last step is very important and it has nothing to do with transactions.
There are two branches the execution can take:
- The insert is executed successfully
- It fails
Successful insert
Let's imagine this scenario with pseudo-code:
insert(record) {
    uniqueID = wal.append(record)
    file.write(record)
    wal.commit(uniqueID)
}
append writes the record and some metadata to the write-ahead log:
unique ID | operation | data |
---|---|---|
wal-1 | insert | {"username": "user1", ...} |
wal-2 | insert | {"username": "user2", ...} |
wal-3 | insert | {"username": "user3", ...} |
Let's say user3 is the record we want to insert. It is now appended to the log.
Next, the function writes user3 into the table file.
As the last step, we need to "commit" the ID wal-3 because we know it is successfully written to the table file. So at the next startup, wal-3 doesn't need to be restored. It is safe.
For now, you can imagine commit as a flag in the log file:
unique ID | operation | data | committed |
---|---|---|---|
wal-1 | insert | {"username": "user1", ...} | true |
wal-2 | insert | {"username": "user2", ...} | true |
wal-3 | insert | {"username": "user3", ...} | true |
In other words "committed" means "is it written to the table file as well?"
When wal.append first appends the row, committed is false. Then wal.commit sets it to true.
Failure
Now let's see what happens when things don't work out as expected:
insert(record) {
    uniqueID = wal.append(record)

    err = file.write(record)
    if err != nil {
        return err
    }

    wal.commit(uniqueID)
}
The record is appended to the log with committed=false:
unique ID | operation | data | committed |
---|---|---|---|
wal-1 | insert | {"username": "user1", ...} | true |
wal-2 | insert | {"username": "user2", ...} | true |
wal-3 | insert | {"username": "user3", ...} | false |
Then file.write fails, so the record is not written to disk and the log entry is not committed.
At the next startup, the engine checks uncommitted entries and tries to persist them into the table file:
startup() {
    wal.restore()
}

// in the WAL package
restore() {
    // returns wal-3
    entries := getUncommittedEntries()

    for _, e := range entries {
        insert(e.data)
    }
}
It reads uncommitted entries from the WAL file and inserts them into the table.
This is an oversimplified version of WAL. For example, partial write errors can still happen. Meaning, file.write writes some bytes and then fails. In this case, the table file is left in a bad state.
But this simplified version provides a minimum of fault tolerance.
Append
All WAL-related code is located in internal/table/wal:
This is the WAL object:
const (
    FilenameTmpl       = "%s_wal.bin"
    LastIDFilenameTmpl = "%s_wal_last_commit.bin"
)

type (
    WAL struct {
        file           *os.File
        lastCommitFile *os.File
    }
)
It has a file and a lastCommitFile. Earlier, I introduced the committed flag. I lied to you. In the real implementation, we're going to store the last committed ID in a separate file. Later, I'll explain why.
This means each table has three files:
users.bin (the actual table file)
users_wal.bin (the write-ahead log for users)
users_wal_last_commit.bin (the ID of the last commit)
Once again, I'm keeping three files to simplify our lives. In reality, it should be one users.bin file.
There's an exported (public) Append function:
func (w *WAL) Append(op, table string, data []byte) (*Entry, error) {
    id, err := generateID()
    if err != nil {
        return nil, fmt.Errorf("WAL.Append: %w", err)
    }

    if _, err := w.file.Seek(0, io.SeekEnd); err != nil {
        return nil, fmt.Errorf("WAL.Append: %w", err)
    }

    // ...
}
It takes three arguments:
- op is the operation, such as "insert"
- table holds the table name in which the change happened. This is not strictly necessary, because we have a dedicated WAL file for each table, but each WAL entry will store the table name as well.
- data is the actual record that is being inserted. It's the TLV-encoded byte representation of the record. Since we only handle inserts, a byte array can be used to represent the data. Update or delete would require a more complicated object with the old data and new data, for example.
generateID returns a random, unique, 32-character string, such as:
6d60a1d09898e9057e41f950447f5f90
This is going to be the unique ID for each log entry.
Next, the function seeks the file to the end. This is important. WAL is an append-only log file so each new entry is appended to the end.
The next step is marshaling the data into a TLV-encoded byte sequence. Even though it's "just" a log, we still use TLV:
marshaler := walencoding.NewWALMarshaler(id, op, table, data)

buf, err := marshaler.MarshalBinary()
if err != nil {
    return nil, fmt.Errorf("WAL.Append: %w", err)
}
There's a WALMarshaler object that does the job. It works like any marshaler we've seen so far:
type WALMarshaler struct {
    ID    string
    Table string
    Op    string
    Data  []byte
}

func (m *WALMarshaler) MarshalBinary() ([]byte, error) {
    // ...
}
It accepts the to-be-encoded data as its fields and implements the MarshalBinary function.
At this point, you have seen enough marshaling logic, so I'll skip the boring part. It uses the same objects and functions from earlier chapters. For example:
func (m *WALMarshaler) MarshalBinary() ([]byte, error) {
    buf := bytes.Buffer{}

    typeMarshaler := encoding.NewValueMarshaler(types.TypeWALItem)
    typeBuf, err := typeMarshaler.MarshalBinary()
    if err != nil {
        return nil, fmt.Errorf("WALMarshaler.MarshalBinary: %w", err)
    }
    buf.Write(typeBuf)

    // ...

    return buf.Bytes(), nil
}
It encodes the data and returns a TLV byte sequence:
20 120 0 0 0 -> WAL type, 120 bytes long
2 32 0 0 0 53 54 99 54 54 100 99 49 99 101 50 100 54 100 50 49 57 55 56 55 52 49 54 55 97 57 100 55 98 49 102 101 -> string, 32-char long, the unique ID
2 6 0 0 0 105 110 115 101 114 116 -> string, 6-char long, op type, "insert"
2 5 0 0 0 117 115 101 114 115 -> string, 5-char long, table name, "users"
100 57 0 0 0 1 8 0 0 0 1 0 0 0 0 0 0 0 2 5 0 0 0 117 115 101 114 49 3 1 0 0 0 20 2 17 0 0 0 115 111 102 116 119 97 114 101 32 101 110 103 105 110 101 101 114 4 1 0 0 0 1
In more human-friendly form:
Type | ID | Op | Table | Data |
---|---|---|---|---|
WAL | 6d60a1d09898e9057e41f950447f5f90 | insert | users | {"username": "user1", ...} |
The last part, starting with 100 57 0 0 0, is the actual users record: 100 is the record type, and it's 57 bytes long.
After data is encoded, it needs to be written into the file:
if err = w.write(buf); err != nil {
    return nil, fmt.Errorf("WAL.Append: %w", err)
}

return newEntry(id, buf), nil
write is a very simple helper function:
func (w *WAL) write(buf []byte) error {
    n, err := w.file.Write(buf)
    if err != nil {
        return fmt.Errorf("WAL.write: %w", err)
    }

    if n != len(buf) {
        return fmt.Errorf("WAL.write: incomplete write. expected: %d, actual: %d", len(buf), n)
    }

    return nil
}
It writes the buffer to file.
And finally, the function returns an instance of Entry:
Entry struct {
    ID  string
    Len uint32
}
It holds the generated random ID and the length of the entire log entry. The length will be important later.
As you might have expected, Append isn't complicated at all:
func (w *WAL) Append(op, table string, data []byte) (*Entry, error) {
    id, err := generateID()
    if err != nil {
        return nil, fmt.Errorf("WAL.Append: %w", err)
    }

    if _, err := w.file.Seek(0, io.SeekEnd); err != nil {
        return nil, fmt.Errorf("WAL.Append: %w", err)
    }

    marshaler := walencoding.NewWALMarshaler(id, op, table, data)

    buf, err := marshaler.MarshalBinary()
    if err != nil {
        return nil, fmt.Errorf("WAL.Append: %w", err)
    }

    if err = w.write(buf); err != nil {
        return nil, fmt.Errorf("WAL.Append: %w", err)
    }

    return newEntry(id, buf), nil
}
Append is called in Table.Insert, before writing to the table file:
func (t *Table) Insert(record map[string]interface{}) (int, error) {
    if _, err := t.file.Seek(0, io.SeekEnd); err != nil {
        return 0, fmt.Errorf("Table.Insert: %w", err)
    }

    if err := t.validateColumns(record); err != nil {
        return 0, fmt.Errorf("Table.Insert: %w", err)
    }

    buf := bytes.Buffer{}
    // ...

    walEntry, err := t.wal.Append(walencoding.OpInsert, t.Name, buf.Bytes())
    if err != nil {
        return 0, fmt.Errorf("Table.Insert: %w", err)
    }
    _ = walEntry // used later, in the Commit section

    if _, err := t.file.Write(buf.Bytes()); err != nil {
        return 0, fmt.Errorf("Table.Insert: %w", err)
    }

    return 1, nil
}
It calls Append before writing to the file.
Now, every time an Insert happens, a new log entry is created in <table>_wal.bin.
The WAL file is created or loaded in WAL's constructor. An instance of WAL is created for each table in Database.readTables:
func (db *Database) readTables() (Tables, error) {
    // ...

    r, err := parserio.NewReader(f)

    columnDefReader := columnio.NewColumnDefinitionReader(r)

    tableName, err := table.GetTableName(f)
    if err != nil {
        return nil, fmt.Errorf("Database.readTables: %w", err)
    }

    writeAheadLog, err := wal.NewWAL(db.path, tableName)
    if err != nil {
        return nil, fmt.Errorf("Database.readTables: %w", err)
    }

    t, err := table.NewTable(f, r, columnDefReader, writeAheadLog)
    if err != nil {
        return nil, fmt.Errorf("Database.readTables: %w", err)
    }

    // ...
}
And of course, Table has a field of type WAL:
type Table struct {
    Name            string
    file            *os.File
    columnNames     []string
    columns         Columns
    reader          *platformio.Reader
    recordParser    *parser.RecordParser
    columnDefReader *columnio.ColumnDefinitionReader
    wal             *wal.WAL
}
Commit
Commit takes an Entry (created by Append), encodes it to TLV, and writes it to <table>_wal_last_commit.bin:
func (w *WAL) Commit(entry *Entry) error {
marshaler := walencoding.NewLastCommitMarshaler(entry.ID, entry.Len)
data, err := marshaler.MarshalBinary()
if err != nil {
return fmt.Errorf("WAL.Commit: %w", err)
}
if err := os.WriteFile(w.lastCommitFile.Name(), data, 0644); err != nil {
return fmt.Errorf("WAL.Commit: %w", err)
}
return nil
}
It couldn't be simpler than this.
The only unusual thing is this:
if err := os.WriteFile(w.lastCommitFile.Name(), data, 0644); err != nil {
    return fmt.Errorf("WAL.Commit: %w", err)
}
Instead of:
w.lastCommitFile.Write(buf)
os.WriteFile overwrites the file's content. This is exactly what we need. The last commit file always contains one row: the last commit.
In fact, it contains the last commit ID and the length of the last entry. For example:
21 46 0 0 0 -> "last commit" type, 46 bytes long
2 32 0 0 0 53 57 100 98 101 52 57 52 53 51 102 97 51 54 49 101 56 97 102 57 53 99 51 98 102 53 56 56 49 100 57 98 -> string, the 32-char unique ID
5 4 0 0 0 125 0 0 0 -> uint32, the length of the last log entry, 125 bytes
In more human-friendly form:
Type | ID | Length |
---|---|---|
last commit | 6d60a1d09898e9057e41f950447f5f90 | 125 |
There's a dedicated marshaler object for the last commit data:
func (m *LastCommitMarshaler) MarshalBinary() ([]byte, error) {
    buf := bytes.Buffer{}

    typeMarshaler := encoding.NewValueMarshaler(types.TypeWALLastIDItem)
    typeBuf, err := typeMarshaler.MarshalBinary()
    if err != nil {
        return nil, fmt.Errorf("LastCommitMarshaler.MarshalBinary: %w", err)
    }
    buf.Write(typeBuf)

    // ...
}
It works as every other marshaler in the system.
The last step is actually calling Commit in Table.Insert:
func (t *Table) Insert(record map[string]interface{}) (int, error) {
    if _, err := t.file.Seek(0, io.SeekEnd); err != nil {
        return 0, fmt.Errorf("Table.Insert: %w", err)
    }

    if err := t.validateColumns(record); err != nil {
        return 0, fmt.Errorf("Table.Insert: %w", err)
    }

    buf := bytes.Buffer{}
    // ...

    walEntry, err := t.wal.Append(walencoding.OpInsert, t.Name, buf.Bytes())
    if err != nil {
        return 0, fmt.Errorf("Table.Insert: %w", err)
    }

    if _, err := t.file.Write(buf.Bytes()); err != nil {
        return 0, fmt.Errorf("Table.Insert: %w", err)
    }

    if err := t.wal.Commit(walEntry); err != nil {
        return 0, fmt.Errorf("Table.Insert: %w", err)
    }

    return 1, nil
}
This is the exact flow we discussed in the introduction:
- Appending to the log
- Writing into the table file
- Committing
If t.file.Write() fails, the given ID is never committed. In the log file, we have data like this:
Type | ID | Op | Table | Data |
---|---|---|---|---|
WAL | 6d60a1d09898e9057e41f950447f5f90 | insert | users | {"username": "user1", ...} |
WAL | 95ba9b6ec64090a0668616d985118e45 | insert | users | {"username": "user2", ...} |
But the last commit file contains a row like this:
Type | ID | Length |
---|---|---|
last commit | 6d60a1d09898e9057e41f950447f5f90 | 125 |
The newly inserted 95ba9b6ec64090a0668616d985118e45 entry is not committed. When the database server starts, it knows it needs to restore ID 95ba9b6ec64090a0668616d985118e45 and write it to the table file.
Restore
Collecting restorable data
The first step in restore is to get the data that needs to be restored. Here's the process:
- Read the last committed ID from <table>_wal_last_commit.bin
- Read the last entry from the WAL
- Compare the IDs
If the two IDs are identical there's nothing to do. Everything is up-to-date.
This is actually very important.
Restore will run each time the server is started. The log file can have millions of records. 99% of the time there's not going to be anything to restore. We need to optimize for this use case because it happens a lot.
This is why we store the length of the last entry in the last commit file. By knowing the ID and the length, we can check the last entry in the WAL file very quickly.
In pseudo code:
lastCommitID = "6d60a1d09898e9057e41f950447f5f90"
lastEntryLength = 125

// The file pointer points at the beginning of the last entry
walFile.Seek(-1*lastEntryLength, io.SeekEnd)
lastEntry = walFile.Read()

// Everything's fine
if lastEntry.ID == lastCommitID {
    return
}
With only one read operation we can handle the most common use case.
Storing the last entry's length is important because Seek can be used like this:
file.Seek(-1*length, io.SeekEnd)
It seeks length number of bytes backward from the end of the file, so we can position the file pointer at the beginning of the last entry.
After that, if we read length number of bytes, we get the last entry:
buf := make([]byte, length)
n, err := file.Read(buf)
That's the happy path. But what happens if the two IDs are different?
- Seek to the beginning of the file
- Read every entry one by one
- Check if the current entry is the last committed one
- Every entry after the last committed ID needs to be collected
- Write the collected entries into the table file
Remember, the WAL is append-only. Entries are stored in chronological order. This means that if the function sees the last committed ID, it is guaranteed that every entry after it needs to be collected and written into the table file.
As you can see, it's a linear O(n) search, which is slow, especially if the table is large. A great optimization would be to use auto-increment IDs instead of random strings. Since the log file is ordered, binary search could be used, which means O(log n) time complexity. But our engine doesn't support auto-increments and I wanted to keep it simple.
The first thing to do is to read the last committed ID:
func (w *WAL) GetRestorableData() (*RestorableData, error) {
    if _, err := w.lastCommitFile.Seek(0, io.SeekStart); err != nil {
        return nil, fmt.Errorf("WAL.GetRestorableData: seek: %w", err)
    }

    data := make([]byte, 1024)
    n, err := w.lastCommitFile.Read(data)
    if err != nil {
        return nil, fmt.Errorf("WAL.GetRestorableData: read: %w", err)
    }
    data = data[:n]

    unmarshaler := walencoding.NewLastCommitUnmarshaler()
    if err = unmarshaler.UnmarshalBinary(data); err != nil {
        return nil, fmt.Errorf("WAL.GetRestorableData: unmarshal: %w", err)
    }

    lastCommittedID := unmarshaler.ID

    // ...
}
It reads up to 1024 bytes from the beginning of the last commit file and then decodes them. The last commit file is tiny: it always contains one fixed-size row, so 1024 bytes is way more than its actual size.
The next step is reading the last entry from the WAL and comparing the two IDs:
func (w *WAL) GetRestorableData() (*RestorableData, error) {
    // ...

    lastEntry, err := w.readLastEntry(unmarshaler.Len)
    if err != nil {
        return nil, fmt.Errorf("WAL.GetRestorableData: %w", err)
    }

    if lastEntry.ID == lastCommittedID {
        return nil, nil
    }

    // ...
}
Remember, the last commit file contains the record's length, which can be used to easily read the last row.
If the two IDs are equal, it returns. In this case, there's nothing to restore.
readLastEntry looks like this:
func (w *WAL) readLastEntry(length uint32) (*Entry, error) {
    if _, err := w.file.Seek(-1*int64(length), io.SeekEnd); err != nil {
        return nil, fmt.Errorf("WAL.readLastEntry: %w", err)
    }

    buf := make([]byte, length)
    n, err := w.file.Read(buf)
    if err != nil {
        return nil, fmt.Errorf("WAL.readLastEntry: %w", err)
    }

    if n != int(length) {
        return nil, fmt.Errorf("WAL.readLastEntry: incomplete read")
    }

    // ...
}
This is the important part. After that, it just decodes the byte sequence (using existing objects we've already discussed) and returns an Entry.
If the two IDs are not identical, we need to do a full log scan:
func (w *WAL) GetRestorableData() (*RestorableData, error) {
    // ...

    if lastEntry.ID == lastCommittedID {
        return nil, nil
    }

    buf, err := w.getRestorableData(lastCommittedID)
    if err != nil {
        return nil, fmt.Errorf("WAL.GetRestorableData: %w", err)
    }

    return &RestorableData{
        LastEntry: lastEntry,
        Data:      buf,
    }, nil
}
getRestorableData runs the loop, and it's a bit longer:
func (w *WAL) getRestorableData(commitID string) ([]byte, error) {
    if _, err := w.file.Seek(0, io.SeekStart); err != nil {
        return nil, fmt.Errorf("WAL.getRestorableData: %w", err)
    }

    r, err := platformio.NewReader(w.file)
    if err != nil {
        return nil, fmt.Errorf("WAL.getRestorableData: %w", err)
    }

    // ...
}
It seeks back to the beginning of the file and creates a Reader, which we built earlier.
Then it starts the for loop:
func (w *WAL) getRestorableData(commitID string) ([]byte, error) {
    // ...

    commitIDFound := false
    buf := bytes.Buffer{}

    for {
        t, err := r.ReadByte()
        if err != nil {
            if err == io.EOF {
                return buf.Bytes(), nil
            }

            return nil, fmt.Errorf("WAL.getRestorableData: %w", err)
        }

        if t != types.TypeWALEntry {
            return nil, fmt.Errorf("WAL.getRestorableData: invalid type")
        }

        // ...
    }
}
commitIDFound will be set to true once we find the ID. Every entry after that needs to be returned.
It starts a loop that reads the type flag as a byte. There are two special cases here:
- If it hits EOF, then we processed the whole file. In this case, buf can be returned. The function returns the byte sequence that can directly be inserted into the table file later.
- If the byte is not TypeWALEntry, then something went wrong. The file is TLV encoded, so each log entry must start with a TypeWALEntry byte.
If you remember, the WAL file looks like this:
Type | ID | Op | Table | Data |
---|---|---|---|---|
WAL | 6d60a1d09898e9057e41f950447f5f90 | insert | users | {"username": "user1", ...} |
WAL | 95ba9b6ec64090a0668616d985118e45 | insert | users | {"username": "user2", ...} |
After reading the type and length bytes, we can just parse the TLV values (such as ID, op, table) with the existing TLVParser object:
func (w *WAL) getRestorableData(commitID string) ([]byte, error) {
    // ...

    length, err := r.ReadUint32()
    if err != nil {
        return nil, fmt.Errorf("WAL.getRestorableData: %w", err)
    }

    tlvParser := parser.NewTLVParser(r)

    val, err := tlvParser.Parse()
    if err != nil {
        return nil, fmt.Errorf("WAL.getRestorableData: %w", err)
    }
    id := val.(string)

    // ...
}
Once the ID is parsed, it needs to be checked against the commitID argument:
func (w *WAL) getRestorableData(commitID string) ([]byte, error) {
    // ...

    if id == commitID {
        commitIDFound = true

        if err = w.skipEntry(id, length); err != nil {
            return nil, fmt.Errorf("WAL.getRestorableData: %w", err)
        }

        continue
    }

    // ...
}
commitIDFound is set to true and the current log entry is skipped by skipEntry (which just contains two file.Seek calls).
What should happen if the current ID is not equal to commitID and commitIDFound is false? Just skip the current entry:
func (w *WAL) getRestorableData(commitID string) ([]byte, error) {
    // ...

    if id == commitID {
        commitIDFound = true

        if err = w.skipEntry(id, length); err != nil {
            return nil, fmt.Errorf("WAL.getRestorableData: %w", err)
        }

        continue
    }

    // We are before the commit ID so this entry can be skipped entirely
    if !commitIDFound {
        if err = w.skipEntry(id, length); err != nil {
            return nil, fmt.Errorf("WAL.getRestorableData: %w", err)
        }

        continue
    }
}
What should happen if the current ID is not equal to commitID but commitIDFound is true? In other words, we are processing entries after the commit ID. We need to collect them:
func (w *WAL) getRestorableData(commitID string) ([]byte, error) {
    if id == commitID {
        // ...
        continue
    }

    if !commitIDFound {
        // ...
        continue
    }

    // If we are here, the current entry needs to be processed because
    // commitIDFound is true at this point, so we're after the commit ID
}
At this point, there are three remaining fields in the log entry that need to be processed:
- Op (such as "insert")
- Table (such as "users")
- Data (which is an entire table record)
Let's grab the first two:
func (w *WAL) getRestorableData(commitID string) ([]byte, error) {
    // ...

    // op
    val, err = tlvParser.Parse()
    if err != nil {
        return nil, fmt.Errorf("WAL.getRestorableData: %w", err)
    }
    op := val.(string)

    if op != walencoding.OpInsert {
        return nil, fmt.Errorf("WAL.getRestorableData: unsupported operation: %s", op)
    }

    // table
    val, err = tlvParser.Parse()
}
As I said, this WAL implementation only supports inserts, so it returns an error if op is something else.
After that, the actual record data can be processed. It's an entire TLV record on its own:
100 57 0 0 0 1 8 0 0 0 19 0 0 0 0 0 0 0 ...
The first step is reading the type and length flags and writing them into a byte buffer:
func (w *WAL) getRestorableData(commitID string) ([]byte, error) {
    // ...

    // data
    t, err = r.ReadByte()
    if err != nil {
        return nil, fmt.Errorf("WAL.getRestorableData: %w", err)
    }

    if t != types.TypeRecord {
        return nil, fmt.Errorf("WAL.getRestorableData: invalid type: %d, %d was expected", t, types.TypeRecord)
    }

    length, err = r.ReadUint32()
    if err != nil {
        return nil, fmt.Errorf("WAL.getRestorableData: %w", err)
    }

    buf.WriteByte(t)

    if err = binary.Write(&buf, binary.LittleEndian, length); err != nil {
        return nil, fmt.Errorf("WAL.getRestorableData: %w", err)
    }
}
Since the type is a single byte, buf.WriteByte can write it into the buffer. But length is a 32-bit integer, so we use binary.Write to encode it correctly.
Finally, value can be processed:
func (w *WAL) getRestorableData(commitID string) ([]byte, error) {
    // ...

    recordBuf := make([]byte, length)
    if _, err = r.Read(recordBuf); err != nil {
        return nil, fmt.Errorf("WAL.getRestorableData: %w", err)
    }

    buf.Write(recordBuf)
}
A new buffer is created with the given length, so r.Read reads the right number of bytes.
At the end, buf contains a byte sequence like this:
100 57 0 0 0 1 8 0 0 0 19 0 0 0 0 0 0 0 ...
This represents one record in a table.
If there are multiple restorable records, buf contains all of them:
100 57 0 0 0 1 8 0 0 0 19 0 0 0 0 0 0 0 ...
100 57 0 0 0 1 8 0 0 0 22 0 0 0 0 0 0 0 ...
100 57 0 0 0 1 8 0 0 0 48 0 0 0 0 0 0 0 ...
It's a simple array of bytes.
That's the entire function. There's no return at the end because it's a loop that checks for EOF at the beginning.
Here's a simplified version of the function so you can read it in one go:
func (w *WAL) getRestorableData(commitID string) ([]byte, error) {
    if _, err := w.file.Seek(0, io.SeekStart); err != nil {
        return nil, fmt.Errorf("WAL.getRestorableData: %w", err)
    }

    r, err := platformio.NewReader(w.file)

    commitIDFound := false
    buf := bytes.Buffer{}

    for {
        t, err := r.ReadByte()
        if err != nil {
            if err == io.EOF {
                return buf.Bytes(), nil
            }

            return nil, fmt.Errorf("WAL.getRestorableData: %w", err)
        }

        if t != types.TypeWALEntry {
            return nil, fmt.Errorf("WAL.getRestorableData: invalid type")
        }

        length, err := r.ReadUint32()

        tlvParser := parser.NewTLVParser(r)

        val, err := tlvParser.Parse()
        id := val.(string)

        if id == commitID {
            commitIDFound = true

            if err = w.skipEntry(id, length); err != nil {
                return nil, fmt.Errorf("WAL.getRestorableData: %w", err)
            }

            continue
        }

        // We are before the commit ID so the entry can be skipped entirely
        if !commitIDFound {
            if err = w.skipEntry(id, length); err != nil {
                return nil, fmt.Errorf("WAL.getRestorableData: %w", err)
            }

            continue
        }

        // We are after the commit so this entry needs to be restored
        // op
        val, err = tlvParser.Parse()
        op := val.(string)

        if op != walencoding.OpInsert {
            return nil, fmt.Errorf("WAL.getRestorableData: unsupported operation: %s", op)
        }

        // table
        val, err = tlvParser.Parse()

        // data
        t, err = r.ReadByte()

        if t != types.TypeRecord {
            return nil, fmt.Errorf("WAL.getRestorableData: invalid type: %d, %d was expected", t, types.TypeRecord)
        }

        length, err = r.ReadUint32()

        buf.WriteByte(t)

        if err = binary.Write(&buf, binary.LittleEndian, length); err != nil {
            return nil, fmt.Errorf("WAL.getRestorableData: %w", err)
        }

        record := make([]byte, length)
        if _, err = r.Read(record); err != nil {
            return nil, fmt.Errorf("WAL.getRestorableData: %w", err)
        }

        buf.Write(record)
    }
}
The function that calls this one returns an instance of RestorableData:
RestorableData struct {
    LastEntry *Entry
    Data      []byte
}
It needs to store the last entry in the log file, because once we write Data to the table file, we need to commit the last entry.
Restoring the table
Once we have the to-be-restored data, we can add a function to the Table struct that writes the data into the table file and commits the last entry.
This function will be called RestoreWAL. It is executed when the database starts: the database reads its tables and calls RestoreWAL for each of them.
func (t *Table) RestoreWAL() error {
    if _, err := t.file.Seek(0, io.SeekEnd); err != nil {
        return fmt.Errorf("Table.RestoreWAL: %w", err)
    }

    restorableData, err := t.wal.GetRestorableData()
    if err != nil {
        return fmt.Errorf("Table.RestoreWAL: %w", err)
    }

    // Nothing to restore
    if restorableData == nil {
        fmt.Printf("RestoreWAL skipped\n")

        return nil
    }

    n, err := t.file.Write(restorableData.Data)
    if err != nil {
        return fmt.Errorf("Table.RestoreWAL: %w", err)
    }

    if n != len(restorableData.Data) {
        return fmt.Errorf("Table.RestoreWAL: %w", columnio.NewIncompleteWriteError(len(restorableData.Data), n))
    }

    fmt.Printf("RestoreWAL wrote %d bytes\n", n)

    if err = t.wal.Commit(restorableData.LastEntry); err != nil {
        return fmt.Errorf("Table.RestoreWAL: %w", err)
    }

    return nil
}
It couldn't be simpler than this:
- It seeks the file to the end because it might append records to it
- It gets the restorable data from the WAL
- If the return value is nil, it returns early
- Otherwise, it writes the byte sequence to the table file
- And finally, it commits the last entry
RestoreWAL is used in Database.NewDatabase, which runs whenever you instantiate a new database:
func NewDatabase(name string) (*Database, error) {
    // ...

    tables, err := db.readTables()
    if err != nil {
        return nil, fmt.Errorf("NewDatabase: %w", err)
    }
    db.Tables = tables

    for _, t := range db.Tables {
        if err := t.RestoreWAL(); err != nil {
            return nil, fmt.Errorf("NewDatabase: %w", err)
        }
    }

    return db, nil
}