Blob

  1. Operations
  2. Conventional attributes

A blob is an unordered set of attribute-value pairs (associative array) constrained by the schema and identified by the hash calculated from its contents.

Values are either strings or sets of strings according to the Attribute cardinality.

type Value
  = StringValue String
  | SetValue (Set String)

type Blob =
  Dict Name Value

For example, given a schema defining attributes name, x and y with datatypes String, Integer and Integer respectively, where y has cardinality n, we can define a blob as follows:

Blob
  [ ("name", "Foo")
  , ("x", "0")
  , ("y", ["1", "2"])
  ]

It can be represented as a Bar entity as:

Bar
  { name = Just "Foo"
  , x = Just 0
  , y = Just [1, 2]
  }

Note that all attributes expect an optional value as explained in the evolution section.

The blob can be serialised in JSON as:

{
  "name": "Foo",
  "x": "0",
  "y": ["1", "2"]
}

Or in CSV as:

name, x, y
Foo, 0, 1;2

In the example above, the JSON serialisation uses the string representation of each primitive value and the schema is needed to cast them back to the right datatype. Check the Serialisation section and the Schema for more details on this topic.
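The casting step can be sketched as follows. This is a minimal illustration, not the specified behaviour: the `SCHEMA` shape (attribute name mapped to a datatype and a cardinality), the `CASTS` table and the `cast_blob` helper are all hypothetical names, and only the String and Integer datatypes from the example are covered.

```python
import json

# Hypothetical schema shape: attribute name -> (datatype, cardinality).
SCHEMA = {
    "name": ("String", "1"),
    "x": ("Integer", "1"),
    "y": ("Integer", "n"),
}

# Assumed casts for the two datatypes used in the example above.
CASTS = {"String": str, "Integer": int}


def cast_blob(blob, schema):
    """Cast each string value to the datatype declared in the schema."""
    result = {}
    for name, value in blob.items():
        datatype, cardinality = schema[name]
        cast = CASTS[datatype]
        if cardinality == "n":
            # Cardinality n: cast every element of the collection.
            result[name] = {cast(el) for el in value}
        else:
            result[name] = cast(value)
    return result


blob = json.loads('{"name": "Foo", "x": "0", "y": ["1", "2"]}')
entity = cast_blob(blob, SCHEMA)
```

Without the schema, "0" could not be told apart from a String that happens to contain a digit, which is why the cast needs it.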

Operations

Hash

The hash is the identity of a blob, computed from its content. As the blob hash is part of an entry, it is included in the input to the entry hash function.

The function takes a blob and a hashing algorithm and returns a Hash datatype.

hash : Blob -> HashingAlgorithm -> Hash

Algorithm

When this algorithm operates on hashes (e.g. tag, concatenate), it operates on their raw bytes, not on the hexadecimal string representation.

  1. Let blob be the normalised blob of data to hash.
  2. Let hashList be an empty list.
  3. Let valueHash be null.
  4. Foreach (name, value) pair in blob:

    1. If value is null, continue.
    2. If value is a Set:

      1. Let elList be an empty list.
      2. Foreach el in value:

        1. If el starts with **REDACTED**, append el without **REDACTED** to elList.
        2. Otherwise, normalise el according to string normalisation, tag it with 0x75 (String), hash it and append the result to elList.
      3. Sort elList, concatenate its elements, tag the result with 0x73 (Set), hash it and set valueHash.
    3. Otherwise, if value starts with **REDACTED**, set valueHash to value without **REDACTED**.
    4. Otherwise, normalise value according to string normalisation, tag it with 0x75 (String), hash it and set valueHash.
    5. Tag name with 0x75 (String), hash it and set nameHash.
    6. Concat nameHash and valueHash in this order, and append to hashList.
  5. Sort hashList.
  6. Concat hashList elements, tag with 0x64 (Dict), hash it and return.
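The steps above can be sketched in Python as follows. This is an illustration under stated assumptions rather than a reference implementation: the hashing algorithm is fixed to SHA2-256, strings are assumed to be encoded as UTF-8 before hashing, set elements are sorted before being concatenated, and the helper names (`tagged_hash`, `string_hash`, `blob_hash`) are hypothetical.

```python
import hashlib
import unicodedata

REDACTED = "**REDACTED**"


def tagged_hash(tag, data):
    """Prepend the type tag to the bytes and hash the result."""
    return hashlib.sha256(bytes([tag]) + data).digest()


def string_hash(text):
    """Hash a string, or unwrap it if it already carries a redacted hash."""
    if text.startswith(REDACTED):
        # The redacted form carries the hash as hexadecimal; work on bytes.
        return bytes.fromhex(text[len(REDACTED):])
    normalised = unicodedata.normalize("NFC", text)
    # Assumption: strings are encoded as UTF-8 before tagging.
    return tagged_hash(0x75, normalised.encode("utf-8"))


def blob_hash(blob):
    """Hash a normalised blob, i.e. one without null or empty values."""
    hash_list = []
    for name, value in blob.items():
        if isinstance(value, (set, frozenset)):
            el_list = sorted(string_hash(el) for el in value)
            value_hash = tagged_hash(0x73, b"".join(el_list))
        else:
            value_hash = string_hash(value)
        name_hash = tagged_hash(0x75, name.encode("utf-8"))
        hash_list.append(name_hash + value_hash)
    return tagged_hash(0x64, b"".join(sorted(hash_list)))


original = {"foo": "abc", "bar": "xyz"}
redacted = {"foo": REDACTED + string_hash("abc").hex(), "bar": "xyz"}
same_hash = blob_hash(original) == blob_hash(redacted)
```

Because both the pair hashes and the set element hashes are sorted before concatenation, the result does not depend on the order in which attributes or set elements are listed.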

Sorting

A set of hashes is sorted by comparing their byte lists one byte at a time. For example, given a set ["foo", "bar"] you'll get the following byte lists after hashing them as unicode:

[ [166,166,229,231,131,195,99,205,149,105,62,193,137,194,104,35,21,217,86,134,147,151,115,134,121,181,99,5,242,9,80,56]
, [227,3,206,11,208,244,193,253,254,76,193,232,55,215,57,18,65,226,224,71,223,16,250,97,1,115,61,193,32,103,93,254]
]

The set is already sorted, given that 166 is smaller than 227.
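As an illustration, Python compares `bytes` objects lexicographically by unsigned byte value, so the built-in `sorted` reproduces this ordering. The variable names `first` and `second` are only labels for the two byte lists shown above.

```python
first = bytes([166, 166, 229, 231, 131, 195, 99, 205, 149, 105, 62, 193,
               137, 194, 104, 35, 21, 217, 86, 134, 147, 151, 115, 134,
               121, 181, 99, 5, 242, 9, 80, 56])
second = bytes([227, 3, 206, 11, 208, 244, 193, 253, 254, 76, 193, 232,
                55, 215, 57, 18, 65, 226, 224, 71, 223, 16, 250, 97,
                1, 115, 61, 193, 32, 103, 93, 254])

# The comparison is decided by the first differing byte: 166 < 227.
ordered = sorted([second, first])
```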

Tagging

The tagging operation prepends a byte identifying the type to a list of bytes.

Tags:

  • Dict: 0x64
  • Hash: 0x72
  • Set: 0x73
  • String: 0x75
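A minimal sketch of the tagging operation, using the tag values listed above. The `TAGS` table and the `tag` function are hypothetical names chosen for this example.

```python
# Tag bytes as listed above.
TAGS = {"Dict": 0x64, "Hash": 0x72, "Set": 0x73, "String": 0x75}


def tag(kind, data):
    """Prepend the one-byte type tag to a sequence of bytes."""
    return bytes([TAGS[kind]]) + data


tagged = tag("String", b"foo")
```

The tag values are the ASCII codes of the letters d, r, s and u, so the tagged bytes for the String "foo" are `b"ufoo"`.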

The blob hashing algorithm is an implementation of the objecthash algorithm.

Redact

To redact a value, take its hash (the partial hash produced by the algorithm above) and prepend the string **REDACTED** to its hexadecimal string representation.

redact : Name -> HashingAlgorithm -> Blob -> Blob

For example,

i_0 = 
  Blob
    [ ("foo", "abc")
    , ("bar", "xyz")
    ]

redact "foo" i_0

Will result in

Blob
  [ ("foo", "**REDACTED**2a42a9c91b74c0032f6b8000a2c9c5bcca5bb298f004e8eff533811004dea511")
  , ("bar", "xyz")
  ]

In both cases the resulting hash is

12202b90b5d4a714f5fd5f7c670067f090f972dd7be8a472965c90572699249672aa

They are equivalent because the hash for "abc" is

2a42a9c91b74c0032f6b8000a2c9c5bcca5bb298f004e8eff533811004dea511

Notice that the first two bytes of the resulting hash, 0x12 and 0x20, are prepended because the hashing algorithm used in this example is SHA2-256.
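The redact operation could be sketched as follows. This is a simplified illustration, not the specified behaviour: the hashing algorithm is passed as a `hashlib` algorithm name, strings are assumed to be UTF-8 encoded and already normalised blobs are assumed as input. The helper names `string_hash` and `value_hash` are hypothetical.

```python
import hashlib
import unicodedata

REDACTED = "**REDACTED**"


def string_hash(text, algorithm):
    """Partial hash of a String value, as raw bytes."""
    if text.startswith(REDACTED):
        return bytes.fromhex(text[len(REDACTED):])
    data = unicodedata.normalize("NFC", text).encode("utf-8")
    return hashlib.new(algorithm, b"\x75" + data).digest()


def value_hash(value, algorithm):
    """Partial hash of a String or Set value, as raw bytes."""
    if isinstance(value, (set, frozenset)):
        elements = sorted(string_hash(el, algorithm) for el in value)
        return hashlib.new(algorithm, b"\x73" + b"".join(elements)).digest()
    return string_hash(value, algorithm)


def redact(name, algorithm, blob):
    """Return a copy of blob with the value for name replaced by its hash."""
    result = dict(blob)
    result[name] = REDACTED + value_hash(blob[name], algorithm).hex()
    return result


i_0 = {"foo": "abc", "bar": "xyz"}
i_1 = redact("foo", "sha256", i_0)
```

The input blob is copied rather than mutated, so `i_0` still holds the original value after the call.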

Normalise

Blob normalisation removes nullable values to ensure operations such as hashing produce the exact same value.

Any value that is an empty String, an empty Set or a Set containing only empty strings is normalised as a null value and removed from the normalised result.

The nullable normalisation allows parity between flexible formats like JSON and more rigid formats like CSV. For example, an empty field in CSV is an empty string, so it is normalised as null.

normalise : Blob -> Blob

Algorithm

  1. Let blob be the blob to normalise.
  2. Let result be an empty dictionary.
  3. Foreach (name, value) pair in blob:

    1. If value is null, continue.
    2. If value is an empty String, continue.
    3. If value is an empty Set, continue.
    4. If value is a Set:

      1. Let normSet be an empty Set.
      2. Foreach el in value:

        1. If el is null, continue.
        2. If el is an empty String, continue.
        3. Otherwise, normalise el and add the result to normSet.
      3. If normSet is empty, continue.
      4. Otherwise, set (name, normSet) to result.
    5. Otherwise,

      1. Let normValue be null.
      2. Normalise value and set normValue.
      3. Set (name, normValue) to result.
  4. Return result.
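The normalisation steps can be sketched as follows. This is an illustrative reading of the algorithm: blobs are modelled as Python dictionaries whose values are strings, sets of strings or `None`, and the helper name `normalise_string` is hypothetical.

```python
import unicodedata


def normalise_string(text):
    """String normalisation: NFC form."""
    return unicodedata.normalize("NFC", text)


def normalise(blob):
    """Drop nullable values and normalise the remaining strings."""
    result = {}
    for name, value in blob.items():
        if value is None or value == "":
            continue
        if isinstance(value, (set, frozenset, list)):
            # `if el` skips both null and empty-string elements.
            norm_set = frozenset(normalise_string(el) for el in value if el)
            if norm_set:
                result[name] = norm_set
        else:
            result[name] = normalise_string(value)
    return result


raw = {
    "a": "",
    "b": frozenset(),
    "c": frozenset({""}),
    "d": "x",
    "e": frozenset({"", "y"}),
    "f": None,
}
clean = normalise(raw)
```

Only the attributes `d` and `e` survive in `clean`, and the empty element of `e` is dropped.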

String normalisation

The string normalisation algorithm is the NFC form as defined by the Unicode standard.
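To illustrate why this matters for hashing, the sketch below uses Python's standard `unicodedata` module. The sample word is an arbitrary choice for the example.

```python
import unicodedata

# "é" written as "e" followed by a combining acute accent (U+0301).
decomposed = "Cafe\u0301"
# NFC composes the pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

# The two spellings render identically but have different bytes, so they
# would hash differently if they were not normalised first.
bytes_differ = decomposed.encode("utf-8") != composed.encode("utf-8")
```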

Validate

A blob is valid if all its pairs are valid according to the schema.

validate : Blob -> Result ValidationError Blob

Algorithm

  1. Let blob be the normalised blob of data to validate.
  2. Let schema be the list of attribute definitions to validate against.
  3. Let result be null.
  4. Foreach (name, value) pair in blob:

    1. If name doesn't exist in the schema, abort. The blob has an illegal attribute.
    2. Let attribute be the attribute for name found in schema.
    3. If value is nullable, abort. The blob is not normalised.
    4. If value is a Set:

      1. If attribute has cardinality “1”, abort. The blob has an illegal value.
      2. Foreach el in value:

        1. If el is nullable, abort. The blob is not normalised.
        2. If el is not of the datatype defined in attribute, abort. The blob has an illegal value.
        3. Otherwise, continue.
    5. If attribute has cardinality “n”, abort. The blob has an illegal value.
    6. If value is not of the datatype defined in attribute, abort. The blob has an illegal value.
    7. Otherwise, continue.
  5. Set blob to result and return.

“nullable” means any value that is considered null in the normalisation process.
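The validation steps can be sketched as follows. Several parts are assumptions made for the example: the schema is modelled as a dictionary of hypothetical `Attribute` records, only the String and Integer datatypes are checked (Integer with a simplified pattern), and a failure raises an exception instead of returning a Result.

```python
import re
from dataclasses import dataclass


class ValidationError(Exception):
    """Raised when a blob does not conform to the schema."""


@dataclass(frozen=True)
class Attribute:
    datatype: str
    cardinality: str  # "1" or "n"


# Assumed datatype checks; the Integer pattern is a simplification.
CHECKS = {
    "String": lambda text: True,
    "Integer": lambda text: re.fullmatch(r"-?\d+", text) is not None,
}


def is_nullable(value):
    """True for values that normalisation would treat as null."""
    if value is None or value == "":
        return True
    if isinstance(value, (set, frozenset)):
        return all(el is None or el == "" for el in value)
    return False


def validate(blob, schema):
    """Return blob unchanged if every pair is valid, else raise."""
    for name, value in blob.items():
        if name not in schema:
            raise ValidationError(f"illegal attribute: {name}")
        attribute = schema[name]
        if is_nullable(value):
            raise ValidationError(f"not normalised: {name}")
        check = CHECKS[attribute.datatype]
        if isinstance(value, (set, frozenset)):
            if attribute.cardinality == "1":
                raise ValidationError(f"illegal value: {name}")
            for el in value:
                if is_nullable(el):
                    raise ValidationError(f"not normalised: {name}")
                if not check(el):
                    raise ValidationError(f"illegal value: {name}")
            continue
        if attribute.cardinality == "n":
            raise ValidationError(f"illegal value: {name}")
        if not check(value):
            raise ValidationError(f"illegal value: {name}")
    return blob


schema = {
    "name": Attribute("String", "1"),
    "x": Attribute("Integer", "1"),
    "y": Attribute("Integer", "n"),
}
valid = validate({"name": "Foo", "x": "0", "y": frozenset({"1", "2"})}, schema)
```

In this sketch the loop moves on to the next pair once a Set has been checked, so the cardinality "n" check only applies to String values.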

Conventional attributes

This section is non-normative.

It is convention for most registers to provide a few common attributes with particular meaning. These are:

  • start-date: (Datetime) The date the element started to exist in the world. This is not the same as the Entry timestamp.
  • end-date: (Datetime) The date the element ceased to exist in the world.
  • name: (String) The common name for the element.

For example, a register could identify an element with DD (the ISO 3166-1 alpha-2 code formerly assigned to the German Democratic Republic) with the data:

Blob
  [ ("start-date", "1949")
  , ("end-date", "1990-10-02")
  , ("official-name", "Germany Democratic Republic")
  , ("name", "East Germany")
  ]

However, it could have been added to the register in 2016:

Entry
  { number : 3
  , key: ID "DD"
  , timestamp : Timestamp (2016, 4, 5, 13, 23, 5, Utc)
  , blob : Hash "1220e1357671d0da24668952373d0cdf9f7659a1b155e45c8fb3c2f24331e46edc26"
  }

© Crown copyright released under the Open Government Licence.