Alternatives
There are no shortage of well established alternatives to NestedText for storing data in a human-readable text file. The features and shortcomings of some of these alternatives are discussed next. NestedText is intended to be used in situations where people either create, modify, or consume the data directly. It is this perspective that informs these comparisons.
JSON
JSON is a subset of JavaScript suitable for holding data. Like NestedText, it consists of a hierarchical collection of objects (dictionaries), lists, and strings, but also allows numbers, Booleans and nulls. In practice, JSON is largely generated and consumed by machines. The data is stored as text, and so can be read, modified, and consumed directly by the end user, but the format is not optimized for this use case and so is often cumbersome or inefficient when used in this manner.
JSON supports all the native data types common to most languages. Syntax is
added to values to unambiguously indicate their type. For example, 2
,
2.0
, and "2"
are three different values with three different types
(integer, real, string). This adds two types of complexity. First, the rules
for distinguishing various types must be learned and used. Second, all strings
must be quoted, and
with quoting comes escaping, which is needed to allow quote characters to be
included in strings.
JSON was derived as a subset of JavaScript, and so inherits a fair amount of syntactic clutter that can be annoying for users to enter and maintain. In addition, features that would improve clarity are lacking. Comments are not allowed, multiline strings are not supported, and whitespace is insignificant (leading to the possibility that the appearance of the data may not match its true structure).
NestedText only supports three data types (strings, lists and dictionaries) and does not have the baggage of being the subset of a general purpose programming language. The result is a simpler language that has the following clear advantages over JSON as a human readable and writable data file format:
strings do not require quotes
comments
multiline strings
no need to escape special characters
commas are not used to separate dictionary and list items
The following examples illustrate the difference between JSON and NestedText:
JSON
{
"treasurer": {
"name": "Fumiko Purvis",
"address": "3636 Buffalo Ave\nTopeka, Kansas 20692",
"phone": "1-268-555-0280",
"email": "fumiko.purvis@hotmail.com",
"additional roles": [
"accounting task force"
]
}
}
NestedText
treasurer:
name: Fumiko Purvis
# Fumiko's term is ending at the end of the year.
# She will be replaced by Merrill Eldridge.
address:
> 3636 Buffalo Ave
> Topeka, Kansas 20692
phone: 1-268-555-0280
email: fumiko.purvis@hotmail.com
additional roles:
- accounting task force
YAML
YAML is considered by many to be a human friendly alternative to JSON. There is less syntactic clutter and the quoting of strings is optional. However, it also supports a wide variety of data types and formats. The optional quoting can result in the type of values being ambiguous. To distinguish between the various types, a complicated and non-intuitive set of rules developed. YAML at first appears very appealing when used with simple examples, but things can quickly become complicated or provide unexpected results. A reaction to this is the use of YAML subsets, such as StrictYAML. However, the subsets still try to maintain compatibility with YAML and so inherit much of its complexity. For example, both YAML and StrictYAML support nine different ways of writing multiline strings.
YAML avoids excessive quoting and supports comments and multiline strings, but the multitude of formats and disambiguation rules make YAML a difficult language to learn, and the ambiguities creates traps for the user. To illustrate these points, the following is a condensation of a YAML document taken from the GitHub documentation that describes how to configure continuous integration using Python:
YAML
name: Python package
on: [push]
build:
python-version: [3.6, 3.7, 3.8, 3.9, 3.10]
steps:
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install pytest
if [ -f 'requirements.txt' ]; then pip install -r requirements.txt; fi
- name: Test with pytest
run: |
pytest
YAML (round-trip)
name: Python package
true:
- push
build:
python-version:
- 3.6
- 3.7
- 3.8
- 3.9
- 3.1
steps:
- name: Install dependencies
run: 'python -m pip install --upgrade pip
pip install pytest
if [ -f ''requirements.txt'' ]; then pip install -r requirements.txt; fi
'
- name: Test with pytest
run: 'pytest
'
on
key was inappropriately converted totrue
.Python version
3.10
was inappropriately converted to3.1
.The multiline string was converted to a different representation that added blank lines between each line, greatly confusing the presentation of the string.
Escaping was required for the quotes on
'requirements.txt'
.Indentation is not an accurate reflection of nesting (notice that
python-version
and- 3.6
have the same indentation, but- 3.6
is contained insidepython-version
).
One might expect that the format might change a bit while the underlying
information remains constant. But that is not the case. The ambiguities in the
format result in both on
and 3.10
being changed in value and meaning.
Now consider the NestedText version; it is simpler and not subject to misinterpretation.
NestedText
name: Python package
on:
- push
build:
python-version:
- 3.6
- 3.7
- 3.8
- 3.9
- 3.10
steps:
-
name: Install dependencies
run:
> python -m pip install --upgrade pip
> pip install pytest
> if [ -f 'requirements.txt' ]; then pip install -r requirements.txt; fi
-
name: Test with pytest
run: pytest
simple
unambiguous (no implicit typing)
no unexpected conversions of the data
syntax is insensitive to special characters within text
safe, no risk of malicious code execution
round-tripping from NestedText does not result in changed values or ugly and confusing presentations
TOML or INI
TOML is a configuration file format inspired by the well-known INI syntax. It supports a number of basic data types (notably including dates and times) using syntax that is more similar to JSON (explicit but verbose) than to YAML (succinct but confusing). As discussed previously, though, this makes it the responsibility of the user to specify the correct type for each field.
Another flaw in TOML is that it is difficult to specify deeply nested structures. The only way to specify a nested dictionary is to give the full key to that dictionary, relative to the root of the entire hierarchy. This is not much a problem if the hierarchy only has 1-2 levels, but any more than that and you find yourself typing the same long keys over and over. A corollary to this is that TOML-based configurations do not scale well: increases in complexity are often accompanied by disproportionate decreases in readability and writability.
Here is an example of a configuration file in TOML and NestedText:
TOML
[plugins]
auth = ['avendesora']
archive = ['ssh', 'gpg', 'avendesora', 'emborg', 'file']
publish = ['scp', 'mount']
[auth.avendesora]
account = 'login'
field = 'passcode'
[archive.file]
src = ['~/src/nfo/contacts']
[archive.avendesora]
[archive.emborg]
config = 'rsync'
[publish.scp]
host = ['backups']
remote_dir = 'archives/{date:YYMMDD}'
[publish.mount]
drive = '/mnt/secrets'
remote_dir = 'sparekeys/{date:YYMMDD}'
NestedText
plugins:
auth:
- avendesora
archive:
- ssh
- gpg
- avendesora
- emborg
- file
publish:
- scp
- mount
auth:
avendesora:
account: login
field: passcode
archive:
file:
src:
- ~/src/nfo/contacts
avendesora:
{}
emborg:
config: rsync
publish:
scp:
host:
- backups
remote_dir: archives/{date:YYMMDD}
mount:
drive: /mnt/secrets
remote_dir: sparekeys/{date:YYMMDD}
text does not require quoting or escaping
data is left in its original form
indentation used to succinctly represent nested data
the structure of the file matches the structure of the data
heavily nested data is represented efficiently
CSV or TSV
CSV (comma-separated values) and the closely related TSV (tab-separated values) are exchange formats for tabular data. Tabular data consists of multiple records where each record is made up of a consistent set of fields. The format separates the records using line breaks and separates the fields using commas or tabs. Quoting and escaping is required when the fields contain line breaks or commas/tabs.
Here is an example data file in CSV and NestedText.
CSV
Year,Agriculture,Architecture,Art and Performance,Biology,Business,Communications and Journalism,Computer Science,Education,Engineering,English,Foreign Languages,Health Professions,Math and Statistics,Physical Sciences,Psychology,Public Administration,Social Sciences and History
1970,4.22979798,11.92100539,59.7,29.08836297,9.064438975,35.3,13.6,74.53532758,0.8,65.57092343,73.8,77.1,38,13.8,44.4,68.4,36.8
1980,30.75938956,28.08038075,63.4,43.99925716,36.76572529,54.7,32.5,74.98103152,10.3,65.28413007,74.1,83.5,42.8,24.6,65.1,74.6,44.2
1990,32.70344407,40.82404662,62.6,50.81809432,47.20085084,60.8,29.4,78.86685859,14.1,66.92190193,71.2,83.9,47.3,31.6,72.6,77.6,45.1
2000,45.05776637,40.02358491,59.2,59.38985737,49.80361649,61.9,27.7,76.69214284,18.4,68.36599498,70.9,83.5,48.2,41,77.5,81.1,51.8
2010,48.73004227,42.06672091,61.3,59.01025521,48.75798769,62.5,17.6,79.61862451,17.2,67.92810557,69,85,43.1,40.2,77,81.7,49.3
NestedText
-
Year: 1970
Agriculture: 4.22979798
Architecture: 11.92100539
Art and Performance: 59.7
Biology: 29.08836297
Business: 9.064438975
Communications and Journalism: 35.3
Computer Science: 13.6
Education: 74.53532758
Engineering: 0.8
English: 65.57092343
Foreign Languages: 73.8
Health Professions: 77.1
Math and Statistics: 38
Physical Sciences: 13.8
Psychology: 44.4
Public Administration: 68.4
Social Sciences and History: 36.8
-
Year: 1980
Agriculture: 30.75938956
Architecture: 28.08038075
Art and Performance: 63.4
Biology: 43.99925716
Business: 36.76572529
Communications and Journalism: 54.7
Computer Science: 32.5
Education: 74.98103152
Engineering: 10.3
English: 65.28413007
Foreign Languages: 74.1
Health Professions: 83.5
Math and Statistics: 42.8
Physical Sciences: 24.6
Psychology: 65.1
Public Administration: 74.6
Social Sciences and History: 44.2
-
Year: 1990
Agriculture: 32.70344407
Architecture: 40.82404662
Art and Performance: 62.6
Biology: 50.81809432
Business: 47.20085084
Communications and Journalism: 60.8
Computer Science: 29.4
Education: 78.86685859
Engineering: 14.1
English: 66.92190193
Foreign Languages: 71.2
Health Professions: 83.9
Math and Statistics: 47.3
Physical Sciences: 31.6
Psychology: 72.6
Public Administration: 77.6
Social Sciences and History: 45.1
-
Year: 2000
Agriculture: 45.05776637
Architecture: 40.02358491
Art and Performance: 59.2
Biology: 59.38985737
Business: 49.80361649
Communications and Journalism: 61.9
Computer Science: 27.7
Education: 76.69214284
Engineering: 18.4
English: 68.36599498
Foreign Languages: 70.9
Health Professions: 83.5
Math and Statistics: 48.2
Physical Sciences: 41
Psychology: 77.5
Public Administration: 81.1
Social Sciences and History: 51.8
-
Year: 2010
Agriculture: 48.73004227
Architecture: 42.06672091
Art and Performance: 61.3
Biology: 59.01025521
Business: 48.75798769
Communications and Journalism: 62.5
Computer Science: 17.6
Education: 79.61862451
Engineering: 17.2
English: 67.92810557
Foreign Languages: 69
Health Professions: 85
Math and Statistics: 43.1
Physical Sciences: 40.2
Psychology: 77
Public Administration: 81.7
Social Sciences and History: 49.3
text does not require quoting or escaping
arbitrary data hierarchies are supported
file representation tends to be tall and skinny rather than short and fat
easier to read
Really, Only Strings?
NestedText and its alternatives are all trying to represent structured data. Of them, only NestedText limits you to strings for the scalar values. The alternatives all allow other data types to be represented as well, such as integers, reals, Booleans, etc. Since real applications invariably require all these data types, you might think, “if I use NestedText, I’ll have to convert all these strings myself, and that will make my application code more complicated”. In fact, using NestedText will make your application code more robust with little to no increase in complexity:
Schemas make data conversions easy.
For robustness, all data should be validated when reading it to assure there are no errors. This is performed conveniently and efficiently with a schema. Schemas are used to specify the expected type for each value and are easily extended to perform type conversion as needed. For example, if a particular value should be an integer but a string is provided, as with NestedText, the package that implements the schema can be configured to attempt to convert the string to an integer and only report an error if it cannot.
You have to handle the bad user input anyway.
Applications that need to interpret the input data always make assumptions about the data being read. For example, email fields are expected to contain strings that can be interpreted as an email address. In practice, every field can and probably should be checked in some way. Even with NestedText that constrains the scalar values to strings, one must assure that a list or dictionary is not given where a string is expected. When every value is being checked there little to no benefit to the underlying data receptacle being aware the type of each value. Rather it is very constraining.
No format can support all possible data types.
NestedText gains simplicity by jettisoning native support for scalar data types other than strings. However it is important to recognize that the alternatives must do this as well. There are an unlimited number of data types that can be supported and they cannot support them all. Common data types that are generally not supported include dates, times, and quantities (numbers with units, such as $20.00 and 47 kΩ). With all languages there is a decision to be made: what types should be supported natively. Each additional type increases the complexity of the format. If only strings are supported, as with NestedText, things are pretty simple. Adding any other data type then requires supporting quoting and escaping, which is a substantial jump up in complexity.
Data types that are not natively supported are generally passed as strings
that are later converted to the right type by the end application. This
approach actually provides substantial benefits. The end application has
context that a general purpose data reader cannot have. For example, the
date 10/07/08
could represent either 10 August 2008 or October 7, 2008,
or perhaps even July 8, 2010. Only the user and the application would know
which.
Native data types can be ambiguous.
The type of the value 2
is ambiguous; it may be integer or real. This
may cause problems when combined into an array, such as [1.85, 1.94, 2,
2.09]
. A casually written program may choke on a non-homogeneous array
that consists of an integer among the floats. This is the reason that JSON
does not distinguish between integers and reals.
YAML is notorious for ambiguities because it allows unquoted strings. 2
is a valid integer, real, and string. Similarly, no
is a valid Boolean
and string. In fact, every single value in YAML that is not quoted is also
a valid string. Many people that use YAML simply quote every string, but
that does not solve all the problems because things that are not intended to
be strings can be converted to strings, such as 09
.
There is also the issue of the internal representation of the data. Is the integer represented using 32 bits, 64 bits, or can the integer by arbitrarily large? Is a real number represented as a 64 bit or 128 bit float, or is it represented by a decimal or rational number? Are exceptional values such as infinity or not-a-number supported? Sometimes such things are specified in the definition of the format, but often they are left as details of the implementation. The result could be overflows, underflows, loss of precision, errors, and compatibility issues.
Native data types can lose information.
It is common to format real numbers so as to convey the meaningful precision
of the number. For example, 2
or 2.
represents a number with one
digit of precision, 2.0
represents a number with two digits of
precision, 2.00
represents a number with three digits of precision, etc.
This information on the precision of the number is lost when these numbers
are converted to the float data type.
This same issue also causes problems when representing version numbers. The
number 3.10
is used to represent version three point ten, but when
converted to a float becomes version three point one.
There are also cases where multiple formats map to the same underlying data type. For example, integers may be given in binary, octal, decimal, or hexadecimal formats. YAML provides almost a dozen different ways to specify strings. This causes problems when round-tripping, which is where you read a file, perhaps process it, and then write it back out. Since the data is converted to an internal data type, the original formatting is lost, meaning that the program that writes out the data cannot know how it was originally specified. Integers are generally written out as decimal number regardless of how they were specified. In YAML, the writer checks to see if a string contains a newline and if so simply chooses one of the 9 possible multiline string formats arbitrarily. This is why in the round-trip YAML example given above the Python script ends up being interleaved with blank lines.
Native types may be unfamiliar, inconvenient, or confusing for end users.
Casual users may not understand that 2
is treated differently than
2.0
, which may cause issues in applications that are not carefully
written.
TOML natively accepts dates and times, but only in ISO-8601 formats. Casual users are unlikely to be familiar with this format or may find it awkward or cumbersome.
YAML natively accepts sexagesimal (base 60) numbers in the form 2:30:00
,
which YAML converts to 9000. If this is a duration, it would likely imply
2 hours, 30 minutes and 0 seconds, which totals to 9000 seconds. It may be
also used for the time of day. Someone that normally uses twelve hour time
formatting might write 2:30:00 AM
and get a string. Someone that uses
twenty-four hours formatting might write 2:30:00
and get the integer
9000, or they might write 02:30:00
and get a string. However, if they
entered a time 12 hours later, 16:30:00
, they would get an integer
again.
Native data types are distinguished from each other by using conventions that are second nature to programmers. Conventions such as “you must quote strings”, “quote characters in strings must be escaped”, “you escape an escape character by doubling it up”, “real numbers must contain a decimal point” and “real numbers may not contain units”.
Casual users are unlikely to know these conventions, which causes frustration and errors. Forcing them to know and use these conventions represents an undesirable and sometimes overwhelming burden. This is particularly true for YAML, which can be a minefield even for programmers. Consider the following:
Hey there!
and "Hey there!"
represent the same string.She said, "Hey there!"
is a valid string,
but "She said, "Hey there!""
is an error.She said, "Hey there!"
is a valid string,
but She said: "Hey there!"
is an error (notice the comma is
converted to a colon).3.10.4
is a string, but 3.10
is a real and 3
is an integer.10
is 10, but 010
is 8 and 09
is “09”, a string.Now
is a string, but No
is a Boolean.(1 + 2)
is a string, but [1 + 2]
is a list.02:30:00
is a string but 2:30:00
is 9000.Only programmers with substantial experience with YAML can anticipate or even understand this behavior.
Other languages have similar, but less extreme challenges, particularly the need for quoting and escaping.
Support for non-string types creates the requirement for quoting and escaping, and ultimately leads to either verbosity (JSON) or ambiguity (YAML).
Every additional supported data type brings a challenge; how to unambiguously distinguish it from the others. The challenge is particularly acute for strings because they consist of any possible sequence of characters and so can be confused with all other data types. NestedText addresses this issue by limiting the scalar values to only be strings. That way, there is no need to distinguish the strings from other possible data types.
The alternatives all distinguish strings by surrounding them with quotes. This adds visual clutter and makes them more difficult to type. This is not generally a problem if there are only a few stings, but it becomes a drag if there is are many. However, quoting brings another challenge. Since a string can consist of any sequence of characters, it can include the quote characters. Now the quote characters within the string must be distinguished from the quote characters that delimit the string; a process referred to as escaping the character. This is often done with an special escape character, generally the backslash, but may be done by duplicating the character to be escaped. The string may naturally contain escape characters and they would need escaping as well. This represents a deep hole. For example, consider the following Python dictionary that contains a collection of regular expressions. The regular expressions are quoted strings that by their very nature generally require a large amount of escaping:
regexes = dict(
double_quoted_string = r'"(?:[^"\\]|\\.)*"',
single_quoted_string = r"'(?:[^'\\]|\\.)*'",
identifier = r'[a-zA-Z_][a-zA-Z_0-9]*',
number = r"[+-]?[0-9]+\.?[0-9]*(?:[eE][+-]?[0-9]+)?",
)
Converting this to JSON illustrates the problem:
{
"double_quoted_string": "\"(?:[^\"\\\\]|\\\\.)*\"",
"single_quoted_string": "'(?:[^'\\\\]|\\\\.)*'",
"identifier": "[a-zA-Z_][a-zA-Z_0-9]*",
"number": "[+-]?[0-9]+\\.?[0-9]*(?:[eE][+-]?[0-9]+)?"
}
The number of escape characters more than doubled. This problem does not occur in NestedText, which is actually cleaner than the original Python:
double_quoted_string: "(?:[^"\\]|\\.)*"
single_quoted_string: '(?:[^'\\]|\\.)*'
identifier: [a-zA-Z_][a-zA-Z_0-9]*
number: [+-]?[0-9]+\.?[0-9]*(?:[eE][+-]?[0-9]+)?
Data type is an implementation detail that should not concern the end user.
In general, users that are expected to read, write, or modify structured data benefit from formats tailored to their needs. That only happens when the values are passed as strings that are interpreted by the end application.
Native data types should only be used when both the data generator and the data consumer are machines, preferably using the same software packages to both read and write the data files. In such cases, only programmers would view or edit the files, and only in unusual cases.