Relational ASCII

Just got done reading C.J. Date's book Database in Depth: Relational Theory for Practitioners about the relational model and relational algebra.

We want to provide an ASCII format for relations. We want also to provide some algebraic operations on them. The term "relation" is used rather than table. A relation is like a simple table with a fixed set of columns. Each column has a name and a type. Column order should not matter as the columns are named.

File Format

Here we give a simple format for a file containing a relation. Tuples (rows) are each on their own line. Lines are separated by a newline character as in normal ASCII text. Values in each tuple are separated by tab characters.

The first line of input must be a header containing meta data. The "column headers" are separated the same way all rows are. Each column header is of the form "name: type" where ": type" is optional. "type" should be one of "string" (the default), "name", "integer", "decimal", "date", or "/regular expression/". Maybe "counting", "natural", "integer", and "rational" for numbers. Maybe (start, end) or [start .. end) for intervals? Different types could use their natural orders, dictionary for names and/or strings, chronological for dates and times. Maybe we should support only one format for dates and times: yyyy-mm-dd hh:mm:ss.

    player jersey: integer
    Eric   17
    Fred   10
    

Such a file could be edited in any text editor. To ensure consistency (integrity, validity, etc.) you would have to run a program like rcheck below. Some spreadsheet programs (like Gnumeric) can also be used to edit such files. It might be nice to have a specialized relation editing program. This might be something like a simple spreadsheet that enforces type adherence in each cell.

Maybe each header could be complicated as name[[: domain]: type]. This would allow a finer check for "union compatibility" of relations for set operations. See below.

Operations

Then we can have specialized tools like rprojection, rselection, rjoin, etc. Composite commands can be composed with POSIX pipes. Also could use rsort, rmerge, others? Maybe rsuss (or rguess) which would take a headless table, and deduce column types. Perhaps rcheck could check that values conform to type. Would need rel2html, rel2dif, etc.

Metadata

These are some preliminary ideas (2010-11-20) about extending the format.

  type: name = spec
  table: name

The table line would allow multiple tables in the same file. It would be immediately followed (on the next line) by the table as before. Maybe the table name could be followed (on the same line) by an equals sign and a relational algebraic expression? All the types should precede all the tables? The type line could define types so that the header line could be simpler. e.g. type: CountingNumber = Integer [1:] Oh yeah, that brings up fancier typing

  type: name = String (range for length or /regular expression/)
  type: name = Integer (range or list)
  type: name = value from table.column
  type: name = Rational (significant digits) [range or list]

where

 range is [min: max] where min or max could be omitted.
 list is [one, two, ...]

The primitive types might be String, Integer, Boolean, Rational. Either min or max above could be omitted. If type is omitted then String is assumed for a list and Integer for a range. The third form implies a single column projection from another table. The Boolean form would allow for "true" - "false", 1 - 0, "on" - "off", etc.

What about defaults? The first thing in a list could be the default. (For Booleans this might conflict with the true value coming first.) Booleans have other issues like maybe false = 0 and true being anything else. Maybe it should be something like type: Boolean false = value true = * or something like it where the asterisk means anything other than the value for false. The first truth value would be the default. Maybe an explicit default: value?

Rational numbers pose an interesting issue. Some might want fixed decimals. Some might want to specify significant digits. Ranges (intervals) could be open, closed, or half open. Well, with significant digits I suppose there really are no closed or half open intervals. e.g. (0:1] with three significant digits would be like [0.001:1]? Should there be a default number of significant digits?

Currency is really an integer in disguise. e.g. pennies instead of dollars. Maybe the Integer type should be expanded by specifying a power of ten? Then we don't need the idea of Rationals with a fixed number of decimal places.

Don't forget an empty entry in the list, if that is allowed. Should empty values of other types be allowed? That's like having null in the relation which folks like C.J. Date don't like.

Properties

Maybe meta data could be in a separate file? A Java properties file or a JSON file?

HTML

Could extract tables from HTML if each table had a "name" or "id" and each had a first row of headers. Where would the meta data go? Maybe in a "type" attribute?

Questions


Valid HTML 4.01! Eric Blossom http://www.BlossomAssociates.net/tables/relations.html 2005-12-02, 2007-05-15, 2010-11-20