Table Manipulations

Some table manipulation programs. Most are conversions between different forms:

tab-separated-values (TSV) This is a convenient format for simple tables. Unix utilities like sort, cut, paste, and join work well on simple tables of tab-separated values, although sort needs -t"^I" and awk needs -F"^I" (where ^I stands for a literal tab character). Use vi with ":set tabstop=20" or some larger number for a convenient editor. The Unix utilities often have an option to specify a delimiter other than tab. Some other common delimiters are colon, vertical bar, comma, and semicolon.
comma-separated-values (CSV) This format is a bit more complicated. Commas separate fields. If a field contains a comma, the whole field is enclosed in double quotes. If a field contains a double quote character, the field is quoted and each quote character in the field is doubled.
hypertext (HTML)
Plain Text This is what is commonly spit out by database software like DB2. There is a line of column headings followed by a line of dashes. The rest of the table follows with spaces inserted to make the columns line up. /rdb is a set of shell scripts that can manipulate this format as well as implement a relational algebra.
RDB This is TSV with some header lines that name the columns and indicate the data type of each column. The header lines can be preceded by commentary. It is packaged with a set of shell scripts and filters for doing queries. An interesting extension of this is NoSQL.
DIF, SYLK, and other formats come from the spreadsheet world. We do not use these, since spreadsheets these days can read TSV, CSV, and even HTML.
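
The CSV quoting rule described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated here, not one of the scripts in this collection; the function names quote_csv_field and csv_line are made up for the example, and it makes no attempt to handle fields that contain newlines.

```python
def quote_csv_field(field: str) -> str:
    """Quote one field using the comma / double-quote rule described above."""
    if "," in field or '"' in field:
        # Double any embedded quote characters, then wrap the whole field.
        return '"' + field.replace('"', '""') + '"'
    # Fields without commas or quotes are written unchanged.
    return field


def csv_line(fields):
    """Join a list of string fields into one CSV line (no trailing newline)."""
    return ",".join(quote_csv_field(f) for f in fields)


if __name__ == "__main__":
    # Prints: plain,"a,b","say ""hi"""
    print(csv_line(["plain", "a,b", 'say "hi"']))
```

Going the other way (CSV back to plain fields) is the harder direction, which is why the converters listed below exist.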

HTML makes a good output format. You can get an ASCII pretty print by piping to lynx. e.g. tsv2html table.tsv | lynx -stdin -dump
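
As a rough idea of what a TSV to HTML conversion involves, here is a minimal Python sketch. It is not the tsv2html.py listed below; the function name tsv_to_html and the header parameter are assumptions made for the example. It treats the first line as column headings and escapes characters that are special in HTML.

```python
import html


def tsv_to_html(lines, header=True):
    """Return an HTML table (as a string) for an iterable of TSV lines."""
    out = ["<table>"]
    for i, line in enumerate(lines):
        cells = line.rstrip("\n").split("\t")
        # Assumption: the first line holds column headings when header is True.
        tag = "th" if header and i == 0 else "td"
        row = "".join(f"<{tag}>{html.escape(c)}</{tag}>" for c in cells)
        out.append(f"<tr>{row}</tr>")
    out.append("</table>")
    return "\n".join(out)


if __name__ == "__main__":
    sample = ["name\tqty\n", "a<b\t2\n"]
    print(tsv_to_html(sample))
```

Passing an open file or sys.stdin instead of the sample list would turn this into a filter whose output could be piped to lynx as shown above.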

columns.awk
columns.py
columns.rb
csv2tsv.c
csv2tsv.rb
csv2tsv2.c
db2tsv.awk
fillHTMLfromTSV.awk
fs.awk
qTable.awk
tsv2html
tsv2html.awk
tsv2html.pl
tsv2html.py
tsv2html.rb
tsv2html.sed
tsv2html3.awk
tsv2htmlplus.awk
unquote.awk
rdb2html.awk
tsv2rdb.awk
tsv2txt.awk
txt2tsv.awk

The Perl program, cvs2tsv.pl, doesn't work, as can be seen by testing it with test.csv. Compare with:
./csv2tsv <test.csv | unquote.awk -F \\t OFS=\\t | cat -vt
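
Another way to cross-check a converter is to compare its output with Python's standard csv module. The sketch below is not a port of csv2tsv.c; the function name csv_to_tsv is made up, and replacing tabs and line breaks inside a field with spaces is an assumption chosen so that each record stays on one TSV line.

```python
import csv
import io


def csv_to_tsv(text: str) -> str:
    """Convert CSV text to TSV text, one output line per CSV record."""
    out = []
    # newline="" leaves line endings alone so the csv module can handle them.
    for record in csv.reader(io.StringIO(text, newline="")):
        # Assumption: tabs and line breaks inside a field become spaces.
        cleaned = [
            f.replace("\t", " ").replace("\r", " ").replace("\n", " ")
            for f in record
        ]
        out.append("\t".join(cleaned))
    return "\n".join(out) + ("\n" if out else "")


if __name__ == "__main__":
    print(csv_to_tsv('a,"b,c","say ""hi"""\n1,2,3\n'), end="")
```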

For other text formats see ESR's Art of Unix Programming.

Metadata

Simple tables contain just data in rows and columns. Metadata can be introduced in several ways. One could consider the HTML tags to be metadata, but the tr and td tags alone are not really metadata. Attribute values in HTML tags could contain metadata. The simplest and most common bit of metadata is a row of column headings in the first line. Some utilities like cut, paste, and even join still work with such files. Headings break sort, but try head -1 file.tsv; sed '1d' file.tsv | sort -t"^I" .... Many of the above scripts are designed to allow or even expect such headings. The th HTML tag can be used for this. Other metadata is sometimes introduced with a #, so that the line is interpreted as a comment and ignored by shell scripts, Perl, Ruby, and other languages. The above scripts do not handle this. RFC 822 headers or #%key=value lines can be used to store a hash of metadata. The above scripts do not handle this either. See also Relation ASCII.
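
A script that did handle these conventions might look something like the following Python sketch. It is hypothetical and not part of the collection: the names split_table and sort_by are invented, and it assumes that metadata and comment lines start with # (so no data row may begin with #) and that the first remaining line holds the column headings.

```python
def split_table(lines):
    """Split TSV lines into (metadata dict, heading fields, data rows)."""
    meta, header, rows = {}, None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#"):
            # "#%key=value" carries metadata; other "#" lines are comments.
            if line.startswith("#%") and "=" in line:
                key, value = line[2:].split("=", 1)
                meta[key.strip()] = value.strip()
            continue
        fields = line.split("\t")
        if header is None:
            header = fields  # first non-comment line: column headings
        else:
            rows.append(fields)
    return meta, header, rows


def sort_by(header, rows, column):
    """Sort data rows by a named column; the heading line is not sorted."""
    i = header.index(column)
    return sorted(rows, key=lambda r: r[i])


if __name__ == "__main__":
    sample = [
        "#%source=inventory\n",
        "# a plain comment\n",
        "name\tqty\n",
        "pear\t3\n",
        "apple\t7\n",
    ]
    meta, header, rows = split_table(sample)
    print(meta)
    print("\t".join(header))
    for row in sort_by(header, rows, "name"):
        print("\t".join(row))
```

Keeping the headings apart from the data rows has the same effect as the head and sed trick above, and the sort compares strings, as plain sort does without -n.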

Eric@BlossomAssociates.Net 2005-12-02