3:4

csv-reading: Comma-Separated Value (CSV) Parsing

Neil Van Dyke

1 Introduction

The csv-reading package for Racket provides utilities for reading various kinds of what are commonly known as “comma-separated value” (CSV) files. Since there is no standard CSV format, this library permits CSV readers to be constructed from a specification of the peculiarities of a given variant. A default reader handles the majority of formats.
One of the main uses of this library is to import data from old crusty legacy applications into Scheme for data conversion and other processing. To that end, this library includes various conveniences for iterating over parsed CSV rows, and for converting CSV input to SXML format.

2 Reader Specs

CSV readers are constructed using reader specs, which are sets of attribute-value pairs, represented in Scheme as association lists keyed on symbols. Each attribute has a default value if not specified otherwise. The attributes are:
  • newline-type Symbol representing the newline, or record-terminator, convention. The convention can be a fixed character sequence ('lf, 'crlf, or 'cr, corresponding to combinations of line-feed and carriage-return), any string of one or more line-feed and carriage-return characters ('lax), or adaptive ('adapt). 'adapt attempts to detect the newline convention at the start of the input and assume that convention for the remainder of the input. Default: 'lax

  • separator-chars Non-null list of characters that serve as field separators. Normally, this will be a list of one character. Default: '(#\,) (list of the comma character)

  • quote-char Character that should be treated as the quoted field delimiter character, or #f if fields cannot be quoted. Note that there can be only one quote character. Default: #\" (double-quote)

  • quote-doubling-escapes? Boolean for whether or not a sequence of two quote-char characters within a quoted field constitutes an escape sequence for including a single quote-char within the string. Default: #t

  • comment-chars List of characters, possibly null, which comment out the entire line of input when they appear as the first character in a line. Default: '() (null list)

  • whitespace-chars List of characters, possibly null, that are considered whitespace constituents for purposes of the strip-leading-whitespace? and strip-trailing-whitespace? attributes described below. Default: '(#\space) (list of the space character)

  • strip-leading-whitespace? Boolean for whether or not leading whitespace in fields should be stripped. Note that whitespace within a quoted field is never stripped. Default: #f

  • strip-trailing-whitespace? Boolean for whether or not trailing whitespace in fields should be stripped. Note that whitespace within a quoted field is never stripped. Default: #f

  • newlines-in-quotes? Boolean for whether or not newline sequences are permitted within quoted fields. If true, then the newline characters are included as part of the field value; if false, then the newline sequence is treated as a premature record termination. Default: #t
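
As a rough illustration of how these attributes combine, the sketch below builds a spec for a hypothetical semicolon-separated variant in which lines starting with # are comments and fields may be padded with spaces. The name semicolon-spec and the sample text are invented for the example; attributes that are left out keep the defaults listed above.

```racket
#lang racket/base
(require csv-reading)

;; Hypothetical spec: semicolon-separated fields, '#' comment lines,
;; and whitespace stripped on both sides of unquoted fields.
(define semicolon-spec
  '((separator-chars            #\;)
    (comment-chars              #\#)
    (strip-leading-whitespace?  . #t)
    (strip-trailing-whitespace? . #t)))

;; Invented sample input; the first line starts with a comment character.
(define sample "# name; qty\npears ; 3\nplums ; 12\n")

;; Parse the whole sample into a list of rows using the spec.
(define rows
  (csv->list (make-csv-reader sample semicolon-spec)))
```

Note that (separator-chars #\;) is the same datum as (separator-chars . (#\;)), so the attribute value is a list of one character, as the attribute description requires.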

3 Making Reader Makers

CSV readers are procedures that are constructed dynamically to close over a particular CSV input and yield a parsed row value each time the procedure is applied. For efficiency reasons, the reader procedures are themselves constructed by another procedure, make-csv-reader-maker, for particular CSV reader specs.

procedure

(make-csv-reader-maker reader-spec)
  (-> (or/c input-port? string?)
      (-> (listof string?)))

  reader-spec : csv-reader-spec?
Constructs a CSV reader constructor procedure from the reader-spec, with unspecified attributes having their default values.
For example, given the input file "fruits.csv" with the content:

apples  |  2 |  0.42
bananas | 20 | 13.69

a reader for the file’s apparent format can be constructed like:
(define make-food-csv-reader
  (make-csv-reader-maker
   '((separator-chars            #\|)
     (strip-leading-whitespace?  . #t)
     (strip-trailing-whitespace? . #t))))
The resulting make-food-csv-reader procedure accepts one argument, which is either an input port from which to read, or a string from which to read. Our example input file can then be read by opening an input port on the file and using our new procedure to construct a reader on it:
(define next-row
  (make-food-csv-reader (open-input-file "fruits.csv")))
This reader, next-row, can then be called repeatedly to yield a parsed representation of each subsequent row. The parsed format is a list of strings, one string for each column. The null list is yielded to indicate that all rows have already been yielded.
> (next-row)
  ("apples" "2" "0.42")
> (next-row)
  ("bananas" "20" "13.69")
> (next-row)
  ()
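
Because the reader maker is built once per spec, it can be reused for any number of inputs in the same format. The following sketch shows that pattern, with strings standing in for files. The helper read-all-rows and the vegetable data are hypothetical and not part of the library.

```racket
#lang racket/base
(require csv-reading)

;; Build the reader maker once for the pipe-separated format.
(define make-food-csv-reader
  (make-csv-reader-maker
   '((separator-chars            #\|)
     (strip-leading-whitespace?  . #t)
     (strip-trailing-whitespace? . #t))))

;; Hypothetical helper: call the reader until it yields the null list,
;; collecting the parsed rows in input order.
(define (read-all-rows next-row)
  (let loop ((rows '()))
    (let ((row (next-row)))
      (if (null? row)
          (reverse rows)
          (loop (cons row rows))))))

;; Reuse the same maker for two separate inputs.
(define fruits
  (read-all-rows
   (make-food-csv-reader "apples  |  2 |  0.42\nbananas | 20 | 13.69\n")))

(define vegetables
  (read-all-rows
   (make-food-csv-reader "leeks | 4 | 1.10\n")))
```

In ordinary use the csv->list procedure described later does the same collecting; the explicit loop is only meant to show the null-list end-of-input convention.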

4 Making Readers

In addition to being constructed from the result of make-csv-reader-maker, CSV readers can also be constructed using make-csv-reader.

procedure

(make-csv-reader in [reader-spec])  (-> (listof string?))

  in : (or/c input-port? string?)
  reader-spec : csv-reader-spec? = '()
Constructs a CSV reader on the input in, which is an input port or a string. If reader-spec is given, and is not the null list, then a “one-shot” reader constructor is constructed with that spec and used. If reader-spec is not given, or is the null list, then the default CSV reader constructor is used. For example, the reader from the make-csv-reader-maker example could alternatively have been constructed like:
(define next-row
  (make-csv-reader
   (open-input-file "fruits.csv")
   '((separator-chars            #\|)
     (strip-leading-whitespace?  . #t)
     (strip-trailing-whitespace? . #t))))
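
When no reader-spec is passed, the default attribute values from the Reader Specs section apply. The sketch below reads an invented two-line string with the default reader, including a quoted field that contains a comma and a doubled quote character.

```racket
#lang racket/base
(require csv-reading)

;; No reader-spec is passed, so the default reader constructor is used.
;; The second field of the second line is quoted, contains a comma, and
;; uses doubled quote characters to embed literal quotes.
(define next-row
  (make-csv-reader
   "name,remark\nAda,\"She said \"\"hi\"\", then left\"\n"))

(define header       (next-row))
(define first-record (next-row))
(define end-marker   (next-row)) ; null list once all rows are consumed
```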

5 High-Level Conveniences

Several convenience procedures are provided for iterating over the CSV rows and for converting the CSV to a list.
To the dismay of some Scheme purists, each of these procedures accepts a reader-or-in argument, which can be a CSV reader, an input port, or a string. If not a CSV reader, then the default reader constructor is used. For example, all three of the following are equivalent:
(csv->list                                     string)
(csv->list (make-csv-reader                    string))
(csv->list (make-csv-reader (open-input-string string)))

procedure

(csv-for-each proc reader-or-in)  any

  proc : (-> (listof string?) any)
  reader-or-in : 
(or/c (-> (listof string?))
      input-port?
      string?)
Similar to Racket’s for-each, applies proc, a procedure of one argument, to each parsed CSV row in series. reader-or-in is the CSV reader, input port, or string. The return value is undefined.

procedure

(csv-map proc reader-or-in)  any/c

  proc : (-> (listof string?) any/c)
  reader-or-in : 
(or/c (-> (listof string?))
      input-port?
      string?)
Similar to Racket’s map, applies proc, a procedure of one argument, to each parsed CSV row in series, and yields a list of the values of each application of proc, in order. reader-or-in is the CSV reader, input port, or string.

procedure

(csv->list reader-or-in)  (listof (listof string?))

  reader-or-in : 
(or/c (-> (listof string?))
      input-port?
      string?)
Yields a list of CSV row lists from input reader-or-in, which is a CSV reader, input port, or string.
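
A minimal sketch of the three procedures on the same invented input follows. It passes a string, an input port, and an explicit reader in turn to show the three kinds of reader-or-in argument; the names inventory, quantities, row-count, and all-rows are made up for the example.

```racket
#lang racket/base
(require csv-reading)

;; Invented sample data in the default comma-separated format.
(define inventory "apples,2,0.42\nbananas,20,13.69\n")

;; csv-map with a string: turn each row into a (name . quantity) pair.
(define quantities
  (csv-map (lambda (row)
             (cons (car row) (string->number (cadr row))))
           inventory))

;; csv-for-each with an input port: called only for its side effect,
;; here counting the rows.
(define row-count 0)
(csv-for-each (lambda (row) (set! row-count (+ row-count 1)))
              (open-input-string inventory))

;; csv->list with an explicitly constructed reader.
(define all-rows
  (csv->list (make-csv-reader inventory)))
```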

6 Converting CSV to SXML

The csv->sxml procedure can be used to convert CSV to SXML format, for processing with various XML tools.

procedure

(csv->sxml reader-or-in    
  [row-element    
  col-elements])  sxml/xexp
  reader-or-in : 
(or/c (-> (listof string?))
      input-port?
      string?)
  row-element : symbol? = 'row
  col-elements : (listof symbol?) = '()
Reads CSV from input reader-or-in (which is a CSV reader, input port, or string), and yields an SXML/xexp representation. If given, row-element is a symbol for the XML row element. If row-element is not given, the default is the symbol row. If given, col-elements is a list of symbols for the XML column elements. If col-elements is not given, or a row has more columns than there are symbols, the remaining column element symbols are of the format “col-n”, where n is the column number (the first column being number 0, not 1).
For example, given a CSV-format file "friends.csv" that has the contents:

Binoche,Ste. Brune,33-1-2-3
Posey,Main St.,555-5309
Ryder,Cellblock 9,

with elements not given, the result is:

> (csv->sxml (open-input-file "friends.csv"))

  (*TOP* (row (col-0 "Binoche") (col-1 "Ste. Brune")  (col-2 "33-1-2-3"))
         (row (col-0 "Posey")   (col-1 "Main St.")    (col-2 "555-5309"))
         (row (col-0 "Ryder")   (col-1 "Cellblock 9") (col-2 "")))
With elements given, the result is like:
> (csv->sxml (open-input-file "friends.csv")
             'friend
             '(name address phone))
  (*TOP* (friend (name    "Binoche")
                 (address "Ste. Brune")
                 (phone   "33-1-2-3"))
         (friend (name    "Posey")
                 (address "Main St.")
                 (phone   "555-5309"))
         (friend (name    "Ryder")
                 (address "Cellblock 9")
                 (phone   "")))
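
The case where fewer column symbols are given than a row has columns can be tried with a sketch like the one below. It passes a one-row string instead of a file and names only the first column; going by the description above, the remaining columns should receive generated “col-n” symbols. The name doc is invented for the example.

```racket
#lang racket/base
(require csv-reading)

;; Only the first column is named, so the second and third columns are
;; expected to fall back to generated col-n element symbols.
(define doc
  (csv->sxml "Posey,Main St.,555-5309\n"
             'friend
             '(name)))

;; The first child of *TOP* is the single row element.
(define first-row (cadr doc))
```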

7 History

8 Legal

Copyright 2004, 2005, 2008–2012, 2016 Neil Van Dyke. This program is Free Software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose. See http://www.gnu.org/licenses/ for details. For other licenses and consulting, please contact the author.