CSV File Format

CSV Classes: Supported Format

The CSV classes closely follow the format defined in RFC 4180, "Common Format and MIME Type for Comma-Separated Values (CSV) Files" (http://tools.ietf.org/html/rfc4180).

Definition of the CSV Format

The following excerpt from RFC 4180 describes the format; the ways in which the CSV classes vary from it are listed in the next section:

While there are various specifications and implementations for the CSV format, there is no formal specification in existence, which allows for a wide variety of interpretations of CSV files. This section documents the format that seems to be followed by most implementations:

  1. Each record is located on a separate line, delimited by a line break (CRLF). For example:
    aaa,bbb,ccc CRLF
    zzz,yyy,xxx CRLF
    
  2. The last record in the file may or may not have an ending line break. For example:
    aaa,bbb,ccc CRLF
    zzz,yyy,xxx
    
  3. There may be an optional header line appearing as the first line of the file with the same format as normal record lines. This header will contain names corresponding to the fields in the file and should contain the same number of fields as the records in the rest of the file (the presence or absence of the header line should be indicated via the optional "header" parameter of this MIME type). For example:
    field_name,field_name,field_name CRLF
    aaa,bbb,ccc CRLF
    zzz,yyy,xxx CRLF
    
  4. Within the header and each record, there may be one or more fields, separated by commas. Each line should contain the same number of fields throughout the file. Spaces are considered part of a field and should not be ignored. The last field in the record must not be followed by a comma. For example:
    aaa,bbb,ccc
    
  5. Each field may or may not be enclosed in double quotes (however some programs, such as Microsoft Excel, do not use double quotes at all). If fields are not enclosed with double quotes, then double quotes may not appear inside the fields. For example:
    "aaa","bbb","ccc" CRLF
    zzz,yyy,xxx
    
  6. Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes. For example:
    "aaa","b CRLF
    bb","ccc" CRLF
    zzz,yyy,xxx
    
  7. If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote. For example:
    "aaa","b""bb","ccc"
    
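The quoting rules in points 5 to 7 can be seen in action with a short example. The sketch below uses Python's standard csv module, not the CSV classes documented here, because its default "excel" dialect follows the same rules: comma separators, CRLF terminators, quoting only where needed, and doubled double quotes. The sample rows are made up for illustration.

```python
import csv
import io

rows = [
    ["aaa", "bbb", "ccc"],        # plain fields are written unquoted
    ["aaa", 'b"bb', "ccc"],       # embedded double quote (point 7)
    ["aaa", "b\r\nbb", "c,cc"],   # embedded line break and comma (point 6)
]

# newline="" keeps the CRLF terminators exactly as the writer emits them.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)       # default dialect: comma, double quote, CRLF
writer.writerows(rows)
encoded = buffer.getvalue()

# Reading the text back recovers the original fields, including the
# line break inside the quoted field.
decoded = list(csv.reader(io.StringIO(encoded, newline="")))
```

Here `encoded` holds three records, each terminated by CRLF, and only the fields that contain a double quote, a line break or a comma are enclosed in double quotes.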

Variations from the RFC Format

The CsvReader and CsvWriter classes differ from the numbered points above in the following ways:

  1. When reading, records (lines) may also be terminated simply by CR or LF. The CsvWriter uses CRLF to terminate records.
  2. The second point suggests that a blank line at the end (i.e. if a CRLF is used to terminate the last line) will not count as a record. The CsvReader class treats any blank line as an empty record – including one trailing the “last” record.
  3. The header line is not differentiated by the CsvReader but will be correctly read. It is up to the client application to choose to interpret the header line if desired.
  4. The CsvReader does not insist on each record containing the same number of fields; it is at the discretion of the client application to apply such a constraint in whatever way it sees fit (e.g. ignore, raise an exception or treat missing elements as nulls). Note that a trailing comma is treated as preceding an empty field (i.e. “”).
  5. When reading, fields starting with a double quote will be interpreted as a double quoted field. Special handling is provided on writing to detect if a field should be quoted based on its contents – the QuoteLimit variable determines the maximum length of field that will be checked and anything longer is automatically quoted.
  6. Supported both when reading and when writing.
  7. Supported both when reading and when writing.
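
The reading and writing rules in the list above can be illustrated with a short Python sketch. It is not the CsvReader or CsvWriter implementation. The function names `parse_csv` and `format_field`, the representation of an empty record as an empty list, the default `quote_limit` of 64 and the set of characters that trigger quoting (taken from point 6 of the RFC excerpt) are all assumptions made for illustration.

```python
def parse_csv(text):
    """Split CSV text into records, each a list of field strings.

    Hypothetical sketch: CR, LF and CRLF all end a record, a blank line
    (including the one after a final terminator) is an empty record, and
    records may have different numbers of fields.
    """
    records = []
    record = []        # completed fields of the current record
    chars = []         # characters of the field being built
    started = False    # current field has seen a character or opening quote
    in_quotes = False
    i = 0
    while i < len(text):
        c = text[i]
        if in_quotes:
            if c != '"':
                chars.append(c)              # commas and line breaks are literal
            elif text[i + 1:i + 2] == '"':
                chars.append('"')            # doubled quote is one literal quote
                i += 1
            else:
                in_quotes = False            # closing quote
        elif c == '"' and not started:
            in_quotes = started = True       # field starting with a quote is quoted
        elif c == ",":
            record.append("".join(chars))
            chars, started = [], False
        elif c in "\r\n":
            if c == "\r" and text[i + 1:i + 2] == "\n":
                i += 1                       # CRLF counts as a single terminator
            if record or started:
                record.append("".join(chars))  # also covers a trailing comma
            records.append(record)
            record, chars, started = [], [], False
        else:
            chars.append(c)
            started = True
        i += 1
    # Whatever follows the last terminator is a record too, even if blank.
    if record or started:
        record.append("".join(chars))
    records.append(record)
    return records


def format_field(value, quote_limit=64):
    """Quote a field if it is longer than quote_limit or needs quoting.

    Only fields up to quote_limit characters are inspected; longer ones
    are quoted without being checked (assumed reading of QuoteLimit).
    """
    if len(value) > quote_limit or any(ch in value for ch in ',"\r\n'):
        return '"' + value.replace('"', '""') + '"'
    return value


def format_record(fields, quote_limit=64):
    """Join formatted fields with commas and terminate with CRLF."""
    return ",".join(format_field(f, quote_limit) for f in fields) + "\r\n"
```

Under these assumptions, `parse_csv('aaa,bbb\rccc\nddd,\r\n')` returns `[['aaa', 'bbb'], ['ccc'], ['ddd', ''], []]`: the three terminator styles are all accepted, the trailing comma yields an empty field, and the text after the final CRLF becomes an empty record. A client that treats the first record as a header would simply take `records[0]` as the field names.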