1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332
|
-------------------------------------------------------------------
CSV file format
-------------------------------------------------------------------
The CSV format is supposedly one of the most 'standard' interchange
formats between data based programs. Almost all data based applications
seem to be able to export CSV formatted data, and almost all have
a way to import it.
However, the format is anything BUT standard. There are quite a
few variations that are very important to understand.
-------------------------------------------------------------------
DATA ENCAPSULATION
-------------------------------------------------------------------
Data is either 'naked' (without encapsulating doublequotes), or quoted.
Quoted data is used to protect imbedded carriage returns, imbedded
commas, odd characters and of course, the quote character itself.
Quotes that are 'data' are doubled up.
-------------------------------------------------------------------
UNQUOTED ENCAPSULATION
-------------------------------------------------------------------
Data that does NOT contain newlines, carriage returns, comma's or
quotes (or ASCII data below 0x20 or above 0x7f) stands on its
own:
data <1234> followed by <The Big Ol' Bear>
becomes:
1234,The Big Ol' Bear{CR}
-------------------------------------------------------------------
QUOTED ENCAPSULATION
-------------------------------------------------------------------
Data that contains double quotes, commas, returns or other odd
characters outside the 7-bit ASCII character set is quoted.
data <1234 Harrington St, Northwest> followed by <Suite 17 Stop 3>
becomes:
"1234 Harrington St, Northwest",Suite 17 Stop 3
Note that the second chunk of data does not have the protection
of the quotes: it doesn't need it, having no odd characters within.
<1234 West "Q" St.>
-------------------------------------------------------------------
QUOTED ENCAPSULATION OF DOUBLE QUOTE
-------------------------------------------------------------------
Data that contains double quotes is a special case, and oddly
interpreted by all nature of commercial programs. Consider the
data:
<1234 West "Q" St.>
It should become:
"1234 West ""Q"" St." <<< RIGHT WAY
Where each internal DATA quote is doubled up. However, some
programs (such as Paradox) don't do so nicely. They would represent
the data as:
"1234 West "Q" St.",.... <<< WRONG WAY
Where they feel that a doublequote with no comma following is part of
the data. This turns out to be rather bogus, as it can be painful under
various common circumstances (e.g.:
<1234 West "Q" St. (for "Quantum", or "Quality")>
becomes
"1234 West "Q" St. (for "Quantum", or "Quality")",
Which of course is entirely ambiguous as to the placement of data in
the field.
-------------------------------------------------------------------
QUOTED ENCAPSULATION OF CARRIAGE RETURNS
-------------------------------------------------------------------
Often times data has imbedded carriage returns. YOU MUST MAKE SURE
YOUR ROUTES HANDLE THIS CORRECTLY, as it is a VERY common case:
<Thomas Aquinus, Esq.{CR}
Prosecutor for the Pope{CR}
St. Luke's Dungeon{CR}
Somewhere in Italy>
becomes
"Thomas Aquinus, Esq.{CR}
Prosecutor for the Pope{CR}
St. Luke's Dungeon{CR}
Somewhere in Italy"
What makes this so blasted difficult is that at the 'outer level',
most CSV parsing code is trying to evaluate LINES, not fields. However,
the imbedded CR louses up the logic. OPTIMALLY, you should code
your CSV reader to have a 'field-by-field' read logic. If this is
impossible (or awkward, as in PERL), you may want to have conditional
logic looking for a trailing and unmatched doublequote.
-------------------------------------------------------------------
EMPTY DATA CONVENTIONS
-------------------------------------------------------------------
When data is 'empty' (either the empty string, or the numeric
value '0', or FALSE if boolean), you have the option of either
writing out:
""
or writing nothing at all. Therefore, it is common to see CSV files
that look like both of these examples:
"","Thos.","","Aquinus","Esq.","Pros.forPope","","Somewhere..."
,Thos.,,Aquinus,Esq,Pros.forPope,,Somewhere...
They're both equivalent, and do not violate the spirit of the CSV
standard.
-------------------------------------------------------------------
REMOVING PADDING
-------------------------------------------------------------------
Though not strictly required, it is conventional to remove empty
space padding before and after data. Likewise, it is conventional
to remove empty padding when READING data [on the off chance that
someone forgot to remove it when writing the data out!]
-------------------------------------------------------------------
THE FIELD-DESCRIPTION HEADER
-------------------------------------------------------------------
Surprisingly there is absolutly no part of the standard that either
calls for the first record to consist of the names of the fields
in the data following, or, that identifies the first record uniquely
as such should it exist!
However, it is both common AND EXPECTED that the first line should
consist of the names of the fields for the records that follow.
-------------------------------------------------------------------
VARIABLY FORMATTED CSV's
-------------------------------------------------------------------
Some programmers have abandonded the very precept of a regularly
formatted CSV file and have added their own twists. Groupwise
CSV export is one of these programs. They "enhance" the CSV
output format so that the first field of the record determines
the structure of the rest of the record. This is hopelessly
hard to cope with. DON'T DO IT IF YOU WANT TO REMAIN ALIVE.
This document was created in a fit of disgust with the complete
lack of CSV documentation found on the Internet.
Author: Robert J. Lynch
rlynch@lynchmarks.com
Copyright (c) 2001 |
Partager