Week 14: Notes

CSV

CSV (comma-separated values) is an extremely common format for storing and exchanging data. The basic concept of CSV is that a file represents a table of data. Each line in the file represents a data record, with fields separated by commas. The first line of the file may contain the names of the fields.

Here are the first few lines of a CSV file containing data about cities in the world:

city,city_ascii,lat,lng,country,iso2,iso3,admin_name,capital,population
"Tokyo","Tokyo",35.6897,139.6922,"Japan","JP","JPN","Tōkyō","primary",37977000
"Jakarta","Jakarta",-6.2146,106.8451,"Indonesia","ID","IDN","Jakarta","primary",34540000
...

Unfortunately the details of the CSV format may vary from one file to another. In some CSV files the delimiters between fields are semicolons, tabs, or single or multiple spaces. In some files all fields are quoted; in some (as in the example above) only some are quoted, or none are. RFC 4180 attempted to standardize the format, but not all programs follow its specification.

There are several ways to parse CSV data in a C# program:

Parse the data manually, by writing code that loops over characters. The file format is simple enough that this is usually not difficult, though edge cases such as commas, double quotes, or line breaks inside quoted fields can cause trouble.
Use regular expressions (discussed below) to help with parsing.
Use the TextFieldParser class, found in the Microsoft.VisualBasic.FileIO namespace in the standard library.
Use an external CSV library such as CsvHelper.
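As a rough sketch of the TextFieldParser option, the code below parses a few abbreviated rows based on the cities file above. The class name CsvReadDemo and the helper Parse are made up for this illustration, and the input is a shortened version of the sample data, not the full file.

```csharp
using System.Collections.Generic;
using System.IO;
using Microsoft.VisualBasic.FileIO;
using static System.Console;

static class CsvReadDemo {
    // Parse comma-separated text into a list of records (arrays of fields).
    // (Hypothetical helper, written for this example.)
    public static List<string[]> Parse(string text) {
        var rows = new List<string[]>();
        using var parser = new TextFieldParser(new StringReader(text));
        parser.TextFieldType = FieldType.Delimited;
        parser.SetDelimiters(",");
        parser.HasFieldsEnclosedInQuotes = true;   // strip surrounding quotes
        while (!parser.EndOfData) {
            var fields = parser.ReadFields();
            if (fields != null)
                rows.Add(fields);
        }
        return rows;
    }

    static void Main() {
        // Abbreviated rows based on the cities file above.
        string text =
            "city,lat,lng,country,population\n" +
            "\"Tokyo\",35.6897,139.6922,\"Japan\",37977000\n" +
            "\"Jakarta\",-6.2146,106.8451,\"Indonesia\",34540000\n";

        List<string[]> rows = Parse(text);
        for (int i = 1; i < rows.Count; ++i)    // skip the header row
            WriteLine($"{rows[i][0]}: {rows[i][4]}");
    }
}
```

Every field comes back as a string, so numeric columns such as population still need to be converted with int.Parse() or double.Parse().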

If your program has non-hierarchical data that can be described as a series of records, I would generally recommend storing it in a CSV file. This format is simple and is supported by a wide variety of tools. In particular, virtually all spreadsheet programs can read data from CSV files, and it can often be convenient to read your data into a spreadsheet to view it in tabular format.
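Writing CSV is mostly a matter of joining fields with commas. The sketch below shows one possible way to quote fields, following the convention described in RFC 4180: a field is quoted only when it contains a comma, a double quote, or a line break, and embedded double quotes are doubled. The names CsvWriteDemo, Escape and ToLine are invented for this example, and the sample values are made up.

```csharp
using System.Linq;
using static System.Console;

static class CsvWriteDemo {
    // Quote a field only if it contains a comma, a double quote, or a line break.
    // Embedded double quotes are doubled.
    public static string Escape(string field) {
        bool needsQuotes = field.IndexOfAny(new[] { ',', '"', '\r', '\n' }) >= 0;
        return needsQuotes ? "\"" + field.Replace("\"", "\"\"") + "\"" : field;
    }

    // Build one line of CSV from a series of fields.
    public static string ToLine(params string[] fields) =>
        string.Join(",", fields.Select(Escape));

    static void Main() {
        WriteLine(ToLine("Tokyo", "35.6897", "Japan"));   // Tokyo,35.6897,Japan
        WriteLine(ToLine("Smith, John", "27"));           // "Smith, John",27
    }
}
```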

JSON

JSON is another extremely common text-based format for storing and transferring data. Unlike CSV files, JSON can represent hierarchical data.

JSON is based on a subset of JavaScript's syntax. (It also looks a lot like Python.) Every JSON value is one of the following:

a number
a string (in double quotes, with backslash escapes allowed inside)
a boolean: true or false
an array, with comma-separated elements in square brackets
an object (i.e. a dictionary), in which every key must be a string
null

Here is JSON data representing a person:

{
  "firstName": "John",
  "lastName": "Smith",
  "isAlive": true,
  "age": 27,
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York",
    "state": "NY",
    "postalCode": "10021-3100"
  },
  "phoneNumbers": [
    {
      "type": "home",
      "number": "212 555-1234"
    },
    {
      "type": "office",
      "number": "646 555-4567"
    }
  ],
  "children": [],
  "spouse": null
}

In the .NET library, classes in the System.Text.Json namespace can work with JSON data. You can parse a JSON string into a JsonDocument object, and then inspect the resulting data using the methods of the JsonElement class.
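Here is a minimal sketch of that approach, using a shortened version of the person data above. The class name JsonDocumentDemo and the helper FirstPhoneNumber are hypothetical names chosen for this example.

```csharp
using System.Text.Json;
using static System.Console;

static class JsonDocumentDemo {
    // Return the "number" of the first entry in "phoneNumbers".
    // (Hypothetical helper, written for this example.)
    public static string FirstPhoneNumber(string json) {
        using JsonDocument doc = JsonDocument.Parse(json);
        JsonElement phones = doc.RootElement.GetProperty("phoneNumbers");
        return phones[0].GetProperty("number").GetString() ?? "";
    }

    static void Main() {
        string json = @"{
            ""firstName"": ""John"",
            ""age"": 27,
            ""phoneNumbers"": [
                { ""type"": ""home"", ""number"": ""212 555-1234"" },
                { ""type"": ""office"", ""number"": ""646 555-4567"" }
            ],
            ""spouse"": null
        }";

        using JsonDocument doc = JsonDocument.Parse(json);
        JsonElement root = doc.RootElement;

        WriteLine(root.GetProperty("firstName").GetString());   // writes "John"
        WriteLine(root.GetProperty("age").GetInt32());          // writes 27
        WriteLine(root.GetProperty("spouse").ValueKind == JsonValueKind.Null);  // writes True

        // An array element can be enumerated or indexed.
        foreach (JsonElement p in root.GetProperty("phoneNumbers").EnumerateArray())
            WriteLine(p.GetProperty("type").GetString());       // writes "home", then "office"

        WriteLine(FirstPhoneNumber(json));                      // writes "212 555-1234"
    }
}
```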

A higher-level way to work with JSON is to use the JsonSerializer class, which can convert C# objects to and from a JSON representation.
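The following sketch shows a round trip through JsonSerializer. The Person class is hypothetical and mirrors only a few fields of the JSON above. It assumes a camelCase naming policy so that the C# property FirstName corresponds to "firstName" in the JSON text.

```csharp
using System.Text.Json;
using static System.Console;

// Hypothetical class for illustration; it mirrors part of the JSON above.
class Person {
    public string FirstName { get; set; } = "";
    public string LastName { get; set; } = "";
    public int Age { get; set; }
}

static class JsonSerializerDemo {
    // Map PascalCase C# property names to camelCase JSON names.
    static readonly JsonSerializerOptions options =
        new JsonSerializerOptions { PropertyNamingPolicy = JsonNamingPolicy.CamelCase };

    public static string ToJson(Person p) => JsonSerializer.Serialize(p, options);

    public static Person? FromJson(string json) =>
        JsonSerializer.Deserialize<Person>(json, options);

    static void Main() {
        var john = new Person { FirstName = "John", LastName = "Smith", Age = 27 };
        string json = ToJson(john);
        WriteLine(json);    // a JSON object with firstName, lastName and age

        Person? copy = FromJson(json);
        if (copy != null)
            WriteLine($"{copy.FirstName} {copy.LastName}, {copy.Age}");   // writes "John Smith, 27"
    }
}
```

Deserialize() returns null if the JSON text is the literal null, which is why the result is checked before use.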

These classes and methods are described in our quick library reference.

XML

XML is another very common format for storing hierarchical data in text files. It looks similar to HTML. Here is a snippet of XML:

<breakfast_menu>
    <food>
        <name>Belgian Waffles</name>
        <price>$5.95</price>
        <description>
         Two of our famous Belgian Waffles with plenty of real maple syrup
        </description>
        <calories>650</calories>
    </food>
    <food>
        <name>Strawberry Belgian Waffles</name>
        <price>$7.95</price>
        <description>
Light Belgian waffles covered with strawberries and whipped cream
        </description>
        <calories>900</calories>
    </food>
    <food>
        <name>Berry-Berry Belgian Waffles</name>
        <price>$8.95</price>
        <description>
Light Belgian waffles covered with an assortment of fresh berries and whipped cream
        </description>
        <calories>900</calories>
    </food>
</breakfast_menu>

Many older applications store data in an XML-based format. In recent years there has been a trend toward using JSON rather than XML for representing data, probably because JSON is less verbose and maps more directly to common data types in programming languages.

All major languages including C# have libraries for working with XML data. In the .NET library, the XmlReader and XmlWriter classes in the System.Xml namespace can read and write XML. (I have not added these classes to our quick library reference.)
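As a small sketch, here is one way XmlReader could be used to pull the name and calorie count of each food out of a shortened version of the menu above. The names XmlReaderDemo and ReadMenu are invented for this example. The code assumes that every food has a name element that comes before its calories element.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;
using static System.Console;

static class XmlReaderDemo {
    // Collect (name, calories) pairs by walking the XML nodes in order.
    // (Hypothetical helper; assumes <name> precedes <calories> in each <food>.)
    public static List<(string name, int calories)> ReadMenu(string xml) {
        var items = new List<(string name, int calories)>();
        using XmlReader reader = XmlReader.Create(new StringReader(xml));

        string element = "";    // name of the element we are currently inside
        string name = "";
        while (reader.Read()) {
            switch (reader.NodeType) {
                case XmlNodeType.Element:
                    element = reader.Name;
                    break;
                case XmlNodeType.Text:
                    if (element == "name")
                        name = reader.Value.Trim();
                    else if (element == "calories")
                        items.Add((name, int.Parse(reader.Value)));
                    break;
                case XmlNodeType.EndElement:
                    element = "";
                    break;
            }
        }
        return items;
    }

    static void Main() {
        string xml = @"<breakfast_menu>
            <food>
                <name>Belgian Waffles</name>
                <price>$5.95</price>
                <calories>650</calories>
            </food>
            <food>
                <name>Strawberry Belgian Waffles</name>
                <price>$7.95</price>
                <calories>900</calories>
            </food>
        </breakfast_menu>";

        foreach (var (name, calories) in ReadMenu(xml))
            WriteLine($"{name}: {calories}");
    }
}
```

XmlReader is a forward-only reader: it visits each node once, so the code must remember which element it is inside as it goes.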

regular expressions

Regular expressions are a mini-language for matching text. The commonly used regular expression syntax originated in Unix systems in the 1970s, though regular expressions have a theoretical basis in the theory of finite automata, which was studied well before that. They are very commonly used in text processing, and all major languages including C# have regular expression libraries.

You may have already encountered regular expressions in your UNIX class, since they are supported by command-line utilities such as 'sed'.

The syntax of regular expressions is mostly the same across implementations, though some details vary. The most basic elements of regular expressions are as follows:

.: Match any character except a newline.
[…]: Match any of a set of characters. Characters may be listed individually; for example, [ace] will match the characters 'a', 'c', or 'e'. A set may also contain character ranges; for example, [0-9a-dA-D] will match any decimal digit, the lowercase characters from 'a' to 'd', or the uppercase characters from 'A' to 'D'.
[^…]: Match any character that is not in the given set.
\d: Match any decimal digit.
\s: Match any whitespace character.
\w: Match any word character, i.e. a letter, a digit, or an underscore.
*: Match 0 or more repetitions of the preceding expression.
+: Match 1 or more repetitions of the preceding expression.
?: Match 0 or 1 repetitions of the preceding expression.
expr1 | expr2: Match either expr1 or expr2.
(…): Indicates the start and end of a group.
^: Anchor to the start of the source string.
$: Anchor to the end of the source string.

In addition, any character that is not listed above will match itself.

For example:

\d\d\d-\d\d\d-\d\d\d\d will match a US telephone number such as 228-334-9292.
\d+(\.\d+)? will match a decimal number with an optional fractional part, such as 123 or 225.682. Notice that we must write \. to match only a decimal point, since . will normally match any character.
[a-z]+\s+[a-z]+\s+[a-z]+ will match three words, separated by whitespace.
\d\d-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)-\d\d\d\d will match a date such as 22-May-2021.

regular expressions in C#

Our quick library reference lists useful classes and methods in the System.Text.RegularExpressions namespace. The function Regex.IsMatch() returns true if a match for a pattern is found anywhere in an input string:

Regex.IsMatch("one two three 123 four five", @"\d\d\d")  // returns true

Notice that the string containing the regular expression above is preceded by an @ character. That indicates that it is a verbatim string literal, in which backslashes are normal characters. For example, \n in a verbatim literal indicates two characters: a backslash and an n. (In an ordinary C# string literal, this would indicate a newline character.) I generally recommend that you write all regular expressions in C# using verbatim literals. In ordinary string literals you would need to write each backslash twice (\\) to prevent it from being interpreted as an escape character, which would make your regular expressions harder to read.
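To make this concrete, the sketch below checks a few of the example patterns from the previous section using Regex.IsMatch(). The class name PatternDemo and its constants are made up for this illustration. The patterns here add the anchors ^ and $ so that the whole input string must match, which is an addition to the patterns as written above.

```csharp
using System.Text.RegularExpressions;
using static System.Console;

static class PatternDemo {
    // Example patterns from above, anchored so the entire string must match.
    public const string Phone = @"^\d\d\d-\d\d\d-\d\d\d\d$";
    public const string Number = @"^\d+(\.\d+)?$";
    public const string Date =
        @"^\d\d-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)-\d\d\d\d$";

    static void Main() {
        // A verbatim literal and an ordinary literal for the same pattern.
        WriteLine(@"\d+" == "\\d+");                             // writes True

        WriteLine(Regex.IsMatch("228-334-9292", Phone));         // writes True
        WriteLine(Regex.IsMatch("call 228-334-9292", Phone));    // writes False (anchored)
        WriteLine(Regex.IsMatch("225.682", Number));             // writes True
        WriteLine(Regex.IsMatch("225.", Number));                // writes False
        WriteLine(Regex.IsMatch("22-May-2021", Date));           // writes True
        WriteLine(Regex.IsMatch("22-Foo-2021", Date));           // writes False
    }
}
```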

The function Regex.Match() looks for a match in an input string, and returns a Match object. If the match succeeds, the returned object's Success property will be true, and the Value property will contain the text that matched:

Match m = Regex.Match("one two three 123 four five", @"\d\d\d");
if (m.Success)
    WriteLine(m.Value);

The function Regex.Matches() looks for all matches of a pattern in an input string, and returns a MatchCollection object that lists all the matches. A MatchCollection is iterable, and also allows you to access individual matches by index:

MatchCollection matches =
    Regex.Matches("dog cat pig horse bear", @"[^aeiou ][aeiou][^aeiou ]");
WriteLine(matches[0].Value);    // writes "dog"
WriteLine(matches[1].Value);    // writes "cat"
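Since a MatchCollection is iterable, a foreach loop is often the most convenient way to process every match. As a small sketch (MatchesDemo and SumNumbers are hypothetical names), this code adds up all the integers that appear in a string:

```csharp
using System.Text.RegularExpressions;
using static System.Console;

static class MatchesDemo {
    // Sum every run of decimal digits found in s.
    public static int SumNumbers(string s) {
        int sum = 0;
        foreach (Match m in Regex.Matches(s, @"\d+"))
            sum += int.Parse(m.Value);
        return sum;
    }

    static void Main() {
        WriteLine(SumNumbers("3 apples, 12 pears and 7 plums"));   // writes 22
    }
}
```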

If a regular expression contains one or more groups, then the text that matched each group is accessible through a Match object's Groups collection:

Match m = Regex.Match("name = fred, age = 30", @"name = (\w+), age = (\w+)");
if (m.Success) {
    WriteLine(m.Groups[1].Value);   // writes "fred"
    WriteLine(m.Groups[2].Value);   // writes "30"
}
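Groups are a convenient way to split a matched string into its parts. As an illustration, the sketch below adds groups to the date pattern from the earlier examples and extracts the day, month and year. The names GroupsDemo and ParseDate are invented for this example, and the function returns null when the input does not look like a date.

```csharp
using System.Text.RegularExpressions;
using static System.Console;

static class GroupsDemo {
    // Parse a date such as "22-May-2021", or return null if s has a different form.
    public static (int day, string month, int year)? ParseDate(string s) {
        Match m = Regex.Match(s,
            @"^(\d\d)-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)-(\d\d\d\d)$");
        if (!m.Success)
            return null;

        // Groups[0] is the entire match; numbered groups start at 1.
        return (int.Parse(m.Groups[1].Value),
                m.Groups[2].Value,
                int.Parse(m.Groups[3].Value));
    }

    static void Main() {
        var d = ParseDate("22-May-2021");
        if (d.HasValue)
            WriteLine($"{d.Value.month} {d.Value.day}, {d.Value.year}");   // writes "May 22, 2021"

        WriteLine(ParseDate("May 22") == null);   // writes True
    }
}
```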

records

Sometimes we write classes whose only purpose is to hold data. For example, consider a class that holds a geometric circle with a center and radius. We might define it like this:

class Circle {
    public readonly double x, y;
    public readonly double radius;

    public Circle(double x, double y, double radius) {
        (this.x, this.y) = (x, y);
        this.radius = radius;
    }

    public override bool Equals(object? o) =>
        o is Circle d && (x, y, radius) == (d.x, d.y, d.radius);

    public override int GetHashCode() => (x, y, radius).GetHashCode();

    public override String ToString() =>
        $"Circle: x = {x}, y = {y}, radius = {radius}";
}

(Above, the readonly modifier indicates that a field may be set only in its declaration or in a constructor, and is immutable after that.)

This is a lot of code to type to define a simple data class. Fortunately C# offers a more compact alternative, namely records. Instead of the definition above, we may define our circle class using a record:

record Circle(double x, double y, double radius);

That was a lot less typing! :) When you define a record, C# synthesizes a class that is similar to the class above. In particular:

The class is immutable: its members cannot be changed after an instance is initialized.
The class supports comparison by structural equality: two instances are equal if they have the same member values (as in the definition of Equals() above).
The class has a GetHashCode() method so that its instances can be used as a hash key.
The class has a ToString() method that builds a string including all member values.
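The short sketch below exercises each of these features with the Circle record. The class name RecordDemo is made up for this example.

```csharp
using System.Collections.Generic;
using static System.Console;

static class RecordDemo {
    static void Main() {
        var c = new Circle(1, 2, 3);
        var d = new Circle(1, 2, 3);

        WriteLine(c == d);                          // writes True: same member values
        WriteLine(object.ReferenceEquals(c, d));    // writes False: two distinct objects
        WriteLine(c);                               // writes "Circle { x = 1, y = 2, radius = 3 }"

        // Equal records have equal hash codes, so they work as hash keys.
        var set = new HashSet<Circle> { c };
        WriteLine(set.Contains(d));                 // writes True

        // c.radius = 5;    // would not compile: members cannot be reassigned
    }
}

record Circle(double x, double y, double radius);
```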

A record may have methods or other members such as properties and indexers. For example:

record Circle(double x, double y, double radius) {
    public double circumference() => 2 * PI * radius;
    
    public double area() => PI * radius * radius;
}
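As a further sketch, here is a variant of the record with a computed property and a method, along with some code that uses them. The members diameter and contains are hypothetical additions made for this illustration.

```csharp
using static System.Console;

record Circle(double x, double y, double radius) {
    // A computed property (hypothetical, for illustration).
    public double diameter => 2 * radius;

    // True if the point (px, py) lies inside or on the circle.
    public bool contains(double px, double py) =>
        (px - x) * (px - x) + (py - y) * (py - y) <= radius * radius;
}

static class CircleDemo {
    static void Main() {
        var c = new Circle(0, 0, 5);
        WriteLine(c.diameter);          // writes 10
        WriteLine(c.contains(3, 4));    // writes True: the point lies on the circle
        WriteLine(c.contains(4, 4));    // writes False
    }
}
```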