Week 12: Notes

Our topics this week included

object serialization
CSV
JSON
XML
regular expressions

object serialization

C#, like many other languages, provides the capability to serialize objects in memory into binary data. This can be useful for storing data in disk files or transferring it across a network.

You can add the [Serializable] attribute to a class to indicate that it is serializable. The BinaryFormatter class has methods that you can use to serialize and deserialize objects. It can serialize entire graphs of objects: in other words, if you serialize object A, which points to B, which points to C, then all of these objects will be included in the binary data. If you later read that data into memory and deserialize it, all of the objects will be reconstructed.

Binary serialization has some disadvantages. The binary format is specific to .NET, and by its nature is difficult to inspect and debug. Furthermore, the C# serialization mechanism has security vulnerabilities: if an attacker can craft binary data and convince you to deserialize it, they can cause you to execute arbitrary code. For this reason, Microsoft discourages using this mechanism. I haven't documented its API in our class library quick reference.

CSV

CSV (comma-separated values) is an extremely common format for storing and exchanging data. The basic concept of CSV is that a file represents a table of data. Each line in the file represents a data record, with fields separated by commas. The first line of the file may contain the names of the fields.

Here are the first few lines of a CSV file containing data about cities in the world:

city,city_ascii,lat,lng,country,iso2,iso3,admin_name,capital,population,id
"Tokyo","Tokyo",35.6897,139.6922,"Japan","JP","JPN","Tōkyō","primary",37977000,1392685764
"Jakarta","Jakarta",-6.2146,106.8451,"Indonesia","ID","IDN","Jakarta","primary",34540000,1360771077
...

Unfortunately the details of the CSV format may vary from one file to another. In some CSV files the delimiters between fields are semicolons, tabs, or single or multiple spaces. In some files all fields are quoted; in some (as in the example above) only some are quoted, or none are. RFC 4180 attempted to standardize the format, but not all programs follow its specification.

There are several ways to parse CSV data in a program:

Parse the data manually, by writing code that loops over characters. The file format is simple enough that this may not be so difficult, though edge cases such as commas or quotation marks inside quoted strings can sometimes cause trouble.
Use regular expressions (discussed below) to help with parsing.
Use a CSV library such as CsvHelper. (There is no CSV support in the standard .NET library.)
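
As an illustration of the first approach, here is a minimal sketch of a manual parser for a single CSV line. The function name ParseCsvLine and its delimiter parameter are invented for this example. It assumes RFC 4180-style quoting (a doubled quote inside a quoted field stands for one quote character) and does not handle quoted fields that span several lines.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper: split one CSV line into fields, honoring double-quoted fields.
// Inside quotes, a delimiter is literal and "" stands for one quote character.
static List<string> ParseCsvLine(string line, char delimiter = ',')
{
    var fields = new List<string>();
    var current = new StringBuilder();
    bool inQuotes = false;

    for (int i = 0; i < line.Length; i++)
    {
        char c = line[i];
        if (inQuotes)
        {
            if (c == '"' && i + 1 < line.Length && line[i + 1] == '"')
            {
                current.Append('"');   // doubled quote: one literal quote
                i++;
            }
            else if (c == '"')
                inQuotes = false;      // closing quote
            else
                current.Append(c);
        }
        else if (c == '"')
            inQuotes = true;           // opening quote
        else if (c == delimiter)
        {
            fields.Add(current.ToString());
            current.Clear();
        }
        else
            current.Append(c);
    }
    fields.Add(current.ToString());    // the last field has no trailing delimiter
    return fields;
}

List<string> row = ParseCsvLine("\"Tokyo\",\"Tokyo\",35.6897,139.6922,\"Japan\"");
Console.WriteLine(string.Join(" | ", row));   // Tokyo | Tokyo | 35.6897 | 139.6922 | Japan
```

Passing ';' or '\t' as the delimiter would cover some of the format variations mentioned above. For real-world files, a library is usually the safer choice.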

JSON

JSON is another extremely common text-based format for storing and transferring data. Unlike CSV files, JSON can represent hierarchical data.

JSON is based on a subset of JavaScript's syntax. (It also looks a lot like Python.) Every JSON value is one of the following:

a number
a string (in double quotes, with backslash escapes allowed inside)
a boolean: true or false
an array, with comma-separated elements in square brackets
an object (i.e. a dictionary), in which every key must be a string
null

Here is JSON data representing a person:

{
  "firstName": "John",
  "lastName": "Smith",
  "isAlive": true,
  "age": 27,
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York",
    "state": "NY",
    "postalCode": "10021-3100"
  },
  "phoneNumbers": [
    {
      "type": "home",
      "number": "212 555-1234"
    },
    {
      "type": "office",
      "number": "646 555-4567"
    }
  ],
  "children": [],
  "spouse": null
}

In the C# library, classes in the System.Text.Json namespace can work with JSON data. You can parse a JSON string into a JsonDocument object, and then inspect the resulting data using the methods of the JsonElement class.
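
For instance, the following sketch parses a shortened version of the person data above and reads a few values from it. The local variable names are arbitrary, and error handling (for example, a missing property) is left out.

```csharp
using System;
using System.Text.Json;

// A shortened version of the person data, written as a verbatim string
// (inside @"...", a double quote is written as "").
string json = @"{
  ""firstName"": ""John"",
  ""age"": 27,
  ""phoneNumbers"": [
    { ""type"": ""home"", ""number"": ""212 555-1234"" },
    { ""type"": ""office"", ""number"": ""646 555-4567"" }
  ],
  ""spouse"": null
}";

// JsonDocument is disposable, so we declare it with 'using'.
using JsonDocument doc = JsonDocument.Parse(json);
JsonElement root = doc.RootElement;

string firstName = root.GetProperty("firstName").GetString();
int age = root.GetProperty("age").GetInt32();
bool hasSpouse = root.GetProperty("spouse").ValueKind != JsonValueKind.Null;
Console.WriteLine($"{firstName}, age {age}, has spouse: {hasSpouse}");

// EnumerateArray() iterates over the elements of a JSON array.
foreach (JsonElement phone in root.GetProperty("phoneNumbers").EnumerateArray())
{
    string type = phone.GetProperty("type").GetString();
    string number = phone.GetProperty("number").GetString();
    Console.WriteLine($"{type}: {number}");
}
```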

A higher-level way to work with JSON is to use the JsonSerializer class, which can serialize C# objects to and from a JSON representation.
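
Here is a small sketch of a round trip through JsonSerializer. To keep it short it uses built-in collection types (the scores dictionary is made up for this example); the same two calls also work with your own classes that have public properties.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Example data: a dictionary mapping names to lists of scores.
var scores = new Dictionary<string, List<int>>
{
    ["alice"] = new List<int> { 90, 85 },
    ["bob"] = new List<int> { 72 }
};

// Serialize the object graph to a JSON string.
string json = JsonSerializer.Serialize(scores);
Console.WriteLine(json);

// Deserialize the string back into objects of the requested type.
var copy = JsonSerializer.Deserialize<Dictionary<string, List<int>>>(json);
Console.WriteLine(copy["alice"][1]);   // 85
```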

These classes and methods are described in our quick library reference.

XML

XML is another very common format for storing hierarchical data in text files. It looks similar to HTML. Here is a snippet of XML:

<breakfast_menu>
    <food>
        <name>Belgian Waffles</name>
        <price>$5.95</price>
        <description>
            Two of our famous Belgian Waffles with plenty of real maple syrup
        </description>
        <calories>650</calories>
    </food>
    <food>
        <name>Strawberry Belgian Waffles</name>
        <price>$7.95</price>
        <description>
            Light Belgian waffles covered with strawberries and whipped cream
        </description>
        <calories>900</calories>
    </food>
    <food>
        <name>Berry-Berry Belgian Waffles</name>
        <price>$8.95</price>
        <description>
            Light Belgian waffles covered with an assortment of fresh berries and whipped cream
        </description>
        <calories>900</calories>
    </food>
</breakfast_menu>

Many older applications store data in an XML-based format. In recent years there has been a trend toward using JSON rather than XML for representing data, probably because JSON is less verbose and maps more directly to common data types in programming languages.

All major languages including C# have libraries for working with XML data. In the .NET library, the XmlReader and XmlWriter classes in the System.Xml namespace can read and write XML. (I have not added these classes to our quick library reference.)
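
As a rough sketch of how XmlReader can be used, the following code collects the text of every name element from a shortened version of the menu above. The variable names are arbitrary.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

// A shortened version of the breakfast menu.
string xml = @"<breakfast_menu>
  <food><name>Belgian Waffles</name><calories>650</calories></food>
  <food><name>Strawberry Belgian Waffles</name><calories>900</calories></food>
</breakfast_menu>";

var names = new List<string>();
using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
{
    // ReadToFollowing() advances to the next <name> element,
    // or returns false when the input is exhausted.
    while (reader.ReadToFollowing("name"))
        names.Add(reader.ReadElementContentAsString());
}

Console.WriteLine(string.Join(", ", names));   // Belgian Waffles, Strawberry Belgian Waffles
```

One caveat: ReadElementContentAsString() leaves the reader on the node after the closing tag, and ReadToFollowing() always advances before it tests a node, so this loop would skip a name element that directly followed another one. That does not occur in this data.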

regular expressions

Regular expressions are a mini-language for matching text. The commonly used regular expression syntax originated in Unix systems in the 1970s, though regular expressions have a theoretical basis in finite automata, which were studied well before that. They are very commonly used in text processing, and all major languages including C# have regular expression libraries.

The syntax of regular expressions is mostly the same across implementations, though some details vary. The most basic elements of regular expressions are as follows:

.: Match any character except a newline.
[…]: Match any of a set of characters. Characters may be listed individually; for example, [ace] will match the characters 'a', 'c', or 'e'. A set may also contain character ranges; for example, [0-9a-dA-D] will match any decimal digit, the lowercase characters from 'a' to 'd', or the uppercase characters from 'A' to 'D'.
[^…]: Match any character that is not in the given set.
\d: Match any decimal digit.
\s: Match any whitespace character.
\w: Match any alphanumeric character or an underscore.
*: Match 0 or more repetitions of the preceding expression.
+: Match 1 or more repetitions of the preceding expression.
?: Match 0 or 1 repetitions of the preceding expression.
expr1 | expr2: Matches either expr1 or expr2.
(…): Indicates the start and end of a group.
^: Anchor to the start of the source string.
$: Anchor to the end of the source string.

In addition, any character that is not listed above will match itself.

For example:

\d\d\d-\d\d\d-\d\d\d\d will match a US telephone number such as 228-334-9292.
\d+(\.\d+)? will match a decimal number with an optional fractional part, such as 123 or 225.682. Notice that we must write \. to match only a decimal point, since . will normally match any character.
[a-z]+\s+[a-z]+\s+[a-z]+ will match three words, separated by whitespace.
\d\d-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)-\d\d\d\d will match a date such as 22-May-2021.
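
The patterns above can be tried out with a few lines of C#, using the Regex class from System.Text.RegularExpressions. This is only a sketch; the input strings are made up. It also shows the effect of the ^ and $ anchors: without them, a pattern may match just a part of the input.

```csharp
using System;
using System.Text.RegularExpressions;

// Patterns are written as verbatim strings (@"...") so that backslashes
// are passed to the regular expression engine unchanged.
bool phone = Regex.IsMatch("call 228-334-9292 today", @"\d\d\d-\d\d\d-\d\d\d\d");
Console.WriteLine(phone);        // True

bool words = Regex.IsMatch("one two   three", @"[a-z]+\s+[a-z]+\s+[a-z]+");
Console.WriteLine(words);        // True

// Unanchored: the "225" inside "225." is enough for a match.
bool unanchored = Regex.IsMatch("225.", @"\d+(\.\d+)?");
Console.WriteLine(unanchored);   // True

// Anchored: the entire string must be a decimal number.
bool anchored = Regex.IsMatch("225.", @"^\d+(\.\d+)?$");
Console.WriteLine(anchored);     // False

string date = Regex.Match("due on 22-May-2021",
    @"\d\d-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)-\d\d\d\d").Value;
Console.WriteLine(date);         // 22-May-2021
```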

regular expressions in C#

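Our quick library reference lists useful classes and methods in the System.Text.RegularExpressions namespace. The function Regex.IsMatch() returns true if a match for a pattern is found anywhere in an input string. It is best to write patterns as verbatim strings (@"..."), since in an ordinary C# string literal a sequence such as \d is an invalid escape:

Regex.IsMatch("one two three 123 four five", @"\d\d\d")  // returns true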

The function Regex.Match() looks for a match in an input string, and returns a Match object. If the match succeeds, the returned object's Success property will be true, and the Value property will contain the text that matched:

Match m = Regex.Match("one two three 123 four five", @"\d\d\d");
if (m.Success)
    WriteLine(m.Value);

The function Regex.Matches() looks for all matches of a pattern in an input string, and returns a MatchCollection object that lists all the matches. A MatchCollection is iterable, and also allows you to access individual matches by index:

MatchCollection matches =
    Regex.Matches("dog cat pig horse bear", "[^aeiou ][aeiou][^aeiou ]");
WriteLine(matches[0].Value);    // writes "dog"
WriteLine(matches[1].Value);    // writes "cat"
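
Since a MatchCollection is iterable, you can also loop over it with foreach. The sketch below collects every match of the same pattern into a list (the list name found is arbitrary). Note that "horse" contributes only its first three letters, and "bear" produces no match at all because its vowel is followed by another vowel.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

var found = new List<string>();

// Each element of the MatchCollection is a Match object.
foreach (Match m in Regex.Matches("dog cat pig horse bear", "[^aeiou ][aeiou][^aeiou ]"))
    found.Add(m.Value);

Console.WriteLine(string.Join(", ", found));   // dog, cat, pig, hor
```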