Our topics this week included:
object serialization
CSV
JSON
XML
regular expressions
C# (like many other languages) provides the capability to serialize objects in memory into binary data. This can be useful for storing data in disk files, or transferring it across a network.
You can add the [Serializable] attribute to a class to indicate that it is serializable. The BinaryFormatter class has methods that you can use to serialize and deserialize objects. It can serialize entire graphs of objects: in other words, if you serialize object A, which points to B, which points to C, then all of these objects will be included in the binary data. If you later read that data into memory and deserialize it, all of the objects will be reconstructed.
Binary serialization has some disadvantages. The binary format is specific to .NET, and by its nature is difficult to inspect and debug. Furthermore, the C# serialization mechanism has security vulnerabilities: if an attacker can craft binary data and convince you to deserialize it, they can cause you to execute arbitrary code. For this reason, Microsoft discourages using this mechanism. I haven't documented its API in our class library quick reference.
CSV (comma-separated values) is an extremely common format for storing and exchanging data. The basic concept of CSV is that a file represents a table of data. Each line in the file represents a data record, with fields separated by commas. The first line of the file may contain the names of the fields.
Here are the first few lines of a CSV file containing data about cities in the world:
city,city_ascii,lat,lng,country,iso2,iso3,admin_name,capital,population,id
"Tokyo","Tokyo",35.6897,139.6922,"Japan","JP","JPN","Tōkyō","primary",37977000,1392685764
"Jakarta","Jakarta",-6.2146,106.8451,"Indonesia","ID","IDN","Jakarta","primary",34540000,1360771077
...
Unfortunately the details of the CSV format may vary from one file to another. In some CSV files the delimiters between fields are semicolons, tabs, or single or multiple spaces. In some files all fields are quoted; in some (as in the example above) only some are quoted, or none are. RFC 4180 attempted to standardize the format, but not all programs follow its specification.
There are several ways to parse CSV data in a program:
Parse the data manually, by writing code that loops over characters. The file format is simple enough that this may not be so difficult, though edge cases such as commas or double quotes inside quoted fields can sometimes cause trouble.
Use regular expressions (discussed below) to help with parsing.
Use a CSV library such as CsvHelper. (There is no CSV support in the standard .NET library.)
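As a rough illustration of the first approach, here is a minimal sketch of a manual parser for a single CSV line. The class and method names (CsvSketch, SplitLine) are hypothetical, not part of any library. It assumes RFC 4180-style quoting, in which a doubled quote inside a quoted field stands for one literal quote, and it does not handle quoted fields that span several lines.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class CsvSketch {
    // Hypothetical helper: split one CSV line into its fields.
    // Assumes RFC 4180-style quoting; the delimiter is configurable because
    // some files use semicolons or tabs instead of commas.
    public static List<string> SplitLine(string line, char delimiter = ',') {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;

        for (int i = 0; i < line.Length; ++i) {
            char c = line[i];
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.Length && line[i + 1] == '"') {
                        current.Append('"');   // "" inside quotes is one literal quote
                        ++i;
                    } else {
                        inQuotes = false;      // closing quote
                    }
                } else {
                    current.Append(c);         // delimiters inside quotes are ordinary text
                }
            } else if (c == '"') {
                inQuotes = true;               // opening quote
            } else if (c == delimiter) {
                fields.Add(current.ToString());
                current.Clear();
            } else {
                current.Append(c);
            }
        }
        fields.Add(current.ToString());        // the last field has no trailing delimiter
        return fields;
    }
}
```

Calling CsvSketch.SplitLine on each line of the cities file above would give a list of 11 strings per record. Numeric fields such as the latitude would still need to be converted, for example with double.Parse().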
JSON is another extremely common text-based format for storing and transferring data. Unlike CSV files, JSON can represent hierarchical data.
JSON is based on a subset of JavaScript's syntax. (It also looks a lot like Python.) Every JSON value is one of the following:
a number
a string (in double quotes, with backslash escapes allowed inside)
a boolean: true or false
an array, with comma-separated elements in square brackets
an object (i.e. a dictionary), in which every key must be a string
null
Here is JSON data representing a person:
{
  "firstName": "John",
  "lastName": "Smith",
  "isAlive": true,
  "age": 27,
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York",
    "state": "NY",
    "postalCode": "10021-3100"
  },
  "phoneNumbers": [
    { "type": "home", "number": "212 555-1234" },
    { "type": "office", "number": "646 555-4567" }
  ],
  "children": [],
  "spouse": null
}
In the C# library, classes in the System.Text.Json namespace can work with JSON data. You can parse a JSON string into a JsonDocument object, and then inspect the resulting data using the methods of the JsonElement class.
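To make this concrete, here is a small sketch that inspects JSON shaped like the person example above. The class and method names (JsonDocumentSketch, GetAge, FindPhone) are hypothetical; the calls on JsonDocument and JsonElement are from System.Text.Json. It assumes that the expected keys are present, since GetProperty() throws an exception if a key is missing.

```csharp
using System;
using System.Text.Json;

static class JsonDocumentSketch {
    // Read the numeric "age" field of the top-level object.
    public static int GetAge(string json) {
        using JsonDocument doc = JsonDocument.Parse(json);
        return doc.RootElement.GetProperty("age").GetInt32();
    }

    // Return the phone number with the given type ("home", "office", ...),
    // or null if the "phoneNumbers" array has no such entry.
    public static string? FindPhone(string json, string type) {
        using JsonDocument doc = JsonDocument.Parse(json);
        JsonElement phones = doc.RootElement.GetProperty("phoneNumbers");
        foreach (JsonElement phone in phones.EnumerateArray())
            if (phone.GetProperty("type").GetString() == type)
                return phone.GetProperty("number").GetString();
        return null;
    }
}
```

A JsonDocument should be disposed when you are finished with it, which is why each method declares it with "using".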
A higher-level way to work with JSON is to use the JsonSerializer class, which can serialize C# objects to and from a JSON representation.
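The following sketch shows one way this might look. The Person class and the JsonSerializerSketch wrapper are hypothetical and exist only for illustration. By default JsonSerializer uses property names exactly as written in the class, so the sketch assumes we want camelCase keys such as "firstName" and sets a naming policy to get them.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical class for illustration; its public properties become JSON keys.
class Person {
    public string FirstName { get; set; } = "";
    public int Age { get; set; }
    public List<string> Children { get; set; } = new List<string>();
}

static class JsonSerializerSketch {
    // Map C# property names such as FirstName to JSON keys such as firstName.
    static readonly JsonSerializerOptions Options =
        new JsonSerializerOptions { PropertyNamingPolicy = JsonNamingPolicy.CamelCase };

    // Convert an object to a JSON string.
    public static string ToJson(Person p) => JsonSerializer.Serialize(p, Options);

    // Build an object from a JSON string; keys that match no property are ignored.
    public static Person? FromJson(string json) =>
        JsonSerializer.Deserialize<Person>(json, Options);
}
```

For a Person with FirstName "John", Age 27 and no children, ToJson() produces {"firstName":"John","age":27,"children":[]}.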
These classes and methods are described in our quick library reference.
XML is another very common format for storing hierarchical data in text files. It looks similar to HTML. Here is a snippet of XML:
<breakfast_menu>
  <food>
    <name>Belgian Waffles</name>
    <price>$5.95</price>
    <description>
      Two of our famous Belgian Waffles with plenty of real maple syrup
    </description>
    <calories>650</calories>
  </food>
  <food>
    <name>Strawberry Belgian Waffles</name>
    <price>$7.95</price>
    <description>
      Light Belgian waffles covered with strawberries and whipped cream
    </description>
    <calories>900</calories>
  </food>
  <food>
    <name>Berry-Berry Belgian Waffles</name>
    <price>$8.95</price>
    <description>
      Light Belgian waffles covered with an assortment of fresh berries and whipped cream
    </description>
    <calories>900</calories>
  </food>
</breakfast_menu>
Many older applications store data in an XML-based format. In recent years there has been a trend toward using JSON rather than XML for representing data, probably because JSON is less verbose and maps more directly to common data types in programming languages.
All major languages including C# have libraries for working with XML data. In the .NET library, the XmlReader and XmlWriter classes in the System.Xml namespace can read and write XML. (I have not added these classes to our quick library reference.)
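As a brief sketch of how XmlReader might be used, the following code collects the text of every name element in a document like the breakfast menu above. The XmlSketch class and FoodNames method are hypothetical names; the sketch assumes each name element contains only text.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

static class XmlSketch {
    // Return the text content of every <name> element in the given XML.
    public static List<string> FoodNames(string xml) {
        var names = new List<string>();
        using XmlReader reader = XmlReader.Create(new StringReader(xml));

        // XmlReader is a forward-only cursor over the nodes of the document.
        while (!reader.EOF) {
            if (reader.NodeType == XmlNodeType.Element && reader.Name == "name")
                // Reads the element's text and moves past its closing tag.
                names.Add(reader.ReadElementContentAsString());
            else
                reader.Read();   // advance to the next node
        }
        return names;
    }
}
```

XmlReader never loads the whole document into memory, which makes it a reasonable choice for large files, at the cost of somewhat more manual code.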
Regular expressions are a mini-language for matching text. The commonly used regular expression syntax originated in Unix systems in the 1970s, though regular expressions have a theoretical basis in finite automata, which were studied well before that. They are very commonly used in text processing, and all major languages including C# have regular expression libraries.
The syntax of regular expressions is mostly identical across implementations, though some details vary. The most basic elements of regular expressions are as follows:
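.    matches any character except a newline.
[…]    matches any one character in a set of characters. For example, [ace] will match the characters 'a', 'c', or 'e'. A set may also contain character ranges; for example, [0-9a-dA-D] will match any decimal digit, the lowercase characters from 'a' to 'd', or the uppercase characters from 'A' to 'D'.
[^…]    matches any one character that is not in the given set.
\d    matches any decimal digit.
\s    matches any whitespace character.
\w    matches any word character (a letter, digit, or underscore).
expr*    matches zero or more occurrences of expr.
expr+    matches one or more occurrences of expr.
expr?    matches zero or one occurrence of expr.
expr1|expr2    matches either expr1 or expr2.
(expr)    groups expr so that an operator such as ? or | applies to it as a whole.
^    matches at the start of the input string.
$    matches at the end of the input string.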
In addition, any character that is not listed above will match itself.
For example:
\d\d\d-\d\d\d-\d\d\d\d will match a US telephone number such as 228-334-9292.
\d+(\.\d+)? will match a decimal number with an optional fractional part, such as 123 or 225.682. Notice that we must write \. to match only a decimal point, since . will normally match any character.
[a-z]+\s+[a-z]+\s+[a-z]+ will match three words, separated by whitespace.
\d\d-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)-\d\d\d\d will match a date such as 22-May-2021.
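Our quick library reference lists useful classes and methods in the System.Text.RegularExpressions namespace. The function Regex.IsMatch() returns true if a match for a pattern is found anywhere in an input string. Notice that the pattern is written as a verbatim string (with a leading @), so that C# does not try to interpret \d as a string escape sequence:
Regex.IsMatch("one two three 123 four five", @"\d\d\d")   // returns true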
The function Regex.Match() looks for a match in an input string, and returns a Match object. If the match succeeds, the returned object's Success property will be true, and the Value property will contain the text that matched:
Match m = Regex.Match("one two three 123 four five", @"\d\d\d");
if (m.Success)
    WriteLine(m.Value);   // writes "123"
The function Regex.Matches() looks for all matches of a pattern in an input string, and returns a MatchCollection object that lists all the matches. A MatchCollection is iterable, and also allows you to access individual matches by index:
MatchCollection matches = Regex.Matches("dog cat pig horse bear", "[^aeiou ][aeiou][^aeiou ]");
WriteLine(matches[0].Value);   // writes "dog"
WriteLine(matches[1].Value);   // writes "cat"
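Since these functions look for a match anywhere in the input, a common follow-up need is to test whether an entire string has a given form. The sketch below does that for the date pattern from the examples above by anchoring it with ^ and $. The RegexSketch class and its methods are hypothetical names. The Groups property of Match, which is not discussed above, holds the text matched by each parenthesized group, numbered from 1.

```csharp
using System;
using System.Text.RegularExpressions;

static class RegexSketch {
    // The date pattern from the examples, anchored with ^ and $ so that the
    // whole string must be a date, and with parentheses around each part.
    const string DatePattern =
        @"^(\d\d)-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)-(\d\d\d\d)$";

    // True only if s consists of a date and nothing else.
    public static bool IsDate(string s) => Regex.IsMatch(s, DatePattern);

    // Return the month abbreviation of a date, or null if s is not a date.
    public static string? MonthOf(string s) {
        Match m = Regex.Match(s, DatePattern);
        return m.Success ? m.Groups[2].Value : null;   // group 2 is the month
    }
}
```

Here IsDate("22-May-2021") is true, while IsDate("on 22-May-2021") is false because of the leading text. Without the anchors, both strings would match.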