Cozis

The Scanner Pattern

Over the years, I have written quite a bit of string manipulation and parsing code in C. Unfortunately, C has no runtime checks that can catch weird behaviors like out-of-bounds array accesses. For this reason, I came up with some safe patterns to follow that allow me to write pretty robust code from the get-go. Of course, one can always make mistakes, but the chances of making them in the first place drop drastically. Not only that, but when one has standard ways of doing things and something breaks, it's quite easy to find the problem by finding where the pattern breaks. The umbrella term I use for these techniques is "The Scanner Pattern", as they all revolve around a "Scanner" object.

The Problem

As I was saying, the Scanner pattern is used to parse information in textual format, such as JSON or HTTP. We usually want textual and flexible formats as they are easier to work with for us humans, but the same can't be said for machines. That's why programs usually translate text into a more native binary representation.

Say we want to write a program that's capable of reading a list of names from a file to determine whether a given name is contained or not. The file uses a human-friendly format:

BEGIN
  Linus  Torvalds
  Bill   Gates
  Steve  Jobs
  Mark   Facebook
END

It starts and ends with the BEGIN and END keywords, and has a bunch of names in between. The format is insensitive to spaces, so you can have zero or more spaces before and after names and one or more between name and surname.

The result should be an array of structs, each containing a name:

1typedef struct {
2 char *ptr;
3 int len;
4} String;
5
6typedef struct {
7 String name;
8 String surname;
9} Person;
10
11#define MAX_PEOPLE 100
12
13#define S(X) (String) { (X), (int) sizeof(X)-1 }
14
15int main(void)
16{
17 String file_contents = S("BEGIN\n Linus Torvalds\n Bill Gates\n Steve Jobs\n Mark Facebook\nEND");
18
19 int num_people = 0;
20 Person people[MAX_PEOPLE];
21
22 // TODO: parse file_contents and fill up the people array
23
24 // TODO: check that the list contains a given name
25
26 return 0;
27}
28

The Scanner

The Scanner Pattern is called that way because it's based on a Scanner object:

1typedef struct {
2 char *src;
3 int len;
4 int cur;
5} Scanner;
6

The string given by the src pointer and len length is the input data. The cur field contains the next character that needs to be read and is advanced as the program reads the string and extracts information. The cursor starts at 0 and parsing is complete when it reaches the end and is equal to the length of the string.

1Scanner s = { file_contents.ptr, file_contents.len, 0 };
2

I usually use a very short name for it as it's referenced a lot.

Consuming Specific Sequences

BEGIN\n  Linus  Torvalds\n  Bill   Gates\n  Steve  Jobs\n  Mark   Facebook\nEND
^
cur

The first thing is that we need to check that the file starts with the sequence "BEGIN" followed by a line feed. If it doesn't, we fail the parsing. If it does, we advance the cursor.

1if (s.cur == s.len || s.src[s.cur] != 'B')
2 return -1; // Error
3s.cur++;
4

This code fails if the cursor doesn't point to the B character. This may be because there is a different character or the string is empty. At the end of this code, if it didn't error, the cursor will point to the second character. We can apply this code multiple times, once per character:

1if (s.cur == s.len || s.src[s.cur] != 'B')
2 return -1; // Error
3s.cur++;
4
5if (s.cur == s.len || s.src[s.cur] != 'E')
6 return -1; // Error
7s.cur++;
8
9if (s.cur == s.len || s.src[s.cur] != 'G')
10 return -1; // Error
11s.cur++;
12
13if (s.cur == s.len || s.src[s.cur] != 'I')
14 return -1; // Error
15s.cur++;
16
17if (s.cur == s.len || s.src[s.cur] != 'N')
18 return -1; // Error
19s.cur++;
20
21if (s.cur == s.len || s.src[s.cur] != '\n')
22 return -1; // Error
23s.cur++;
24

This works but is quite verbose. We can group all the checks in a single if statement and then increment the cursor when all of them have passed.

1if (s.cur+0 == s.len || s.src[s.cur+0] != 'B' ||
2 s.cur+1 == s.len || s.src[s.cur+1] != 'E' ||
3 s.cur+2 == s.len || s.src[s.cur+2] != 'G' ||
4 s.cur+3 == s.len || s.src[s.cur+3] != 'I' ||
5 s.cur+4 == s.len || s.src[s.cur+4] != 'N' ||
6 s.cur+5 == s.len || s.src[s.cur+5] != '\n')
7 return -1; // Error
8s.cur += 6;
9

Note that since the last bounds check s.cur+5 == s.len implies the previous ones, so we can drop them:

1if (s.cur+5 >= s.len
2 || s.src[s.cur+0] != 'B'
3 || s.src[s.cur+1] != 'E'
4 || s.src[s.cur+2] != 'G'
5 || s.src[s.cur+3] != 'I'
6 || s.src[s.cur+4] != 'N'
7 || s.src[s.cur+5] != '\n')
8 return -1; // Error
9s.cur += 6;
10

If we wanted to be nitpicky, doing bounds checks as 5 >= s.len - s.cur would reduce the probability of overflowing.

Of course this type of check is only possible when the string we are looking for is known at compile time. We can write a more generic function using a for loop:

1bool consume_str(Scanner *s, String x)
2{
3 if (x.len == 0)
4 return false;
5
6 if (x.len > s->len - s->cur)
7 return false;
8
9 for (int i = 0; i < x.len; i++)
10 if (s->src[s->cur+i] != x.ptr[i])
11 return false;
12
13 s->cur += x.len;
14 return true;
15}
16

This function returns false if the string wasn't found at the cursor's position, while it returns true and advances the cursor otherwise.

We can use this function for the first and last tokens:

1Scanner s = { file_contents.ptr, file_contents.len, 0 };
2
3if (!consume_str(&s, S("BEGIN\n")))
4 return -1; // Error
5
6// .. parse names ...
7
8if (!consume_str(&s, S("END")))
9 return -1; // Error
10

Consuming Optional Sequences

BEGIN\n  Linus  Torvalds\n  Bill   Gates\n  Steve  Jobs\n  Mark   Facebook\nEND
       ^
       cur

In our example, every name is preceded by two spaces, but we would like this file format to be insensitive to spaces, which means we should also handle any number of spaces or no spaces at all. If there is some space before the element, we need to consume it. If there is no space, the cursor needs to stay still.

To consume a single optional character, we can use this if statement.

1if (s.cur < s.len && s.src[s.cur] == ' ')
2 s.cur++;
3

If the cursor is at the end of the file or there is something other than a space, the cursor is left unchanged, just as we wanted. Since we want to consume an arbitrary number of spaces, we can turn this into a loop:

1while (s.cur < s.len && s.src[s.cur] == ' ')
2 s.cur++;
3

This loop is useful whenever we want to consume an optional sequence of characters. We only need to define an expression that returns true for that sequence alone:

1while (s.cur < s.len && is_sequence(s.src[s.cur]))
2 s.cur++;
3

Consuming Required Sequences

BEGIN\n  Linus  Torvalds\n  Bill   Gates\n  Steve  Jobs\n  Mark   Facebook\nEND
         ^
         cur

Similarly to spaces, we can see a name as a sequence of letters. Unlike spaces, the name must be present in the file. The way we enforce this is by making sure that there is at least one letter character at the cursor's position. The remaining part of the name can be considered an optional sequence of letters:

1if (s.cur == s.len || !is_alpha(s.src[s.cur]))
2 return -1; // Missing name
3s.cur++;
4
5while (s.cur < s.len && is_alpha(s.src[s.cur]))
6 s.cur++;
7

BEGIN\n  Linus  Torvalds\n  Bill   Gates\n  Steve  Jobs\n  Mark   Facebook\nEND
              ^
              cur

Extracting Substrings

We managed to identify the start and end position of the first name, but how do we extract that information? This is where our String type comes in handy.

If we relied on null-terminated strings, we would need to copy the name character by character while we are consuming it into a buffer in order to append a null byte after it. With our string type we don't need to do that. We can create substrings of larger strings by simply pointing inside them!

Say our cursor is pointing to the start of a token we are interested in:

BEGIN\n  Linus  Torvalds\n  Bill   Gates\n  Steve  Jobs\n  Mark   Facebook\nEND
         ^
         cur = off

We can save the first character's position in a second variable called off and then consume the token until the end:

BEGIN\n  Linus  Torvalds\n  Bill   Gates\n  Steve  Jobs\n  Mark   Facebook\nEND
         ^    ^ cur
         off

The pointer to the string "Linus" can be evaluated by adding off to the start of the source string, and its length as the new cursor position minus the old one.

1// ... consume stuff before the token we are interested in ...
2
3int off = s.cur; // Save the index of the first character of the token
4
5// ... consume the token ...
6
7String name = { s.src + off, s.cur - off };
8
9// ... continue parsing ...
10

This way we can effortlessly extract substrings by simply knowing where each token starts and ends!

Parsing the Entire Line

You may have noticed that we have everything we need now. We managed to parse the first name of the first item and the spaces preceding it. The name is then followed by more spaces, then a surname, and then more spaces. We just need to apply the same patterns again and again.

Let's define a couple of helper functions:

1bool is_space(char c)
2{
3 return c == ' ';
4}
5
6bool is_alpha(char c)
7{
8 return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
9}
10
11void consume_spaces(Scanner *s)
12{
13 while (s->cur < s->len && is_space(s->src[s->cur]))
14 s->cur++;
15}
16
17bool consume_word(Scanner *s)
18{
19 if (s->cur == s->len || !is_alpha(s->src[s->cur]))
20 return false;
21 s->cur++;
22
23 while (s->cur < s->len && is_alpha(s->src[s->cur]))
24 s->cur++;
25
26 return true;
27}
28

Parsing a line from the file becomes quite easy:

1consume_spaces(&s);
2
3int name_off = s.cur;
4
5if (!consume_word(&s))
6 return -1; // Error
7
8String name = { s.src + name_off, s.cur - name_off };
9
10consume_spaces(&s);
11
12int surname_off = s.cur;
13
14if (!consume_word(&s))
15 return -1;
16
17String surname = { s.src + surname_off, s.cur - surname_off };
18
19consume_spaces(&s);
20
21if (s.cur == s.len || s.src[s.cur] != '\n')
22 return -1;
23s.cur++;
24
25if (num_people == MAX_PEOPLE)
26 return -1;
27people[num_people++] = (Person) { name, surname };
28

And to parse the entire file we need to perform this sequence over and over until the END keyword is found.

Putting Everything Together

And here is the final result! By following these patterns correctly, we can be quite sure that our parser is rock-solid! And we didn't need extra allocations or anything :D

1#include <stdio.h>
2#include <stdbool.h>
3
4typedef struct {
5 char *ptr;
6 int len;
7} String;
8
9typedef struct {
10 String name;
11 String surname;
12} Person;
13
14typedef struct {
15 char *src;
16 int len;
17 int cur;
18} Scanner;
19
20#define MAX_PEOPLE 100
21
22#define S(X) (String) { (X), (int) sizeof(X)-1 }
23
24bool eqstr(String s1, String s2)
25{
26 if (s1.len != s2.len)
27 return false;
28 for (int i = 0; i < s1.len; i++)
29 if (s1.ptr[i] != s2.ptr[i])
30 return false;
31 return true;
32}
33
34bool is_space(char c)
35{
36 return c == ' ';
37}
38
39bool is_alpha(char c)
40{
41 return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
42}
43
44bool consume_str(Scanner *s, String x)
45{
46 if (x.len == 0)
47 return false;
48
49 if (x.len > s->len - s->cur)
50 return false;
51
52 for (int i = 0; i < x.len; i++)
53 if (s->src[s->cur+i] != x.ptr[i])
54 return false;
55
56 s->cur += x.len;
57 return true;
58}
59
60void consume_spaces(Scanner *s)
61{
62 while (s->cur < s->len && is_space(s->src[s->cur]))
63 s->cur++;
64}
65
66bool consume_word(Scanner *s)
67{
68 if (s->cur == s->len || !is_alpha(s->src[s->cur]))
69 return false;
70 s->cur++;
71
72 while (s->cur < s->len && is_alpha(s->src[s->cur]))
73 s->cur++;
74
75 return true;
76}
77
78int main(void)
79{
80 String file_contents = S("BEGIN\n Linus Torvalds\n Bill Gates\n Steve Jobs\n Mark Facebook\nEND");
81
82 int num_people = 0;
83 Person people[MAX_PEOPLE];
84
85 Scanner s = { file_contents.ptr, file_contents.len, 0 };
86
87 if (!consume_str(&s, S("BEGIN\n")))
88 return -1; // Error
89
90 for (;;) {
91
92 if (consume_str(&s, S("END")))
93 break;
94
95 consume_spaces(&s);
96
97 int name_off = s.cur;
98
99 if (!consume_word(&s))
100 return -1; // Error
101
102 String name = { s.src + name_off, s.cur - name_off };
103
104 consume_spaces(&s);
105
106 int surname_off = s.cur;
107
108 if (!consume_word(&s))
109 return -1;
110
111 String surname = { s.src + surname_off, s.cur - surname_off };
112
113 consume_spaces(&s);
114
115 if (s.cur == s.len || s.src[s.cur] != '\n')
116 return -1;
117 s.cur++;
118
119 if (num_people == MAX_PEOPLE)
120 return -1;
121 people[num_people++] = (Person) { name, surname };
122 }
123
124 bool found = false;
125 for (int i = 0; i < num_people; i++) {
126 if (eqstr(people[i].name, S("Bill")) &&
127 eqstr(people[i].surname, S("Gates"))) {
128 found = true;
129 break;
130 }
131 }
132 if (found)
133 printf("We found Bill Gates!\n");
134 else
135 printf("Not found\n");
136
137 return 0;
138}
139