The Scanner Pattern

Over the years, I have written quite a bit of string manipulation and parsing code in C. Unfortunately, C has no runtime checks that can catch weird behaviors like out-of-bounds array accesses. For this reason, I came up with some safe patterns to follow that allow me to write pretty robust code from the get-go. Of course, one can always make mistakes, but the chances of making them in the first place drop drastically. Not only that, but when one has standard ways of doing things and something breaks, it's quite easy to find the problem by finding where the pattern breaks. The umbrella term I use for these techniques is "The Scanner Pattern", as they all revolve around a "Scanner" object.

The Problem

As I was saying, the Scanner pattern is used to parse information in textual format, such as JSON or HTTP. We usually want textual and flexible formats as they are easier to work with for us humans, but the same can't be said for machines. That's why programs usually translate text into a more native binary representation.

Say we want to write a program that's capable of reading a list of names from a file to determine whether a given name is contained or not. The file uses a human-friendly format:

BEGIN
  Linus  Torvalds
  Bill   Gates
  Steve  Jobs
  Mark   Facebook
END

It starts and ends with the BEGIN and END keywords, and has a bunch of names in between. The format is insensitive to spaces, so you can have zero or more spaces before and after names and one or more between name and surname.

The result should be an array of structs, each containing a name:

1	typedef struct {
2	char *ptr;
3	int len;
4	} String;
5
6	typedef struct {
7	String name;
8	String surname;
9	} Person;
10
11	#define MAX_PEOPLE 100
12
13	#define S(X) (String) { (X), (int) sizeof(X)-1 }
14
15	int main(void)
16	{
17	String file_contents = S("BEGIN\n Linus Torvalds\n Bill Gates\n Steve Jobs\n Mark Facebook\nEND");
18
19	int num_people = 0;
20	Person people[MAX_PEOPLE];
21
22	// TODO: parse file_contents and fill up the people array
23
24	// TODO: check that the list contains a given name
25
26	return 0;
27	}
28

The Scanner

The Scanner Pattern is called that way because it's based on a Scanner object:

1	typedef struct {
2	char *src;
3	int len;
4	int cur;
5	} Scanner;
6

The string given by the src pointer and len length is the input data. The cur field contains the next character that needs to be read and is advanced as the program reads the string and extracts information. The cursor starts at 0 and parsing is complete when it reaches the end and is equal to the length of the string.

1	Scanner s = { file_contents.ptr, file_contents.len, 0 };
2

I usually use a very short name for it as it's referenced a lot.

Consuming Specific Sequences

BEGIN\n  Linus  Torvalds\n  Bill   Gates\n  Steve  Jobs\n  Mark   Facebook\nEND
^
cur

The first thing is that we need to check that the file starts with the sequence "BEGIN" followed by a line feed. If it doesn't, we fail the parsing. If it does, we advance the cursor.

1	if (s.cur == s.len \|\| s.src[s.cur] != 'B')
2	return -1; // Error
3	s.cur++;
4

This code fails if the cursor doesn't point to the B character. This may be because there is a different character or the string is empty. At the end of this code, if it didn't error, the cursor will point to the second character. We can apply this code multiple times, once per character:

1	if (s.cur == s.len \|\| s.src[s.cur] != 'B')
2	return -1; // Error
3	s.cur++;
4
5	if (s.cur == s.len \|\| s.src[s.cur] != 'E')
6	return -1; // Error
7	s.cur++;
8
9	if (s.cur == s.len \|\| s.src[s.cur] != 'G')
10	return -1; // Error
11	s.cur++;
12
13	if (s.cur == s.len \|\| s.src[s.cur] != 'I')
14	return -1; // Error
15	s.cur++;
16
17	if (s.cur == s.len \|\| s.src[s.cur] != 'N')
18	return -1; // Error
19	s.cur++;
20
21	if (s.cur == s.len \|\| s.src[s.cur] != '\n')
22	return -1; // Error
23	s.cur++;
24

This works but is quite verbose. We can group all the checks in a single if statement and then increment the cursor when all of them have passed.

1	if (s.cur+0 == s.len \|\| s.src[s.cur+0] != 'B' \|\|
2	s.cur+1 == s.len \|\| s.src[s.cur+1] != 'E' \|\|
3	s.cur+2 == s.len \|\| s.src[s.cur+2] != 'G' \|\|
4	s.cur+3 == s.len \|\| s.src[s.cur+3] != 'I' \|\|
5	s.cur+4 == s.len \|\| s.src[s.cur+4] != 'N' \|\|
6	s.cur+5 == s.len \|\| s.src[s.cur+5] != '\n')
7	return -1; // Error
8	s.cur += 6;
9

Note that since the last bounds check s.cur+5 == s.len implies the previous ones, so we can drop them:

1	if (s.cur+5 >= s.len
2	\|\| s.src[s.cur+0] != 'B'
3	\|\| s.src[s.cur+1] != 'E'
4	\|\| s.src[s.cur+2] != 'G'
5	\|\| s.src[s.cur+3] != 'I'
6	\|\| s.src[s.cur+4] != 'N'
7	\|\| s.src[s.cur+5] != '\n')
8	return -1; // Error
9	s.cur += 6;
10

If we wanted to be nitpicky, doing bounds checks as 5 >= s.len - s.cur would reduce the probability of overflowing.

Of course this type of check is only possible when the string we are looking for is known at compile time. We can write a more generic function using a for loop:

1	bool consume_str(Scanner *s, String x)
2	{
3	if (x.len == 0)
4	return false;
5
6	if (x.len > s->len - s->cur)
7	return false;
8
9	for (int i = 0; i < x.len; i++)
10	if (s->src[s->cur+i] != x.ptr[i])
11	return false;
12
13	s->cur += x.len;
14	return true;
15	}
16

This function returns false if the string wasn't found at the cursor's position, while it returns true and advances the cursor otherwise.

We can use this function for the first and last tokens:

1	Scanner s = { file_contents.ptr, file_contents.len, 0 };
2
3	if (!consume_str(&s, S("BEGIN\n")))
4	return -1; // Error
5
6	// .. parse names ...
7
8	if (!consume_str(&s, S("END")))
9	return -1; // Error
10

Consuming Optional Sequences

BEGIN\n  Linus  Torvalds\n  Bill   Gates\n  Steve  Jobs\n  Mark   Facebook\nEND
       ^
       cur

In our example, every name is preceded by two spaces, but we would like this file format to be insensitive to spaces, which means we should also handle any number of spaces or no spaces at all. If there is some space before the element, we need to consume it. If there is no space, the cursor needs to stay still.

To consume a single optional character, we can use this if statement.

1	if (s.cur < s.len && s.src[s.cur] == ' ')
2	s.cur++;
3

If the cursor is at the end of the file or there is something other than a space, the cursor is left unchanged, just as we wanted. Since we want to consume an arbitrary number of spaces, we can turn this into a loop:

1	while (s.cur < s.len && s.src[s.cur] == ' ')
2	s.cur++;
3

This loop is useful whenever we want to consume an optional sequence of characters. We only need to define an expression that returns true for that sequence alone:

1	while (s.cur < s.len && is_sequence(s.src[s.cur]))
2	s.cur++;
3

Consuming Required Sequences

BEGIN\n  Linus  Torvalds\n  Bill   Gates\n  Steve  Jobs\n  Mark   Facebook\nEND
         ^
         cur

Similarly to spaces, we can see a name as a sequence of letters. Unlike spaces, the name must be present in the file. The way we enforce this is by making sure that there is at least one letter character at the cursor's position. The remaining part of the name can be considered an optional sequence of letters:

1	if (s.cur == s.len \|\| !is_alpha(s.src[s.cur]))
2	return -1; // Missing name
3	s.cur++;
4
5	while (s.cur < s.len && is_alpha(s.src[s.cur]))
6	s.cur++;
7

BEGIN\n  Linus  Torvalds\n  Bill   Gates\n  Steve  Jobs\n  Mark   Facebook\nEND
              ^
              cur

Extracting Substrings

We managed to identify the start and end position of the first name, but how do we extract that information? This is where our String type comes in handy.

If we relied on null-terminated strings, we would need to copy the name character by character while we are consuming it into a buffer in order to append a null byte after it. With our string type we don't need to do that. We can create substrings of larger strings by simply pointing inside them!

Say our cursor is pointing to the start of a token we are interested in:

BEGIN\n  Linus  Torvalds\n  Bill   Gates\n  Steve  Jobs\n  Mark   Facebook\nEND
         ^
         cur = off

We can save the first character's position in a second variable called off and then consume the token until the end:

BEGIN\n  Linus  Torvalds\n  Bill   Gates\n  Steve  Jobs\n  Mark   Facebook\nEND
         ^    ^ cur
         off

The pointer to the string "Linus" can be evaluated by adding off to the start of the source string, and its length as the new cursor position minus the old one.

1	// ... consume stuff before the token we are interested in ...
2
3	int off = s.cur; // Save the index of the first character of the token
4
5	// ... consume the token ...
6
7	String name = { s.src + off, s.cur - off };
8
9	// ... continue parsing ...
10

This way we can effortlessly extract substrings by simply knowing where each token starts and ends!

Parsing the Entire Line

You may have noticed that we have everything we need now. We managed to parse the first name of the first item and the spaces preceding it. The name is then followed by more spaces, then a surname, and then more spaces. We just need to apply the same patterns again and again.

Let's define a couple of helper functions:

1	bool is_space(char c)
2	{
3	return c == ' ';
4	}
5
6	bool is_alpha(char c)
7	{
8	return (c >= 'a' && c <= 'z') \|\| (c >= 'A' && c <= 'Z');
9	}
10
11	void consume_spaces(Scanner *s)
12	{
13	while (s->cur < s->len && is_space(s->src[s->cur]))
14	s->cur++;
15	}
16
17	bool consume_word(Scanner *s)
18	{
19	if (s->cur == s->len \|\| !is_alpha(s->src[s->cur]))
20	return false;
21	s->cur++;
22
23	while (s->cur < s->len && is_alpha(s->src[s->cur]))
24	s->cur++;
25
26	return true;
27	}
28

Parsing a line from the file becomes quite easy:

1	consume_spaces(&s);
2
3	int name_off = s.cur;
4
5	if (!consume_word(&s))
6	return -1; // Error
7
8	String name = { s.src + name_off, s.cur - name_off };
9
10	consume_spaces(&s);
11
12	int surname_off = s.cur;
13
14	if (!consume_word(&s))
15	return -1;
16
17	String surname = { s.src + surname_off, s.cur - surname_off };
18
19	consume_spaces(&s);
20
21	if (s.cur == s.len \|\| s.src[s.cur] != '\n')
22	return -1;
23	s.cur++;
24
25	if (num_people == MAX_PEOPLE)
26	return -1;
27	people[num_people++] = (Person) { name, surname };
28

And to parse the entire file we need to perform this sequence over and over until the END keyword is found.

Putting Everything Together

And here is the final result! By following these patterns correctly, we can be quite sure that our parser is rock-solid! And we didn't need extra allocations or anything :D

1	#include <stdio.h>
2	#include <stdbool.h>
3
4	typedef struct {
5	char *ptr;
6	int len;
7	} String;
8
9	typedef struct {
10	String name;
11	String surname;
12	} Person;
13
14	typedef struct {
15	char *src;
16	int len;
17	int cur;
18	} Scanner;
19
20	#define MAX_PEOPLE 100
21
22	#define S(X) (String) { (X), (int) sizeof(X)-1 }
23
24	bool eqstr(String s1, String s2)
25	{
26	if (s1.len != s2.len)
27	return false;
28	for (int i = 0; i < s1.len; i++)
29	if (s1.ptr[i] != s2.ptr[i])
30	return false;
31	return true;
32	}
33
34	bool is_space(char c)
35	{
36	return c == ' ';
37	}
38
39	bool is_alpha(char c)
40	{
41	return (c >= 'a' && c <= 'z') \|\| (c >= 'A' && c <= 'Z');
42	}
43
44	bool consume_str(Scanner *s, String x)
45	{
46	if (x.len == 0)
47	return false;
48
49	if (x.len > s->len - s->cur)
50	return false;
51
52	for (int i = 0; i < x.len; i++)
53	if (s->src[s->cur+i] != x.ptr[i])
54	return false;
55
56	s->cur += x.len;
57	return true;
58	}
59
60	void consume_spaces(Scanner *s)
61	{
62	while (s->cur < s->len && is_space(s->src[s->cur]))
63	s->cur++;
64	}
65
66	bool consume_word(Scanner *s)
67	{
68	if (s->cur == s->len \|\| !is_alpha(s->src[s->cur]))
69	return false;
70	s->cur++;
71
72	while (s->cur < s->len && is_alpha(s->src[s->cur]))
73	s->cur++;
74
75	return true;
76	}
77
78	int main(void)
79	{
80	String file_contents = S("BEGIN\n Linus Torvalds\n Bill Gates\n Steve Jobs\n Mark Facebook\nEND");
81
82	int num_people = 0;
83	Person people[MAX_PEOPLE];
84
85	Scanner s = { file_contents.ptr, file_contents.len, 0 };
86
87	if (!consume_str(&s, S("BEGIN\n")))
88	return -1; // Error
89
90	for (;;) {
91
92	if (consume_str(&s, S("END")))
93	break;
94
95	consume_spaces(&s);
96
97	int name_off = s.cur;
98
99	if (!consume_word(&s))
100	return -1; // Error
101
102	String name = { s.src + name_off, s.cur - name_off };
103
104	consume_spaces(&s);
105
106	int surname_off = s.cur;
107
108	if (!consume_word(&s))
109	return -1;
110
111	String surname = { s.src + surname_off, s.cur - surname_off };
112
113	consume_spaces(&s);
114
115	if (s.cur == s.len \|\| s.src[s.cur] != '\n')
116	return -1;
117	s.cur++;
118
119	if (num_people == MAX_PEOPLE)
120	return -1;
121	people[num_people++] = (Person) { name, surname };
122	}
123
124	bool found = false;
125	for (int i = 0; i < num_people; i++) {
126	if (eqstr(people[i].name, S("Bill")) &&
127	eqstr(people[i].surname, S("Gates"))) {
128	found = true;
129	break;
130	}
131	}
132	if (found)
133	printf("We found Bill Gates!\n");
134	else
135	printf("Not found\n");
136
137	return 0;
138	}
139

1	if (s.cur+0 == s.len \|\| s.src[s.cur+0] != 'B' \|\|
2	s.cur+1 == s.len \|\| s.src[s.cur+1] != 'E' \|\|
3	s.cur+2 == s.len \|\| s.src[s.cur+2] != 'G' \|\|
4	s.cur+3 == s.len \|\| s.src[s.cur+3] != 'I' \|\|
5	s.cur+4 == s.len \|\| s.src[s.cur+4] != 'N' \|\|
6	s.cur+5 == s.len \|\| s.src[s.cur+5] != '\n')
7	return -1; // Error
8	s.cur += 6;
9

1	if (s.cur+5 >= s.len
2	\|\| s.src[s.cur+0] != 'B'
3	\|\| s.src[s.cur+1] != 'E'
4	\|\| s.src[s.cur+2] != 'G'
5	\|\| s.src[s.cur+3] != 'I'
6	\|\| s.src[s.cur+4] != 'N'
7	\|\| s.src[s.cur+5] != '\n')
8	return -1; // Error
9	s.cur += 6;
10