The Scanner Pattern
Over the years, I have written quite a bit of string manipulation and parsing code in C. Unfortunately, C has no runtime checks that can catch weird behaviors like out-of-bounds array accesses. For this reason, I came up with some safe patterns to follow that allow me to write pretty robust code from the get-go. Of course, one can always make mistakes, but the chances of making them in the first place drop drastically. Not only that, but when one has standard ways of doing things and something breaks, it's quite easy to find the problem by finding where the pattern breaks. The umbrella term I use for these techniques is "The Scanner Pattern", as they all revolve around a "Scanner" object.
The Problem
As I was saying, the Scanner pattern is used to parse information in textual format, such as JSON or HTTP. We usually want textual and flexible formats as they are easier to work with for us humans, but the same can't be said for machines. That's why programs usually translate text into a more native binary representation.
Say we want to write a program that's capable of reading a list of names from a file to determine whether a given name is contained or not. The file uses a human-friendly format:
BEGIN
Linus Torvalds
Bill Gates
Steve Jobs
Mark Facebook
END
It starts and ends with the BEGIN and END keywords, and has a bunch of names in between. The format is insensitive to spaces, so you can have zero or more spaces before and after names and one or more between name and surname.
The result should be an array of structs, each containing a name:
1 | typedef struct { |
2 | char *ptr; |
3 | int len; |
4 | } String; |
5 | |
6 | typedef struct { |
7 | String name; |
8 | String surname; |
9 | } Person; |
10 | |
11 | #define MAX_PEOPLE 100 |
12 | |
13 | #define S(X) (String) { (X), (int) sizeof(X)-1 } |
14 | |
15 | int main(void) |
16 | { |
17 | String file_contents = S("BEGIN\n Linus Torvalds\n Bill Gates\n Steve Jobs\n Mark Facebook\nEND"); |
18 | |
19 | int num_people = 0; |
20 | Person people[MAX_PEOPLE]; |
21 | |
22 | // TODO: parse file_contents and fill up the people array |
23 | |
24 | // TODO: check that the list contains a given name |
25 | |
26 | return 0; |
27 | } |
28 |
The Scanner
The Scanner Pattern is called that way because it's based on a Scanner object:
1 | typedef struct { |
2 | char *src; |
3 | int len; |
4 | int cur; |
5 | } Scanner; |
6 |
The string given by the src
pointer and len
length is the input data. The cur
field contains the next character that needs to be read and is advanced as the program reads the string and extracts information. The cursor starts at 0 and parsing is complete when it reaches the end and is equal to the length of the string.
1 | Scanner s = { file_contents.ptr, file_contents.len, 0 }; |
2 |
I usually use a very short name for it as it's referenced a lot.
Consuming Specific Sequences
BEGIN\n Linus Torvalds\n Bill Gates\n Steve Jobs\n Mark Facebook\nEND
^
cur
The first thing is that we need to check that the file starts with the sequence "BEGIN" followed by a line feed. If it doesn't, we fail the parsing. If it does, we advance the cursor.
1 | if (s.cur == s.len || s.src[s.cur] != 'B') |
2 | return -1; // Error |
3 | s.cur++; |
4 |
This code fails if the cursor doesn't point to the B character. This may be because there is a different character or the string is empty. At the end of this code, if it didn't error, the cursor will point to the second character. We can apply this code multiple times, once per character:
1 | if (s.cur == s.len || s.src[s.cur] != 'B') |
2 | return -1; // Error |
3 | s.cur++; |
4 | |
5 | if (s.cur == s.len || s.src[s.cur] != 'E') |
6 | return -1; // Error |
7 | s.cur++; |
8 | |
9 | if (s.cur == s.len || s.src[s.cur] != 'G') |
10 | return -1; // Error |
11 | s.cur++; |
12 | |
13 | if (s.cur == s.len || s.src[s.cur] != 'I') |
14 | return -1; // Error |
15 | s.cur++; |
16 | |
17 | if (s.cur == s.len || s.src[s.cur] != 'N') |
18 | return -1; // Error |
19 | s.cur++; |
20 | |
21 | if (s.cur == s.len || s.src[s.cur] != '\n') |
22 | return -1; // Error |
23 | s.cur++; |
24 |
This works but is quite verbose. We can group all the checks in a single if statement and then increment the cursor when all of them have passed.
1 | if (s.cur+0 == s.len || s.src[s.cur+0] != 'B' || |
2 | s.cur+1 == s.len || s.src[s.cur+1] != 'E' || |
3 | s.cur+2 == s.len || s.src[s.cur+2] != 'G' || |
4 | s.cur+3 == s.len || s.src[s.cur+3] != 'I' || |
5 | s.cur+4 == s.len || s.src[s.cur+4] != 'N' || |
6 | s.cur+5 == s.len || s.src[s.cur+5] != '\n') |
7 | return -1; // Error |
8 | s.cur += 6; |
9 |
Note that since the last bounds check s.cur+5 == s.len
implies the previous ones, so we can drop them:
1 | if (s.cur+5 >= s.len |
2 | || s.src[s.cur+0] != 'B' |
3 | || s.src[s.cur+1] != 'E' |
4 | || s.src[s.cur+2] != 'G' |
5 | || s.src[s.cur+3] != 'I' |
6 | || s.src[s.cur+4] != 'N' |
7 | || s.src[s.cur+5] != '\n') |
8 | return -1; // Error |
9 | s.cur += 6; |
10 |
If we wanted to be nitpicky, doing bounds checks as 5 >= s.len - s.cur
would reduce the probability of overflowing.
Of course this type of check is only possible when the string we are looking for is known at compile time. We can write a more generic function using a for loop:
1 | bool consume_str(Scanner *s, String x) |
2 | { |
3 | if (x.len == 0) |
4 | return false; |
5 | |
6 | if (x.len > s->len - s->cur) |
7 | return false; |
8 | |
9 | for (int i = 0; i < x.len; i++) |
10 | if (s->src[s->cur+i] != x.ptr[i]) |
11 | return false; |
12 | |
13 | s->cur += x.len; |
14 | return true; |
15 | } |
16 |
This function returns false if the string wasn't found at the cursor's position, while it returns true and advances the cursor otherwise.
We can use this function for the first and last tokens:
1 | Scanner s = { file_contents.ptr, file_contents.len, 0 }; |
2 | |
3 | if (!consume_str(&s, S("BEGIN\n"))) |
4 | return -1; // Error |
5 | |
6 | // .. parse names ... |
7 | |
8 | if (!consume_str(&s, S("END"))) |
9 | return -1; // Error |
10 |
Consuming Optional Sequences
BEGIN\n Linus Torvalds\n Bill Gates\n Steve Jobs\n Mark Facebook\nEND
^
cur
In our example, every name is preceded by two spaces, but we would like this file format to be insensitive to spaces, which means we should also handle any number of spaces or no spaces at all. If there is some space before the element, we need to consume it. If there is no space, the cursor needs to stay still.
To consume a single optional character, we can use this if statement.
1 | if (s.cur < s.len && s.src[s.cur] == ' ') |
2 | s.cur++; |
3 |
If the cursor is at the end of the file or there is something other than a space, the cursor is left unchanged, just as we wanted. Since we want to consume an arbitrary number of spaces, we can turn this into a loop:
1 | while (s.cur < s.len && s.src[s.cur] == ' ') |
2 | s.cur++; |
3 |
This loop is useful whenever we want to consume an optional sequence of characters. We only need to define an expression that returns true for that sequence alone:
1 | while (s.cur < s.len && is_sequence(s.src[s.cur])) |
2 | s.cur++; |
3 |
Consuming Required Sequences
BEGIN\n Linus Torvalds\n Bill Gates\n Steve Jobs\n Mark Facebook\nEND
^
cur
Similarly to spaces, we can see a name as a sequence of letters. Unlike spaces, the name must be present in the file. The way we enforce this is by making sure that there is at least one letter character at the cursor's position. The remaining part of the name can be considered an optional sequence of letters:
1 | if (s.cur == s.len || !is_alpha(s.src[s.cur])) |
2 | return -1; // Missing name |
3 | s.cur++; |
4 | |
5 | while (s.cur < s.len && is_alpha(s.src[s.cur])) |
6 | s.cur++; |
7 |
BEGIN\n Linus Torvalds\n Bill Gates\n Steve Jobs\n Mark Facebook\nEND
^
cur
Extracting Substrings
We managed to identify the start and end position of the first name, but how do we extract that information? This is where our String
type comes in handy.
If we relied on null-terminated strings, we would need to copy the name character by character while we are consuming it into a buffer in order to append a null byte after it. With our string type we don't need to do that. We can create substrings of larger strings by simply pointing inside them!
Say our cursor is pointing to the start of a token we are interested in:
BEGIN\n Linus Torvalds\n Bill Gates\n Steve Jobs\n Mark Facebook\nEND
^
cur = off
We can save the first character's position in a second variable called off
and then consume the token until the end:
BEGIN\n Linus Torvalds\n Bill Gates\n Steve Jobs\n Mark Facebook\nEND
^ ^ cur
off
The pointer to the string "Linus" can be evaluated by adding off
to the start of the source string, and its length as the new cursor position minus the old one.
1 | // ... consume stuff before the token we are interested in ... |
2 | |
3 | int off = s.cur; // Save the index of the first character of the token |
4 | |
5 | // ... consume the token ... |
6 | |
7 | String name = { s.src + off, s.cur - off }; |
8 | |
9 | // ... continue parsing ... |
10 |
This way we can effortlessly extract substrings by simply knowing where each token starts and ends!
Parsing the Entire Line
You may have noticed that we have everything we need now. We managed to parse the first name of the first item and the spaces preceding it. The name is then followed by more spaces, then a surname, and then more spaces. We just need to apply the same patterns again and again.
Let's define a couple of helper functions:
1 | bool is_space(char c) |
2 | { |
3 | return c == ' '; |
4 | } |
5 | |
6 | bool is_alpha(char c) |
7 | { |
8 | return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'); |
9 | } |
10 | |
11 | void consume_spaces(Scanner *s) |
12 | { |
13 | while (s->cur < s->len && is_space(s->src[s->cur])) |
14 | s->cur++; |
15 | } |
16 | |
17 | bool consume_word(Scanner *s) |
18 | { |
19 | if (s->cur == s->len || !is_alpha(s->src[s->cur])) |
20 | return false; |
21 | s->cur++; |
22 | |
23 | while (s->cur < s->len && is_alpha(s->src[s->cur])) |
24 | s->cur++; |
25 | |
26 | return true; |
27 | } |
28 |
Parsing a line from the file becomes quite easy:
1 | consume_spaces(&s); |
2 | |
3 | int name_off = s.cur; |
4 | |
5 | if (!consume_word(&s)) |
6 | return -1; // Error |
7 | |
8 | String name = { s.src + name_off, s.cur - name_off }; |
9 | |
10 | consume_spaces(&s); |
11 | |
12 | int surname_off = s.cur; |
13 | |
14 | if (!consume_word(&s)) |
15 | return -1; |
16 | |
17 | String surname = { s.src + surname_off, s.cur - surname_off }; |
18 | |
19 | consume_spaces(&s); |
20 | |
21 | if (s.cur == s.len || s.src[s.cur] != '\n') |
22 | return -1; |
23 | s.cur++; |
24 | |
25 | if (num_people == MAX_PEOPLE) |
26 | return -1; |
27 | people[num_people++] = (Person) { name, surname }; |
28 |
And to parse the entire file we need to perform this sequence over and over until the END keyword is found.
Putting Everything Together
And here is the final result! By following these patterns correctly, we can be quite sure that our parser is rock-solid! And we didn't need extra allocations or anything :D
1 | #include <stdio.h> |
2 | #include <stdbool.h> |
3 | |
4 | typedef struct { |
5 | char *ptr; |
6 | int len; |
7 | } String; |
8 | |
9 | typedef struct { |
10 | String name; |
11 | String surname; |
12 | } Person; |
13 | |
14 | typedef struct { |
15 | char *src; |
16 | int len; |
17 | int cur; |
18 | } Scanner; |
19 | |
20 | #define MAX_PEOPLE 100 |
21 | |
22 | #define S(X) (String) { (X), (int) sizeof(X)-1 } |
23 | |
24 | bool eqstr(String s1, String s2) |
25 | { |
26 | if (s1.len != s2.len) |
27 | return false; |
28 | for (int i = 0; i < s1.len; i++) |
29 | if (s1.ptr[i] != s2.ptr[i]) |
30 | return false; |
31 | return true; |
32 | } |
33 | |
34 | bool is_space(char c) |
35 | { |
36 | return c == ' '; |
37 | } |
38 | |
39 | bool is_alpha(char c) |
40 | { |
41 | return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'); |
42 | } |
43 | |
44 | bool consume_str(Scanner *s, String x) |
45 | { |
46 | if (x.len == 0) |
47 | return false; |
48 | |
49 | if (x.len > s->len - s->cur) |
50 | return false; |
51 | |
52 | for (int i = 0; i < x.len; i++) |
53 | if (s->src[s->cur+i] != x.ptr[i]) |
54 | return false; |
55 | |
56 | s->cur += x.len; |
57 | return true; |
58 | } |
59 | |
60 | void consume_spaces(Scanner *s) |
61 | { |
62 | while (s->cur < s->len && is_space(s->src[s->cur])) |
63 | s->cur++; |
64 | } |
65 | |
66 | bool consume_word(Scanner *s) |
67 | { |
68 | if (s->cur == s->len || !is_alpha(s->src[s->cur])) |
69 | return false; |
70 | s->cur++; |
71 | |
72 | while (s->cur < s->len && is_alpha(s->src[s->cur])) |
73 | s->cur++; |
74 | |
75 | return true; |
76 | } |
77 | |
78 | int main(void) |
79 | { |
80 | String file_contents = S("BEGIN\n Linus Torvalds\n Bill Gates\n Steve Jobs\n Mark Facebook\nEND"); |
81 | |
82 | int num_people = 0; |
83 | Person people[MAX_PEOPLE]; |
84 | |
85 | Scanner s = { file_contents.ptr, file_contents.len, 0 }; |
86 | |
87 | if (!consume_str(&s, S("BEGIN\n"))) |
88 | return -1; // Error |
89 | |
90 | for (;;) { |
91 | |
92 | if (consume_str(&s, S("END"))) |
93 | break; |
94 | |
95 | consume_spaces(&s); |
96 | |
97 | int name_off = s.cur; |
98 | |
99 | if (!consume_word(&s)) |
100 | return -1; // Error |
101 | |
102 | String name = { s.src + name_off, s.cur - name_off }; |
103 | |
104 | consume_spaces(&s); |
105 | |
106 | int surname_off = s.cur; |
107 | |
108 | if (!consume_word(&s)) |
109 | return -1; |
110 | |
111 | String surname = { s.src + surname_off, s.cur - surname_off }; |
112 | |
113 | consume_spaces(&s); |
114 | |
115 | if (s.cur == s.len || s.src[s.cur] != '\n') |
116 | return -1; |
117 | s.cur++; |
118 | |
119 | if (num_people == MAX_PEOPLE) |
120 | return -1; |
121 | people[num_people++] = (Person) { name, surname }; |
122 | } |
123 | |
124 | bool found = false; |
125 | for (int i = 0; i < num_people; i++) { |
126 | if (eqstr(people[i].name, S("Bill")) && |
127 | eqstr(people[i].surname, S("Gates"))) { |
128 | found = true; |
129 | break; |
130 | } |
131 | } |
132 | if (found) |
133 | printf("We found Bill Gates!\n"); |
134 | else |
135 | printf("Not found\n"); |
136 | |
137 | return 0; |
138 | } |
139 |