String Parsing Patterns

Over the years, I came up with several patterns for parsing strings without a tokenization step. I found that following these patterns consciously made it easy to write robust string parsing code.

In this post I'll list them and show how to use them to make a simplified JSON parser. Code snippets will use the C language, but the logic applies to any C-like language.

General Structure

Let's assume char *src is a string of length int len we want to extract information from. We will use the int cur cursor variable to keep track of how many characters we already inspected from the string. The cursor is such that src[cur] is the next character that hasn't been inspected yet. If cur == len, all characters have been inspected. Note that I sometimes refer to inspecting characters as "skipping" or "consuming" them.

Consuming an Optional Sequence

To move the cursor over a sequence of characters, you write a loop like this:

1	while (cur < len && test(src[cur]))
2	cur++;
3

the sequence is defined by the test function. All and only characters which pass the test are consumed. If no characters pass the test, the cursor is left unchanged. This is why I call this "consuming an optional sequence".

Consuming a Required Sequence

What if we want to consume a non optional sequence? In other words, the sequence must contain at least one character. The way this is done is by adding a first if statement:

1	if (cur == len \|\| !test(src[cur]))
2	return; // Error
3	do
4	cur++;
5	while (cur < len && test(src[cur]));
6

we added an if statement which checks the negated version of the loop's condition. If the if's check passes, we know there is at least one character, and therefore can use a do..while instead of a regular while loop.

Consuming a Specific Character

We can consume an optional character like this:

1	if (cur < len && src[cur] == 'X')
2	cur++;
3

and we can consume a required character like this:

1	if (cur == len \|\| src[cur] != 'X')
2	return; // Error
3	cur++;
4

Consuming a Specific String

To consume a specific string, we can of course consume the single characters:

1	if (cur == len \|\| src[cur] != 'A')
2	return; // Error
3	cur++;
4
5	if (cur == len \|\| src[cur] != 'B')
6	return; // Error
7	cur++;
8
9	if (cur == len \|\| src[cur] != 'C')
10	return; // Error
11	cur++;
12

but we can do better:

1	if (2 >= len - cur
2	\|\| src[cur+0] != 'A'
3	\|\| src[cur+1] != 'B'
4	\|\| src[cur+2] != 'C')
5	return; // Error
6	cur += 3;
7

This way we perform a single bounds check and it's more compact overall. This isn't very intuitive for me either, so this is what I think when writing this.

First, I ignore the string's bounds and check the characters:

1	if (
2	\|\| src[cur+0] != 'A'
3	\|\| src[cur+1] != 'B'
4	\|\| src[cur+2] != 'C')
5	return; // Error
6	cur += 3;
7

the index furthest away from the cursor is cur+2. If this offset is out of bounds, they all are. Therefore the checks are valid only if cur+2 < len, and they are surely not valid when cur+2 >= len. We add this to our expression:

1	if (cur+2 >= len
2	\|\| src[cur+0] != 'A'
3	\|\| src[cur+1] != 'B'
4	\|\| src[cur+2] != 'C')
5	return; // Error
6	cur += 3;
7

if we want to be extra careful here we can improve the bounds check to avoid overflow

1	if (2 >= len - cur
2	\|\| src[cur+0] != 'A'
3	\|\| src[cur+1] != 'B'
4	\|\| src[cur+2] != 'C')
5	return; // Error
6	cur += 3;
7

this is because len - cur can never underflow, while cur+2 may overflow. This is very nitpicky though.

Parsing Strings

What if we want to not only consume characters, but extract information from the source string? Fear not! We can easily use our previous lessons for this. Let's say we want to parse a string in the form <spaces> <name> <spaces> <nickname>:

1	int off;
2
3	while (cur < len && is_whitespace(src[cur]))
4	cur++;
5
6	off = cur;
7	while (cur < len && !is_whitespace(src[cur]))
8	cur++;
9	char *name_ptr = src + off;
10	int name_len = cur - off;
11
12	while (cur < len && is_whitespace(src[cur]))
13	cur++;
14
15	off = cur;
16	while (cur < len && !is_whitespace(src[cur]))
17	cur++;
18	char *nick_ptr = src + off;
19	int nick_len = cur - off;
20
21	while (cur < len && is_whitespace(src[cur]))
22	cur++;
23

with this, we end up with pointers to name and nickname and their relative lengths, with no memory allocation! Note that the strings are not null-terminated. If you want to print them with printf, you'll have to use this format specifier:

1	printf("Hello, I'm %.s and my nickname is %.s\n", name_len, name_ptr, nick_len, nick_ptr);
2

but note that using !is_whitespace(src[cur]) (with the negation) is inherently unsafe. The robust solution is to always specify which characters are allowed, not the other way around. The robust version should use something like this instead of just allowing non-whitespace characters:

1	char is_name_or_nick(char c)
2	{
3	return (c >= 'a' && c <= 'z') \|\| (c >= 'A' && c <= 'Z');
4	}
5

Parsing Integers

This is how we can scan a sequence of digits and build the equivalent integer representation of it:

1	#include <limits.h> // Defines INT_MAX
2
3	typedef struct {
4	char *src;
5	int len;
6	int cur;
7	} Scanner;
8
9	int is_digit(char c)
10	{
11	return c >= '0' && c <= '9';
12	}
13
14	int parse_int(Scanner s, int out)
15	{
16	if (s->cur == s->len \|\| !is_digit(s->src[s->cur]))
17	return 0; // Error
18
19	int x = 0;
20	do {
21	int d = s->src[s->cur++] - '0';
22	if (x > (INT_MAX - d) / 10)
23	return 0; // Overflow
24	x = x * 10 + d;
25	} while (s->cur < s->len && is_digit(s->src[s->cur]));
26
27	*out = x;
28	return 1;
29	}
30

Parsing Floats

This is how one can parse floats:

1	typedef struct {
2	char *src;
3	int len;
4	int cur;
5	} Scanner;
6
7	int is_digit(char c)
8	{
9	return c >= '0' && c <= '9';
10	}
11
12	int parse_float(Scanner s, float out)
13	{
14	if (s->cur == s->len \|\| !is_digit(s->src[s->cur]))
15	return 0; // Error
16
17	float x = 0;
18	do {
19	int d = s->src[s->cur++] - '0';
20	x = x * 10 + d;
21	} while (s->cur < s->len && is_digit(s->src[s->cur]));
22
23	if (s->cur == s->len \|\| s->src[s->cur] != '.')
24	return 0; // Error
25	s->cur++;
26
27	if (s->cur == s->len \|\| !is_digit(s->src[s->cur]))
28	return 0; // Error
29	float q = 1;
30	do {
31	q /= 10;
32	x += q * (s->src[s->cur++] - '0');
33	} while (s->cur < s->len && is_digit(s->src[s->cur]));
34
35	*out = x;
36	return 1;
37	}
38

this code looks quite involved, but actually it's just the same pattern applied over and over.

Example

This is an example JSON parser written using the previous patterns. For simplicity I just output the parsed values instead of building an in-memory representation.

1	#define ASSERT(X) {if (!(X)) __builtin_trap();}
2
3	typedef struct {
4	char *src;
5	int len;
6	int cur;
7	} Scanner;
8
9	int is_digit(char c)
10	{
11	return c >= '0' && c <= '9';
12	}
13
14	int is_whitespace(char c)
15	{
16	return c == ' ' \|\| c == '\t' \|\| c == '\r' \|\| c == '\n';
17	}
18
19	int parse_string(Scanner *s)
20	{
21	ASSERT(s->cur < s->len && s->src[s->cur] == '"');
22	s->cur++;
23
24	int off = s->cur;
25	while (s->cur < s->len && s->src[s->cur] != '"')
26	s->cur++;
27	char *p = s->src + off;
28	int l = s->cur - off;
29
30	if (s->cur == s->len)
31	return 0; // Error
32	s->cur++;
33
34	printf("Parsed string \"%.*s\"\n", l, p);
35
36	return 1;
37	}
38
39	int parse_object(Scanner *s)
40	{
41	ASSERT(s->cur < s->len && s->src[s->cur] == '{');
42	s->cur++;
43
44	while (s->cur < s->len && is_whitespace(s->src[s->cur]))
45	s->cur++;
46
47	if (s->cur < s->len && s->src[s->cur] != '}') {
48
49	for (;;) {
50
51	if (!parse_string(s))
52	return 0; // Propagate error
53
54	while (s->cur < s->len && is_whitespace(s->src[s->cur]))
55	s->cur++;
56
57	if (s->cur == s->len \|\| s->src[s->cur] != ':')
58	return 0; // Error
59	s->cur++;
60
61	while (s->cur < s->len && is_whitespace(s->src[s->cur]))
62	s->cur++;
63
64	if (!parse_value(s))
65	return 0; // Propagate error
66
67	while (s->cur < s->len && is_whitespace(s->src[s->cur]))
68	s->cur++;
69
70	if (s->cur < s->len && s->src[s->cur] == '}')
71	break;
72
73	if (s->cur == s->len \|\| s->src[s->cur] != ',')
74	return 0; // Error
75	s->cur++;
76
77	while (s->cur < s->len && is_whitespace(s->src[s->cur]))
78	s->cur++;
79	}
80	}
81	ASSERT(s->cur < s->len && s->src[s->cur] == '}');
82	s->cur++;
83	return 1;
84	}
85
86	int parse_array(Scanner *s)
87	{
88	ASSERT(s->cur < s->len && s->src[s->cur] == '[');
89	s->cur++;
90
91	while (s->cur < s->len && is_whitespace(s->src[s->cur]))
92	s->cur++;
93
94	if (s->cur < s->len && s->src[s->cur] != ']') {
95
96	for (;;) {
97
98	if (!parse_value(s))
99	return 0; // Propagate error
100
101	while (s->cur < s->len && is_whitespace(s->src[s->cur]))
102	s->cur++;
103
104	if (s->cur < s->len && s->src[s->cur] == ']')
105	break;
106
107	if (s->cur == s->len \|\| s->src[s->cur] != ',')
108	return 0; // Error
109	s->cur++;
110
111	while (s->cur < s->len && is_whitespace(s->src[s->cur]))
112	s->cur++;
113	}
114	}
115	ASSERT(s->cur < s->len && s->src[s->cur] == ']');
116	s->cur++;
117	return 1;
118	}
119
120	int parse_int(Scanner *s)
121	{
122	if (s->cur == s->len \|\| !is_digit(s->src[s->cur]))
123	return 0; // Error
124
125	int x = 0;
126	do {
127	int d = s->src[s->cur++] - '0';
128	if (x > (INT_MAX - d) / 10)
129	return 0; // Overflow
130	x = x * 10 + d;
131	} while (s->cur < s->len && is_digit(s->src[s->cur]));
132
133	printf("Parsed integer %d\n", x);
134	return 1;
135	}
136
137	int parse_float(Scanner *s)
138	{
139	if (s->cur == s->len \|\| !is_digit(s->src[s->cur]))
140	return 0; // Error
141
142	float x = 0;
143	do {
144	int d = s->src[s->cur++] - '0';
145	x = x * 10 + d;
146	// Note that parse_number already made sure
147	// the sequence of digits terminated with a dot,
148	// so we can omit the cur<len check
149	} while (s->src[s->cur] != '.');
150
151	if (s->cur == s->len \|\| s->src[s->cur] != '.')
152	return 0; // Error
153	s->cur++;
154
155	if (s->cur == s->len \|\| !is_digit(s->src[s->cur]))
156	return 0; // Error
157	float q = 1;
158	do {
159	q /= 10;
160	x += q * (s->src[s->cur++] - '0');
161	} while (s->cur < s->len && is_digit(s->src[s->cur]));
162
163	printf("Parsed float %f\n", x);
164	return 1;
165	}
166
167	int parse_number(Scanner *s)
168	{
169	int peek = s->cur;
170	while (peek < s->len && is_digit(s->src[peek]))
171	peek++;
172
173	if (peek == s->len \|\| s->src[peek] != '.')
174	return parse_int(s);
175
176	return parse_float(s);
177	}
178
179	// Parses any type of value. The cursor must point to the first
180	// character of the value. Any preceding whitespace must be
181	// consume by the caller.
182	int parse_value(Scanner *s)
183	{
184	if (s->cur == s->len)
185	return 0; // Error
186
187	if (s->src[s->cur] == '"')
188	return parse_string(s);
189
190	if (s->src[s->cur] == '{')
191	return parse_object(s);
192
193	if (s->src[s->cur] == '[')
194	return parse_array(s);
195
196	if (is_digit(s->src[s->cur]))
197	return parse_number(s);
198
199	if (3 < s->len - s->cur
200	&& s->src[s->cur+0] == 't'
201	&& s->src[s->cur+1] == 'r'
202	&& s->src[s->cur+2] == 'u'
203	&& s->src[s->cur+3] == 'e') {
204	s->cur += 4;
205	printf("Parsed true\n");
206	return 1;
207	}
208
209	if (3 < s->len - s->cur
210	&& s->src[s->cur+0] == 'n'
211	&& s->src[s->cur+1] == 'u'
212	&& s->src[s->cur+2] == 'l'
213	&& s->src[s->cur+3] == 'l') {
214	s->cur += 4;
215	printf("Parsed null\n");
216	return 1;
217	}
218
219	if (4 < s->len - s->cur
220	&& s->src[s->cur+0] == 'f'
221	&& s->src[s->cur+1] == 'a'
222	&& s->src[s->cur+2] == 'l'
223	&& s->src[s->cur+3] == 's'
224	&& s->src[s->cur+4] == 'e') {
225	s->cur += 5;
226	printf("Parsed false\n");
227	return 1;
228	}
229
230	// Invalid character
231	return 0;
232	}
233
234	// Returns 0 on error, 1 on success
235	int parse_json(char *src, int len)
236	{
237	Scanner s = {src, len, 0};
238
239	while (s.cur < s.len && is_whitespace(s.src[s.cur]))
240	s.cur++;
241
242	if (!parse_value(&s))
243	return 0;
244
245	if (s.cur == s.len)
246	return 0;
247
248	return 1;
249	}
250