Writing an HTTP compliant parser in C

Recently I published a library called cHTTP on GitHub, which is essentially my private networking code polished and made reusable. Yesterday I decided I would go through the HTTP request parser and make sure it is compliant to spec, so I took the opportunity to write about it. In this post I will go throgh the specification to turn it into an HTTP request parser using C.

This article will focus more on following the specification than the actual parsing code. If you're not familiar with string parsing in C, consider reading my post about the scanner pattern where I document my way of doing things. Also, if for some reason you're not familiar with HTTP protocol or it's syntax, you should go read this post.

How to Read the Spec

The HTTP protocol is standardized by documents called RFCs. Often new documents that obsolete old ones are published. For instance the first RFC for HTTP 1.1 was RFC 2616, but the latest specification at the time of writing is RFC 9112.

If we take a look at RFC 9112, we'll find that in section 2 the syntax of an "HTTP message" is defined:

HTTP-message = start-line CRLF
               *( field-line CRLF )
               CRLF
               [ message-body ]

It's expressed using a notation called Augmented Backus-Naur Form (ABNF) and this is essentially what we'll need to turn into code. I'll go into its details as we write code, but if you want to learn about it in more detail, read section 1.2.

The basic idea is that every "token" (start-line, CRLF, field-line, message-body) represents a pattern of characters called "rules". Token representing rules can be grouped into higher level rules, such as this one. Tokens with uppercase names usually are primitive tokens that represent simple characters. The special characters such as parenthesis or the asterisk specify how these tokens combine to make the higher level rule. For instance, the asterisk means that the tokens following it can repeat zero or more times and the square brackets mean that the tokens inside it are optional.

So our job will be to read the document, unpack these rules and turn them into code. Now let's dive in!

Parsing the Request Line

Let's make sense of that HTTP-message rule by comparing it to a practical example of HTTP request

HTTP-message = start-line CRLF
               *( field-line CRLF )
               CRLF
               [ message-body ]

GET /index.html HTTP/1.1\r\n
Host: coz.is\r\n
User-Agent: curl/7.81.0\r\n
\r\n

A quick string search of CRLF shows that it is defined in section 1.2 as a carriage return and line feed \r\n. So intuitively we can observe that:

start-line   => GET /index.html HTTP/1.1
field-line   => Host: coz.is
field-line   => User-Agent: curl/7.81.0
message-body => (empty)

The start-line matches the line where method, path, and version are specified, the field-line matches the headers, and the message-body matches the body (which is not present here).

Let's unpack the rule further by looking at how start-line is defined (as always, we do a string search in the document):

start-line = request-line / status-line

request-line = method SP request-target SP HTTP-version

The slash / means only one of the two rules apply. Since we are parsing HTTP requests, we are interested in the request-line rule. Just like the CRLF token, the SP token is defined in section 1.2 as a single space.

method         => GET
request-target => /index.html
HTTP-version   => HTTP/1.1

Parsing the Method

If we look for the method rule, we find that it's definition references some other document called [HTTP]

method = token

token = <token, see [HTTP], Section 5.6.2>

If we follow the reference (look for [HTTP] in the references section) we find that the rule is defined in RFC 9110

token = 1*tchar

tchar = "!" / "#" / "$" / "%" / "&" / "'" / "*"
      / "+" / "-" / "." / "^" / "_" / "`" / "|" / "~"
      / DIGIT / ALPHA
      ; any VCHAR, except delimiters

Now we have a good idea of how to parse the method. We want to parse one or more tchars followed by a space! That's easy enough:

1
2	// DIGIT
3	bool is_digit(char c)
4	{
5	return c >= '0' && c <= '9';
6	}
7
8	// ALPHA
9	bool is_alpha(char c)
10	{
11	return (c >= 'a' && c <= 'z') \|\| (c >= 'A' && c <= 'Z');
12	}
13
14	// tchar
15	bool is_tchar(char c)
16	{
17	return c == '!' \|\| c == '#' \|\| c == '$' \|\| c == '%' \|\| c == '&'
18	\|\| c == '\'' \|\| c == '*' \|\| c == '+' \|\| c == '-' \|\| c == '.'
19	\|\| c == '^' \|\| c == '_' \|\| c == '`' \|\| c == '\|' \|\| c == '~'
20	\|\| is_digit(c) \|\| is_alpha(c);
21	}
22
23	int parse_method(Scanner s, String method)
24	{
25	int method_off = s->cur;
26
27	// Consume a tchar or fail
28	if (s->cur == s->len \|\| !is_tchar(s->src[s->cur]))
29	return -1; // Error! Missing token
30	s->cur++;
31
32	// Consume additional tchars following the first one
33	while (s->cur < s->len && is_tchar(s->src[s->cur]))
34	s->cur++;
35
36	// Make a substring of the method
37	*method = (String) {
38	s->src + method_off,
39	s->cur - method_off
40	};
41
42	// Consume a space or fail
43	if (s->cur == s->len \|\| s->src[s->cur] != ' ')
44	return -1;
45	s->cur++;
46
47	// All good!
48	return 0;
49	}
50

Parsing the Request Target

Now that we parsed the method, let's go back to the request-line rule:

request-line = method SP request-target SP HTTP-version

The next rule is request-target. I happen to know that this one is quite complex. This rule alone takes more code than the rest of the parser, so we can take a shortcut and say that the request target is anything that goes from the first space to the second one preceding the version. This isn't ideal but most server do it. Note that it's always safer to parse strings by explicitly allowing a set of characters, which is not what we are doing here as we are letting though anything that is not a space.

With this simplification parsing the target should be quite straight-forward

1	int parse_target(Scanner s, String target)
2	{
3	int off = s->cur;
4
5	while (s->cur < s->len && s->src[s->cur] != ' ')
6	s->cur++;
7
8	*target = (String) {
9	s->src + off,
10	s->cur - off,
11	};
12
13	if (s->cur == s->len)
14	return -1;
15	s->cur++;
16
17	return 0;
18	}
19

Parsing the Version

Following the target comes the HTTP-version rule:

HTTP-name = %x48.54.54.50 ; HTTP
HTTP-version = HTTP-name "/" DIGIT "." DIGIT

The %x notation just means a sequence of character specified by their hex representation. If you look the numbers in the ASCII table, you will find that they simpli represent the string "HTTP". I think they used this notation to express that the string must be uppercase. Following the "HTTP" string comes the "/" character and then a major version number and a minor version number separated by a dot. Since this is an HTTP 1.1 parser we only allow versions "1.1" and "1.0", which means the rule might as well be this for us

HTTP-version = "HTTP/1.0" / "HTTP/1.1"

For this, we can rely on our super useful consume_str function defined in the scanner pattern post

1	bool consume_str(Scanner *s, String x)
2	{
3	if (x.len == 0)
4	return false;
5
6	if (x.len > s.len - s.cur)
7	return false;
8
9	for (int i = 0; i < x.len; i++)
10	if (s->src[s->cur+i] != x.ptr[i])
11	return false;
12
13	s->cur += x.len;
14	return true;
15	}
16
17	int parse_version(Scanner s, int minor)
18	{
19	if (consume_str(s, S("HTTP/1.0\r\n"))) {
20	*minor = 0;
21	return 0;
22	}
23
24	if (consume_str(s, S("HTTP/1.1\r\n"))) {
25	*minor = 1;
26	return 0;
27	}
28
29	return -1;
30	}
31

Since the major version is always 1, we just need to return the minor version. Also as we did with the method and target parsing function, we make the version function also consume the CRLF separator that follows it.

And this completes are parsing of the request line!

Parsing Headers

Next come the field-line rule

HTTP-message = start-line CRLF
               *( field-line CRLF )
               CRLF
               [ message-body ]

This rule can occur zero or more times, which means we'll need to write a function to parse a single field-line and then call it in a loop.

A single field-line is defined as this:

field-line = field-name ":" OWS field-value OWS

We have a name and a value separated by a semicolon. The value is padded by spaces (string search OWS in the document). The name and value rules are defined in RFC 9110

field-name = <field-name, see [HTTP], Section 5.1>
field-value = <field-value, see [HTTP], Section 5.5>

Following the references we get to these definitions

field-name = token

field-value    = *field-content
field-content  = field-vchar
                 [ 1*( SP / HTAB / field-vchar ) field-vchar ]
field-vchar    = VCHAR / obs-text
obs-text       = %x80-FF

The name rule is simple enough, while the value rule is quite verbose. Fortunately, that's just a way of saying that the header value can't contain leading and trailing spaces. Other than that it can contain spaces, tabs, any visible ASCII caracters (referred to as VCHARs) or characters with values 0x80 to 0xFF that correspond to UTF-8 character bytes. Unfortunately there is a lot of weirdness with UTF-8 so I prefer to only allow ASCII.

1	typedef struct {
2	String name;
3	String value;
4	} Header;
5
6	bool is_vchar(char c)
7	{
8	return c >= ' ' && c <= '~';
9	}
10
11	int parse_header(Scanner s, Header header)
12	{
13	// Parse the name
14
15	int name_off = s->cur;
16	if (s->cur == s->len \|\| !is_tchar(s->src[s->cur]))
17	return -1;
18	s->cur++;
19
20	while (s->cur < s->len && is_tchar(s->src[s->cur]))
21	s->cur++;
22	header->name = (String) { s->src + name_off, s->cur - name_off };
23
24	// Consume the separator
25	if (s->cur == s->len \|\| s->src[s->cur] != ':')
26	return -1;
27	s->cur++;
28
29	// Consume whitespace preceding the value
30	while (s->cur < s->len && (s->src[s->cur] == ' ' \|\| s->src[s->cur] == '\t'))
31	s->cur++;
32
33	// Parse the value
34
35	int value_off = s->cur;
36
37	// Consume all VCHARs and spaces
38	while (s->cur < s->len && (is_vchar(s->src[s->cur]) \|\| s->src[s->cur] == ' ' \|\| s->src[s->cur] == '\t'))
39	s->cur++;
40
41	// If the body ended with some spaces, remove them. Note how this loop
42	// doesn't have bound checks. We can do this because we know the header
43	// contains at least one tchar
44	while (s->src[s->cur] == ' ' \|\| s->src[s->cur] == '\t')
45	s->cur--;
46
47	// Make a slice for the value
48	header->value = (String) { s->src + value_off, s->cur - value_off };
49
50	// Consume any spaces that follow the value
51	while (s->cur < s->len && (s->src[s->cur] == ' ' \|\| s->src[s->cur] == '\t'))
52	s->cur++;
53
54	// Consume the CRLF following the header field
55
56	if (1 >= s->len - s->cur
57	\|\| s->src[s->cur+0] != '\r'
58	\|\| s->src[s->cur+1] != '\n')
59	return -1;
60	s->cur += 2;
61
62	return 0;
63	}
64

As we were saying earlier, the header rule may apply multiple times until the final CRLF that precedes the body is found. Let's define a parsing function for the entire header list:

1	int parse_header_list(Scanner s, Header headers, int max_headers)
2	{
3	int num_headers = 0;
4	while (!consume_str(s, S("\r\n"))) {
5
6	if (num_headers == max_headers)
7	return -1;
8
9	int ret = parse_header(s, &headers[num_headers]);
10	if (ret < 0)
11	return -1;
12
13	num_headers++;
14	}
15
16	return num_headers;
17	}
18

Parsing the Body

The last thing to do is parse the message-body rule, which is defined in section 6 of RFC 9112

message-body = *OCTET

by octet the RFC means a generic byte, which means the body is expressed as a generic stream of data. The length of the data depends on the header values. If the request contains the Content-Length header, then the body is a single chunk of bytes. If the request contains the Transfer-Encoding header, then the body is actually a list of chunks each with its own header that needs parsing. This alone, as the request target parsing does, warrants its own article! So for now, you'll need to try implement body parsing on your own :)

Putting Everything Together

And here it is, the complete parser with a basic test of our initial input!

1	#include <stdio.h>
2	#include <stdbool.h>
3
4	#define MAX_HEADERS 128
5
6	#define S(X) (String) { (X), (int) sizeof(X)-1 }
7	#define UNPACK(X) (X).len, (X).ptr
8	#define COUNT(X) (sizeof(X) / sizeof((X)[0]))
9
10	typedef struct {
11	char *src;
12	int len;
13	int cur;
14	} Scanner;
15
16	typedef struct {
17	char *ptr;
18	int len;
19	} String;
20
21	typedef struct {
22	String name;
23	String value;
24	} Header;
25
26	typedef struct {
27	String method;
28	String target;
29	int minor;
30	int num_headers;
31	Header headers[MAX_HEADERS];
32	} Request;
33
34	bool is_digit(char c)
35	{
36	return c >= '0' && c <= '9';
37	}
38
39	bool is_alpha(char c)
40	{
41	return (c >= 'a' && c <= 'z') \|\| (c >= 'A' && c <= 'Z');
42	}
43
44	bool is_tchar(char c)
45	{
46	return c == '!' \|\| c == '#' \|\| c == '$' \|\| c == '%' \|\| c == '&'
47	\|\| c == '\'' \|\| c == '*' \|\| c == '+' \|\| c == '-' \|\| c == '.'
48	\|\| c == '^' \|\| c == '_' \|\| c == '`' \|\| c == '\|' \|\| c == '~'
49	\|\| is_digit(c) \|\| is_alpha(c);
50	}
51
52	bool consume_str(Scanner *s, String x)
53	{
54	if (x.len == 0)
55	return false;
56
57	if (x.len > s->len - s->cur)
58	return false;
59
60	for (int i = 0; i < x.len; i++)
61	if (s->src[s->cur+i] != x.ptr[i])
62	return false;
63
64	s->cur += x.len;
65	return true;
66	}
67
68	int parse_method(Scanner s, String method)
69	{
70	int method_off = s->cur;
71
72	// Consume a tchar or fail
73	if (s->cur == s->len \|\| !is_tchar(s->src[s->cur]))
74	return -1; // Error! Missing token
75	s->cur++;
76
77	// Consume additional tchars following the first one
78	while (s->cur < s->len && is_tchar(s->src[s->cur]))
79	s->cur++;
80
81	// Make a substring of the method
82	*method = (String) {
83	s->src + method_off,
84	s->cur - method_off
85	};
86
87	// Consume a space or fail
88	if (s->cur == s->len \|\| s->src[s->cur] != ' ')
89	return -1;
90	s->cur++;
91
92	// All good!
93	return 0;
94	}
95
96	int parse_target(Scanner s, String target)
97	{
98	int off = s->cur;
99
100	while (s->cur < s->len && s->src[s->cur] != ' ')
101	s->cur++;
102
103	*target = (String) {
104	s->src + off,
105	s->cur - off,
106	};
107
108	if (s->cur == s->len)
109	return -1;
110	s->cur++;
111
112	return 0;
113	}
114
115	int parse_version(Scanner s, int minor)
116	{
117	if (consume_str(s, S("HTTP/1.0\r\n"))) {
118	*minor = 0;
119	return 0;
120	}
121
122	if (consume_str(s, S("HTTP/1.1\r\n"))) {
123	*minor = 1;
124	return 0;
125	}
126
127	return -1;
128	}
129
130	bool is_vchar(char c)
131	{
132	return c >= ' ' && c <= '~';
133	}
134
135	int parse_header(Scanner s, Header header)
136	{
137	// Parse the name
138
139	int name_off = s->cur;
140	if (s->cur == s->len \|\| !is_tchar(s->src[s->cur]))
141	return -1;
142	s->cur++;
143
144	while (s->cur < s->len && is_tchar(s->src[s->cur]))
145	s->cur++;
146	header->name = (String) { s->src + name_off, s->cur - name_off };
147
148	// Consume the separator
149	if (s->cur == s->len \|\| s->src[s->cur] != ':')
150	return -1;
151	s->cur++;
152
153	// Consume whitespace preceding the value
154	while (s->cur < s->len && (s->src[s->cur] == ' ' \|\| s->src[s->cur] == '\t'))
155	s->cur++;
156
157	// Parse the value
158
159	int value_off = s->cur;
160
161	// Consume all VCHARs and spaces
162	while (s->cur < s->len && (is_vchar(s->src[s->cur]) \|\| s->src[s->cur] == ' ' \|\| s->src[s->cur] == '\t'))
163	s->cur++;
164
165	// If the body ended with some spaces, remove them. Note how this loop
166	// doesn't have bound checks. We can do this because we know the header
167	// contains at least one tchar
168	while (s->src[s->cur] == ' ' \|\| s->src[s->cur] == '\t')
169	s->cur--;
170
171	// Make a slice for the value
172	header->value = (String) { s->src + value_off, s->cur - value_off };
173
174	// Consume any spaces that follow the value
175	while (s->cur < s->len && (s->src[s->cur] == ' ' \|\| s->src[s->cur] == '\t'))
176	s->cur++;
177
178	// Consume the CRLF following the header field
179
180	if (1 >= s->len - s->cur
181	\|\| s->src[s->cur+0] != '\r'
182	\|\| s->src[s->cur+1] != '\n')
183	return -1;
184	s->cur += 2;
185
186	return 0;
187	}
188
189	int parse_header_list(Scanner s, Header headers, int max_headers)
190	{
191	int num_headers = 0;
192	while (!consume_str(s, S("\r\n"))) {
193
194	if (num_headers == max_headers)
195	return -1;
196
197	int ret = parse_header(s, &headers[num_headers]);
198	if (ret < 0)
199	return -1;
200
201	num_headers++;
202	}
203
204	return num_headers;
205	}
206
207	int parse_request(String src, Request *req)
208	{
209	Scanner s = { src.ptr, src.len, 0 };
210
211	if (parse_method(&s, &req->method) < 0)
212	return -1;
213
214	if (parse_target(&s, &req->target) < 0)
215	return -1;
216
217	if (parse_version(&s, &req->minor) < 0)
218	return -1;
219
220	int ret = parse_header_list(&s, req->headers, (int) COUNT(req->headers));
221	if (ret < 0)
222	return -1;
223	req->num_headers = ret;
224
225	return 0;
226	}
227
228	int main(void)
229	{
230	String str = S("GET /index.html HTTP/1.1\r\nHost: coz.is\r\nUser-Agent: curl/7.81.0\r\n\r\n");
231
232	Request req;
233	if (parse_request(str, &req) < 0) {
234	printf("Parsing failed\n");
235	return -1;
236	}
237
238	printf("method: %.*s\n", UNPACK(req.method));
239	printf("target: %.*s\n", UNPACK(req.target));
240	printf("version: HTTP/1.%d\n", req.minor);
241	printf("headers:\n");
242	for (int i = 0; i < req.num_headers; i++)
243	printf("name: %.s, value: %.s\n", UNPACK(req.headers[i].name), UNPACK(req.headers[i].value));
244
245	return 0;
246	}
247