Cozis

Writing an HTTP compliant parser in C

Recently I published a library called cHTTP on GitHub, which is essentially my private networking code polished and made reusable. Yesterday I decided I would go through the HTTP request parser and make sure it is compliant to spec, so I took the opportunity to write about it. In this post I will go throgh the specification to turn it into an HTTP request parser using C.

This article will focus more on following the specification than the actual parsing code. If you're not familiar with string parsing in C, consider reading my post about the scanner pattern where I document my way of doing things. Also, if for some reason you're not familiar with HTTP protocol or it's syntax, you should go read this post.

How to Read the Spec

The HTTP protocol is standardized by documents called RFCs. Often new documents that obsolete old ones are published. For instance the first RFC for HTTP 1.1 was RFC 2616, but the latest specification at the time of writing is RFC 9112.

If we take a look at RFC 9112, we'll find that in section 2 the syntax of an "HTTP message" is defined:

HTTP-message = start-line CRLF
               *( field-line CRLF )
               CRLF
               [ message-body ]

It's expressed using a notation called Augmented Backus-Naur Form (ABNF) and this is essentially what we'll need to turn into code. I'll go into its details as we write code, but if you want to learn about it in more detail, read section 1.2.

The basic idea is that every "token" (start-line, CRLF, field-line, message-body) represents a pattern of characters called "rules". Token representing rules can be grouped into higher level rules, such as this one. Tokens with uppercase names usually are primitive tokens that represent simple characters. The special characters such as parenthesis or the asterisk specify how these tokens combine to make the higher level rule. For instance, the asterisk means that the tokens following it can repeat zero or more times and the square brackets mean that the tokens inside it are optional.

So our job will be to read the document, unpack these rules and turn them into code. Now let's dive in!

Parsing the Request Line

Let's make sense of that HTTP-message rule by comparing it to a practical example of HTTP request

HTTP-message = start-line CRLF
               *( field-line CRLF )
               CRLF
               [ message-body ]

GET /index.html HTTP/1.1\r\n
Host: coz.is\r\n
User-Agent: curl/7.81.0\r\n
\r\n

A quick string search of CRLF shows that it is defined in section 1.2 as a carriage return and line feed \r\n. So intuitively we can observe that:

start-line   => GET /index.html HTTP/1.1
field-line   => Host: coz.is
field-line   => User-Agent: curl/7.81.0
message-body => (empty)

The start-line matches the line where method, path, and version are specified, the field-line matches the headers, and the message-body matches the body (which is not present here).

Let's unpack the rule further by looking at how start-line is defined (as always, we do a string search in the document):

start-line = request-line / status-line

request-line = method SP request-target SP HTTP-version

The slash / means only one of the two rules apply. Since we are parsing HTTP requests, we are interested in the request-line rule. Just like the CRLF token, the SP token is defined in section 1.2 as a single space.

method         => GET
request-target => /index.html
HTTP-version   => HTTP/1.1

Parsing the Method

If we look for the method rule, we find that it's definition references some other document called [HTTP]

method = token

token = <token, see [HTTP], Section 5.6.2>

If we follow the reference (look for [HTTP] in the references section) we find that the rule is defined in RFC 9110

token = 1*tchar

tchar = "!" / "#" / "$" / "%" / "&" / "'" / "*"
      / "+" / "-" / "." / "^" / "_" / "`" / "|" / "~"
      / DIGIT / ALPHA
      ; any VCHAR, except delimiters

Now we have a good idea of how to parse the method. We want to parse one or more tchars followed by a space! That's easy enough:

1
2// DIGIT
3bool is_digit(char c)
4{
5 return c >= '0' && c <= '9';
6}
7
8// ALPHA
9bool is_alpha(char c)
10{
11 return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
12}
13
14// tchar
15bool is_tchar(char c)
16{
17 return c == '!' || c == '#' || c == '$' || c == '%' || c == '&'
18 || c == '\'' || c == '*' || c == '+' || c == '-' || c == '.'
19 || c == '^' || c == '_' || c == '`' || c == '|' || c == '~'
20 || is_digit(c) || is_alpha(c);
21}
22
23int parse_method(Scanner *s, String *method)
24{
25 int method_off = s->cur;
26
27 // Consume a tchar or fail
28 if (s->cur == s->len || !is_tchar(s->src[s->cur]))
29 return -1; // Error! Missing token
30 s->cur++;
31
32 // Consume additional tchars following the first one
33 while (s->cur < s->len && is_tchar(s->src[s->cur]))
34 s->cur++;
35
36 // Make a substring of the method
37 *method = (String) {
38 s->src + method_off,
39 s->cur - method_off
40 };
41
42 // Consume a space or fail
43 if (s->cur == s->len || s->src[s->cur] != ' ')
44 return -1;
45 s->cur++;
46
47 // All good!
48 return 0;
49}
50

Parsing the Request Target

Now that we parsed the method, let's go back to the request-line rule:

request-line = method SP request-target SP HTTP-version

The next rule is request-target. I happen to know that this one is quite complex. This rule alone takes more code than the rest of the parser, so we can take a shortcut and say that the request target is anything that goes from the first space to the second one preceding the version. This isn't ideal but most server do it. Note that it's always safer to parse strings by explicitly allowing a set of characters, which is not what we are doing here as we are letting though anything that is not a space.

With this simplification parsing the target should be quite straight-forward

1int parse_target(Scanner *s, String *target)
2{
3 int off = s->cur;
4
5 while (s->cur < s->len && s->src[s->cur] != ' ')
6 s->cur++;
7
8 *target = (String) {
9 s->src + off,
10 s->cur - off,
11 };
12
13 if (s->cur == s->len)
14 return -1;
15 s->cur++;
16
17 return 0;
18}
19

Parsing the Version

Following the target comes the HTTP-version rule:

HTTP-name = %x48.54.54.50 ; HTTP
HTTP-version = HTTP-name "/" DIGIT "." DIGIT

The %x notation just means a sequence of character specified by their hex representation. If you look the numbers in the ASCII table, you will find that they simpli represent the string "HTTP". I think they used this notation to express that the string must be uppercase. Following the "HTTP" string comes the "/" character and then a major version number and a minor version number separated by a dot. Since this is an HTTP 1.1 parser we only allow versions "1.1" and "1.0", which means the rule might as well be this for us

HTTP-version = "HTTP/1.0" / "HTTP/1.1"

For this, we can rely on our super useful consume_str function defined in the scanner pattern post

1bool consume_str(Scanner *s, String x)
2{
3 if (x.len == 0)
4 return false;
5
6 if (x.len > s.len - s.cur)
7 return false;
8
9 for (int i = 0; i < x.len; i++)
10 if (s->src[s->cur+i] != x.ptr[i])
11 return false;
12
13 s->cur += x.len;
14 return true;
15}
16
17int parse_version(Scanner *s, int *minor)
18{
19 if (consume_str(s, S("HTTP/1.0\r\n"))) {
20 *minor = 0;
21 return 0;
22 }
23
24 if (consume_str(s, S("HTTP/1.1\r\n"))) {
25 *minor = 1;
26 return 0;
27 }
28
29 return -1;
30}
31

Since the major version is always 1, we just need to return the minor version. Also as we did with the method and target parsing function, we make the version function also consume the CRLF separator that follows it.

And this completes are parsing of the request line!

Parsing Headers

Next come the field-line rule

HTTP-message = start-line CRLF
               *( field-line CRLF )
               CRLF
               [ message-body ]

This rule can occur zero or more times, which means we'll need to write a function to parse a single field-line and then call it in a loop.

A single field-line is defined as this:

field-line = field-name ":" OWS field-value OWS

We have a name and a value separated by a semicolon. The value is padded by spaces (string search OWS in the document). The name and value rules are defined in RFC 9110

field-name = <field-name, see [HTTP], Section 5.1>
field-value = <field-value, see [HTTP], Section 5.5>

Following the references we get to these definitions

field-name = token

field-value    = *field-content
field-content  = field-vchar
                 [ 1*( SP / HTAB / field-vchar ) field-vchar ]
field-vchar    = VCHAR / obs-text
obs-text       = %x80-FF

The name rule is simple enough, while the value rule is quite verbose. Fortunately, that's just a way of saying that the header value can't contain leading and trailing spaces. Other than that it can contain spaces, tabs, any visible ASCII caracters (referred to as VCHARs) or characters with values 0x80 to 0xFF that correspond to UTF-8 character bytes. Unfortunately there is a lot of weirdness with UTF-8 so I prefer to only allow ASCII.

1typedef struct {
2 String name;
3 String value;
4} Header;
5
6bool is_vchar(char c)
7{
8 return c >= ' ' && c <= '~';
9}
10
11int parse_header(Scanner *s, Header *header)
12{
13 // Parse the name
14
15 int name_off = s->cur;
16 if (s->cur == s->len || !is_tchar(s->src[s->cur]))
17 return -1;
18 s->cur++;
19
20 while (s->cur < s->len && is_tchar(s->src[s->cur]))
21 s->cur++;
22 header->name = (String) { s->src + name_off, s->cur - name_off };
23
24 // Consume the separator
25 if (s->cur == s->len || s->src[s->cur] != ':')
26 return -1;
27 s->cur++;
28
29 // Consume whitespace preceding the value
30 while (s->cur < s->len && (s->src[s->cur] == ' ' || s->src[s->cur] == '\t'))
31 s->cur++;
32
33 // Parse the value
34
35 int value_off = s->cur;
36
37 // Consume all VCHARs and spaces
38 while (s->cur < s->len && (is_vchar(s->src[s->cur]) || s->src[s->cur] == ' ' || s->src[s->cur] == '\t'))
39 s->cur++;
40
41 // If the body ended with some spaces, remove them. Note how this loop
42 // doesn't have bound checks. We can do this because we know the header
43 // contains at least one tchar
44 while (s->src[s->cur] == ' ' || s->src[s->cur] == '\t')
45 s->cur--;
46
47 // Make a slice for the value
48 header->value = (String) { s->src + value_off, s->cur - value_off };
49
50 // Consume any spaces that follow the value
51 while (s->cur < s->len && (s->src[s->cur] == ' ' || s->src[s->cur] == '\t'))
52 s->cur++;
53
54 // Consume the CRLF following the header field
55
56 if (1 >= s->len - s->cur
57 || s->src[s->cur+0] != '\r'
58 || s->src[s->cur+1] != '\n')
59 return -1;
60 s->cur += 2;
61
62 return 0;
63}
64

As we were saying earlier, the header rule may apply multiple times until the final CRLF that precedes the body is found. Let's define a parsing function for the entire header list:

1int parse_header_list(Scanner *s, Header *headers, int max_headers)
2{
3 int num_headers = 0;
4 while (!consume_str(s, S("\r\n"))) {
5
6 if (num_headers == max_headers)
7 return -1;
8
9 int ret = parse_header(s, &headers[num_headers]);
10 if (ret < 0)
11 return -1;
12
13 num_headers++;
14 }
15
16 return num_headers;
17}
18

Parsing the Body

The last thing to do is parse the message-body rule, which is defined in section 6 of RFC 9112

message-body = *OCTET

by octet the RFC means a generic byte, which means the body is expressed as a generic stream of data. The length of the data depends on the header values. If the request contains the Content-Length header, then the body is a single chunk of bytes. If the request contains the Transfer-Encoding header, then the body is actually a list of chunks each with its own header that needs parsing. This alone, as the request target parsing does, warrants its own article! So for now, you'll need to try implement body parsing on your own :)

Putting Everything Together

And here it is, the complete parser with a basic test of our initial input!

1#include <stdio.h>
2#include <stdbool.h>
3
4#define MAX_HEADERS 128
5
6#define S(X) (String) { (X), (int) sizeof(X)-1 }
7#define UNPACK(X) (X).len, (X).ptr
8#define COUNT(X) (sizeof(X) / sizeof((X)[0]))
9
10typedef struct {
11 char *src;
12 int len;
13 int cur;
14} Scanner;
15
16typedef struct {
17 char *ptr;
18 int len;
19} String;
20
21typedef struct {
22 String name;
23 String value;
24} Header;
25
26typedef struct {
27 String method;
28 String target;
29 int minor;
30 int num_headers;
31 Header headers[MAX_HEADERS];
32} Request;
33
34bool is_digit(char c)
35{
36 return c >= '0' && c <= '9';
37}
38
39bool is_alpha(char c)
40{
41 return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
42}
43
44bool is_tchar(char c)
45{
46 return c == '!' || c == '#' || c == '$' || c == '%' || c == '&'
47 || c == '\'' || c == '*' || c == '+' || c == '-' || c == '.'
48 || c == '^' || c == '_' || c == '`' || c == '|' || c == '~'
49 || is_digit(c) || is_alpha(c);
50}
51
52bool consume_str(Scanner *s, String x)
53{
54 if (x.len == 0)
55 return false;
56
57 if (x.len > s->len - s->cur)
58 return false;
59
60 for (int i = 0; i < x.len; i++)
61 if (s->src[s->cur+i] != x.ptr[i])
62 return false;
63
64 s->cur += x.len;
65 return true;
66}
67
68int parse_method(Scanner *s, String *method)
69{
70 int method_off = s->cur;
71
72 // Consume a tchar or fail
73 if (s->cur == s->len || !is_tchar(s->src[s->cur]))
74 return -1; // Error! Missing token
75 s->cur++;
76
77 // Consume additional tchars following the first one
78 while (s->cur < s->len && is_tchar(s->src[s->cur]))
79 s->cur++;
80
81 // Make a substring of the method
82 *method = (String) {
83 s->src + method_off,
84 s->cur - method_off
85 };
86
87 // Consume a space or fail
88 if (s->cur == s->len || s->src[s->cur] != ' ')
89 return -1;
90 s->cur++;
91
92 // All good!
93 return 0;
94}
95
96int parse_target(Scanner *s, String *target)
97{
98 int off = s->cur;
99
100 while (s->cur < s->len && s->src[s->cur] != ' ')
101 s->cur++;
102
103 *target = (String) {
104 s->src + off,
105 s->cur - off,
106 };
107
108 if (s->cur == s->len)
109 return -1;
110 s->cur++;
111
112 return 0;
113}
114
115int parse_version(Scanner *s, int *minor)
116{
117 if (consume_str(s, S("HTTP/1.0\r\n"))) {
118 *minor = 0;
119 return 0;
120 }
121
122 if (consume_str(s, S("HTTP/1.1\r\n"))) {
123 *minor = 1;
124 return 0;
125 }
126
127 return -1;
128}
129
130bool is_vchar(char c)
131{
132 return c >= ' ' && c <= '~';
133}
134
135int parse_header(Scanner *s, Header *header)
136{
137 // Parse the name
138
139 int name_off = s->cur;
140 if (s->cur == s->len || !is_tchar(s->src[s->cur]))
141 return -1;
142 s->cur++;
143
144 while (s->cur < s->len && is_tchar(s->src[s->cur]))
145 s->cur++;
146 header->name = (String) { s->src + name_off, s->cur - name_off };
147
148 // Consume the separator
149 if (s->cur == s->len || s->src[s->cur] != ':')
150 return -1;
151 s->cur++;
152
153 // Consume whitespace preceding the value
154 while (s->cur < s->len && (s->src[s->cur] == ' ' || s->src[s->cur] == '\t'))
155 s->cur++;
156
157 // Parse the value
158
159 int value_off = s->cur;
160
161 // Consume all VCHARs and spaces
162 while (s->cur < s->len && (is_vchar(s->src[s->cur]) || s->src[s->cur] == ' ' || s->src[s->cur] == '\t'))
163 s->cur++;
164
165 // If the body ended with some spaces, remove them. Note how this loop
166 // doesn't have bound checks. We can do this because we know the header
167 // contains at least one tchar
168 while (s->src[s->cur] == ' ' || s->src[s->cur] == '\t')
169 s->cur--;
170
171 // Make a slice for the value
172 header->value = (String) { s->src + value_off, s->cur - value_off };
173
174 // Consume any spaces that follow the value
175 while (s->cur < s->len && (s->src[s->cur] == ' ' || s->src[s->cur] == '\t'))
176 s->cur++;
177
178 // Consume the CRLF following the header field
179
180 if (1 >= s->len - s->cur
181 || s->src[s->cur+0] != '\r'
182 || s->src[s->cur+1] != '\n')
183 return -1;
184 s->cur += 2;
185
186 return 0;
187}
188
189int parse_header_list(Scanner *s, Header *headers, int max_headers)
190{
191 int num_headers = 0;
192 while (!consume_str(s, S("\r\n"))) {
193
194 if (num_headers == max_headers)
195 return -1;
196
197 int ret = parse_header(s, &headers[num_headers]);
198 if (ret < 0)
199 return -1;
200
201 num_headers++;
202 }
203
204 return num_headers;
205}
206
207int parse_request(String src, Request *req)
208{
209 Scanner s = { src.ptr, src.len, 0 };
210
211 if (parse_method(&s, &req->method) < 0)
212 return -1;
213
214 if (parse_target(&s, &req->target) < 0)
215 return -1;
216
217 if (parse_version(&s, &req->minor) < 0)
218 return -1;
219
220 int ret = parse_header_list(&s, req->headers, (int) COUNT(req->headers));
221 if (ret < 0)
222 return -1;
223 req->num_headers = ret;
224
225 return 0;
226}
227
228int main(void)
229{
230 String str = S("GET /index.html HTTP/1.1\r\nHost: coz.is\r\nUser-Agent: curl/7.81.0\r\n\r\n");
231
232 Request req;
233 if (parse_request(str, &req) < 0) {
234 printf("Parsing failed\n");
235 return -1;
236 }
237
238 printf("method: %.*s\n", UNPACK(req.method));
239 printf("target: %.*s\n", UNPACK(req.target));
240 printf("version: HTTP/1.%d\n", req.minor);
241 printf("headers:\n");
242 for (int i = 0; i < req.num_headers; i++)
243 printf("name: %.*s, value: %.*s\n", UNPACK(req.headers[i].name), UNPACK(req.headers[i].value));
244
245 return 0;
246}
247