Writing an HTTP compliant parser in C
Recently I published a library called cHTTP on GitHub, which is essentially my private networking code polished and made reusable. Yesterday I decided I would go through the HTTP request parser and make sure it is compliant to spec, so I took the opportunity to write about it. In this post I will go throgh the specification to turn it into an HTTP request parser using C.
This article will focus more on following the specification than the actual parsing code. If you're not familiar with string parsing in C, consider reading my post about the scanner pattern where I document my way of doing things. Also, if for some reason you're not familiar with HTTP protocol or it's syntax, you should go read this post.
How to Read the Spec
The HTTP protocol is standardized by documents called RFCs. Often new documents that obsolete old ones are published. For instance the first RFC for HTTP 1.1 was RFC 2616, but the latest specification at the time of writing is RFC 9112.
If we take a look at RFC 9112, we'll find that in section 2 the syntax of an "HTTP message" is defined:
HTTP-message = start-line CRLF
*( field-line CRLF )
CRLF
[ message-body ]
It's expressed using a notation called Augmented Backus-Naur Form (ABNF) and this is essentially what we'll need to turn into code. I'll go into its details as we write code, but if you want to learn about it in more detail, read section 1.2.
The basic idea is that every "token" (start-line, CRLF, field-line, message-body) represents a pattern of characters called "rules". Token representing rules can be grouped into higher level rules, such as this one. Tokens with uppercase names usually are primitive tokens that represent simple characters. The special characters such as parenthesis or the asterisk specify how these tokens combine to make the higher level rule. For instance, the asterisk means that the tokens following it can repeat zero or more times and the square brackets mean that the tokens inside it are optional.
So our job will be to read the document, unpack these rules and turn them into code. Now let's dive in!
Parsing the Request Line
Let's make sense of that HTTP-message
rule by comparing it to a practical example of HTTP request
HTTP-message = start-line CRLF
*( field-line CRLF )
CRLF
[ message-body ]
GET /index.html HTTP/1.1\r\n
Host: coz.is\r\n
User-Agent: curl/7.81.0\r\n
\r\n
A quick string search of CRLF shows that it is defined in section 1.2 as a carriage return and line feed \r\n
. So intuitively we can observe that:
start-line => GET /index.html HTTP/1.1
field-line => Host: coz.is
field-line => User-Agent: curl/7.81.0
message-body => (empty)
The start-line
matches the line where method, path, and version are specified, the field-line
matches the headers, and the message-body
matches the body (which is not present here).
Let's unpack the rule further by looking at how start-line
is defined (as always, we do a string search in the document):
start-line = request-line / status-line
request-line = method SP request-target SP HTTP-version
The slash /
means only one of the two rules apply. Since we are parsing HTTP requests, we are interested in the request-line
rule. Just like the CRLF token, the SP token is defined in section 1.2 as a single space.
method => GET
request-target => /index.html
HTTP-version => HTTP/1.1
Parsing the Method
If we look for the method
rule, we find that it's definition references some other document called [HTTP]
method = token
token = <token, see [HTTP], Section 5.6.2>
If we follow the reference (look for [HTTP]
in the references section) we find that the rule is defined in RFC 9110
token = 1*tchar
tchar = "!" / "#" / "$" / "%" / "&" / "'" / "*"
/ "+" / "-" / "." / "^" / "_" / "`" / "|" / "~"
/ DIGIT / ALPHA
; any VCHAR, except delimiters
Now we have a good idea of how to parse the method. We want to parse one or more tchars followed by a space! That's easy enough:
1 | |
2 | // DIGIT |
3 | bool is_digit(char c) |
4 | { |
5 | return c >= '0' && c <= '9'; |
6 | } |
7 | |
8 | // ALPHA |
9 | bool is_alpha(char c) |
10 | { |
11 | return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'); |
12 | } |
13 | |
14 | // tchar |
15 | bool is_tchar(char c) |
16 | { |
17 | return c == '!' || c == '#' || c == '$' || c == '%' || c == '&' |
18 | || c == '\'' || c == '*' || c == '+' || c == '-' || c == '.' |
19 | || c == '^' || c == '_' || c == '`' || c == '|' || c == '~' |
20 | || is_digit(c) || is_alpha(c); |
21 | } |
22 | |
23 | int parse_method(Scanner *s, String *method) |
24 | { |
25 | int method_off = s->cur; |
26 | |
27 | // Consume a tchar or fail |
28 | if (s->cur == s->len || !is_tchar(s->src[s->cur])) |
29 | return -1; // Error! Missing token |
30 | s->cur++; |
31 | |
32 | // Consume additional tchars following the first one |
33 | while (s->cur < s->len && is_tchar(s->src[s->cur])) |
34 | s->cur++; |
35 | |
36 | // Make a substring of the method |
37 | *method = (String) { |
38 | s->src + method_off, |
39 | s->cur - method_off |
40 | }; |
41 | |
42 | // Consume a space or fail |
43 | if (s->cur == s->len || s->src[s->cur] != ' ') |
44 | return -1; |
45 | s->cur++; |
46 | |
47 | // All good! |
48 | return 0; |
49 | } |
50 |
Parsing the Request Target
Now that we parsed the method, let's go back to the request-line
rule:
request-line = method SP request-target SP HTTP-version
The next rule is request-target
. I happen to know that this one is quite complex. This rule alone takes more code than the rest of the parser, so we can take a shortcut and say that the request target is anything that goes from the first space to the second one preceding the version. This isn't ideal but most server do it. Note that it's always safer to parse strings by explicitly allowing a set of characters, which is not what we are doing here as we are letting though anything that is not a space.
With this simplification parsing the target should be quite straight-forward
1 | int parse_target(Scanner *s, String *target) |
2 | { |
3 | int off = s->cur; |
4 | |
5 | while (s->cur < s->len && s->src[s->cur] != ' ') |
6 | s->cur++; |
7 | |
8 | *target = (String) { |
9 | s->src + off, |
10 | s->cur - off, |
11 | }; |
12 | |
13 | if (s->cur == s->len) |
14 | return -1; |
15 | s->cur++; |
16 | |
17 | return 0; |
18 | } |
19 |
Parsing the Version
Following the target comes the HTTP-version rule:
HTTP-name = %x48.54.54.50 ; HTTP
HTTP-version = HTTP-name "/" DIGIT "." DIGIT
The %x
notation just means a sequence of character specified by their hex representation. If you look the numbers in the ASCII table, you will find that they simpli represent the string "HTTP". I think they used this notation to express that the string must be uppercase. Following the "HTTP" string comes the "/" character and then a major version number and a minor version number separated by a dot. Since this is an HTTP 1.1 parser we only allow versions "1.1" and "1.0", which means the rule might as well be this for us
HTTP-version = "HTTP/1.0" / "HTTP/1.1"
For this, we can rely on our super useful consume_str
function defined in the scanner pattern post
1 | bool consume_str(Scanner *s, String x) |
2 | { |
3 | if (x.len == 0) |
4 | return false; |
5 | |
6 | if (x.len > s.len - s.cur) |
7 | return false; |
8 | |
9 | for (int i = 0; i < x.len; i++) |
10 | if (s->src[s->cur+i] != x.ptr[i]) |
11 | return false; |
12 | |
13 | s->cur += x.len; |
14 | return true; |
15 | } |
16 | |
17 | int parse_version(Scanner *s, int *minor) |
18 | { |
19 | if (consume_str(s, S("HTTP/1.0\r\n"))) { |
20 | *minor = 0; |
21 | return 0; |
22 | } |
23 | |
24 | if (consume_str(s, S("HTTP/1.1\r\n"))) { |
25 | *minor = 1; |
26 | return 0; |
27 | } |
28 | |
29 | return -1; |
30 | } |
31 |
Since the major version is always 1, we just need to return the minor version. Also as we did with the method and target parsing function, we make the version function also consume the CRLF separator that follows it.
And this completes are parsing of the request line!
Parsing Headers
Next come the field-line
rule
HTTP-message = start-line CRLF
*( field-line CRLF )
CRLF
[ message-body ]
This rule can occur zero or more times, which means we'll need to write a function to parse a single field-line
and then call it in a loop.
A single field-line
is defined as this:
field-line = field-name ":" OWS field-value OWS
We have a name and a value separated by a semicolon. The value is padded by spaces (string search OWS in the document). The name and value rules are defined in RFC 9110
field-name = <field-name, see [HTTP], Section 5.1>
field-value = <field-value, see [HTTP], Section 5.5>
Following the references we get to these definitions
field-name = token
field-value = *field-content
field-content = field-vchar
[ 1*( SP / HTAB / field-vchar ) field-vchar ]
field-vchar = VCHAR / obs-text
obs-text = %x80-FF
The name rule is simple enough, while the value rule is quite verbose. Fortunately, that's just a way of saying that the header value can't contain leading and trailing spaces. Other than that it can contain spaces, tabs, any visible ASCII caracters (referred to as VCHARs) or characters with values 0x80 to 0xFF that correspond to UTF-8 character bytes. Unfortunately there is a lot of weirdness with UTF-8 so I prefer to only allow ASCII.
1 | typedef struct { |
2 | String name; |
3 | String value; |
4 | } Header; |
5 | |
6 | bool is_vchar(char c) |
7 | { |
8 | return c >= ' ' && c <= '~'; |
9 | } |
10 | |
11 | int parse_header(Scanner *s, Header *header) |
12 | { |
13 | // Parse the name |
14 | |
15 | int name_off = s->cur; |
16 | if (s->cur == s->len || !is_tchar(s->src[s->cur])) |
17 | return -1; |
18 | s->cur++; |
19 | |
20 | while (s->cur < s->len && is_tchar(s->src[s->cur])) |
21 | s->cur++; |
22 | header->name = (String) { s->src + name_off, s->cur - name_off }; |
23 | |
24 | // Consume the separator |
25 | if (s->cur == s->len || s->src[s->cur] != ':') |
26 | return -1; |
27 | s->cur++; |
28 | |
29 | // Consume whitespace preceding the value |
30 | while (s->cur < s->len && (s->src[s->cur] == ' ' || s->src[s->cur] == '\t')) |
31 | s->cur++; |
32 | |
33 | // Parse the value |
34 | |
35 | int value_off = s->cur; |
36 | |
37 | // Consume all VCHARs and spaces |
38 | while (s->cur < s->len && (is_vchar(s->src[s->cur]) || s->src[s->cur] == ' ' || s->src[s->cur] == '\t')) |
39 | s->cur++; |
40 | |
41 | // If the body ended with some spaces, remove them. Note how this loop |
42 | // doesn't have bound checks. We can do this because we know the header |
43 | // contains at least one tchar |
44 | while (s->src[s->cur] == ' ' || s->src[s->cur] == '\t') |
45 | s->cur--; |
46 | |
47 | // Make a slice for the value |
48 | header->value = (String) { s->src + value_off, s->cur - value_off }; |
49 | |
50 | // Consume any spaces that follow the value |
51 | while (s->cur < s->len && (s->src[s->cur] == ' ' || s->src[s->cur] == '\t')) |
52 | s->cur++; |
53 | |
54 | // Consume the CRLF following the header field |
55 | |
56 | if (1 >= s->len - s->cur |
57 | || s->src[s->cur+0] != '\r' |
58 | || s->src[s->cur+1] != '\n') |
59 | return -1; |
60 | s->cur += 2; |
61 | |
62 | return 0; |
63 | } |
64 |
As we were saying earlier, the header rule may apply multiple times until the final CRLF
that precedes the body is found. Let's define a parsing function for the entire header list:
1 | int parse_header_list(Scanner *s, Header *headers, int max_headers) |
2 | { |
3 | int num_headers = 0; |
4 | while (!consume_str(s, S("\r\n"))) { |
5 | |
6 | if (num_headers == max_headers) |
7 | return -1; |
8 | |
9 | int ret = parse_header(s, &headers[num_headers]); |
10 | if (ret < 0) |
11 | return -1; |
12 | |
13 | num_headers++; |
14 | } |
15 | |
16 | return num_headers; |
17 | } |
18 |
Parsing the Body
The last thing to do is parse the message-body
rule, which is defined in section 6 of RFC 9112
message-body = *OCTET
by octet the RFC means a generic byte, which means the body is expressed as a generic stream of data. The length of the data depends on the header values. If the request contains the Content-Length
header, then the body is a single chunk of bytes. If the request contains the Transfer-Encoding
header, then the body is actually a list of chunks each with its own header that needs parsing. This alone, as the request target parsing does, warrants its own article! So for now, you'll need to try implement body parsing on your own :)
Putting Everything Together
And here it is, the complete parser with a basic test of our initial input!
1 | #include <stdio.h> |
2 | #include <stdbool.h> |
3 | |
4 | #define MAX_HEADERS 128 |
5 | |
6 | #define S(X) (String) { (X), (int) sizeof(X)-1 } |
7 | #define UNPACK(X) (X).len, (X).ptr |
8 | #define COUNT(X) (sizeof(X) / sizeof((X)[0])) |
9 | |
10 | typedef struct { |
11 | char *src; |
12 | int len; |
13 | int cur; |
14 | } Scanner; |
15 | |
16 | typedef struct { |
17 | char *ptr; |
18 | int len; |
19 | } String; |
20 | |
21 | typedef struct { |
22 | String name; |
23 | String value; |
24 | } Header; |
25 | |
26 | typedef struct { |
27 | String method; |
28 | String target; |
29 | int minor; |
30 | int num_headers; |
31 | Header headers[MAX_HEADERS]; |
32 | } Request; |
33 | |
34 | bool is_digit(char c) |
35 | { |
36 | return c >= '0' && c <= '9'; |
37 | } |
38 | |
39 | bool is_alpha(char c) |
40 | { |
41 | return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'); |
42 | } |
43 | |
44 | bool is_tchar(char c) |
45 | { |
46 | return c == '!' || c == '#' || c == '$' || c == '%' || c == '&' |
47 | || c == '\'' || c == '*' || c == '+' || c == '-' || c == '.' |
48 | || c == '^' || c == '_' || c == '`' || c == '|' || c == '~' |
49 | || is_digit(c) || is_alpha(c); |
50 | } |
51 | |
52 | bool consume_str(Scanner *s, String x) |
53 | { |
54 | if (x.len == 0) |
55 | return false; |
56 | |
57 | if (x.len > s->len - s->cur) |
58 | return false; |
59 | |
60 | for (int i = 0; i < x.len; i++) |
61 | if (s->src[s->cur+i] != x.ptr[i]) |
62 | return false; |
63 | |
64 | s->cur += x.len; |
65 | return true; |
66 | } |
67 | |
68 | int parse_method(Scanner *s, String *method) |
69 | { |
70 | int method_off = s->cur; |
71 | |
72 | // Consume a tchar or fail |
73 | if (s->cur == s->len || !is_tchar(s->src[s->cur])) |
74 | return -1; // Error! Missing token |
75 | s->cur++; |
76 | |
77 | // Consume additional tchars following the first one |
78 | while (s->cur < s->len && is_tchar(s->src[s->cur])) |
79 | s->cur++; |
80 | |
81 | // Make a substring of the method |
82 | *method = (String) { |
83 | s->src + method_off, |
84 | s->cur - method_off |
85 | }; |
86 | |
87 | // Consume a space or fail |
88 | if (s->cur == s->len || s->src[s->cur] != ' ') |
89 | return -1; |
90 | s->cur++; |
91 | |
92 | // All good! |
93 | return 0; |
94 | } |
95 | |
96 | int parse_target(Scanner *s, String *target) |
97 | { |
98 | int off = s->cur; |
99 | |
100 | while (s->cur < s->len && s->src[s->cur] != ' ') |
101 | s->cur++; |
102 | |
103 | *target = (String) { |
104 | s->src + off, |
105 | s->cur - off, |
106 | }; |
107 | |
108 | if (s->cur == s->len) |
109 | return -1; |
110 | s->cur++; |
111 | |
112 | return 0; |
113 | } |
114 | |
115 | int parse_version(Scanner *s, int *minor) |
116 | { |
117 | if (consume_str(s, S("HTTP/1.0\r\n"))) { |
118 | *minor = 0; |
119 | return 0; |
120 | } |
121 | |
122 | if (consume_str(s, S("HTTP/1.1\r\n"))) { |
123 | *minor = 1; |
124 | return 0; |
125 | } |
126 | |
127 | return -1; |
128 | } |
129 | |
130 | bool is_vchar(char c) |
131 | { |
132 | return c >= ' ' && c <= '~'; |
133 | } |
134 | |
135 | int parse_header(Scanner *s, Header *header) |
136 | { |
137 | // Parse the name |
138 | |
139 | int name_off = s->cur; |
140 | if (s->cur == s->len || !is_tchar(s->src[s->cur])) |
141 | return -1; |
142 | s->cur++; |
143 | |
144 | while (s->cur < s->len && is_tchar(s->src[s->cur])) |
145 | s->cur++; |
146 | header->name = (String) { s->src + name_off, s->cur - name_off }; |
147 | |
148 | // Consume the separator |
149 | if (s->cur == s->len || s->src[s->cur] != ':') |
150 | return -1; |
151 | s->cur++; |
152 | |
153 | // Consume whitespace preceding the value |
154 | while (s->cur < s->len && (s->src[s->cur] == ' ' || s->src[s->cur] == '\t')) |
155 | s->cur++; |
156 | |
157 | // Parse the value |
158 | |
159 | int value_off = s->cur; |
160 | |
161 | // Consume all VCHARs and spaces |
162 | while (s->cur < s->len && (is_vchar(s->src[s->cur]) || s->src[s->cur] == ' ' || s->src[s->cur] == '\t')) |
163 | s->cur++; |
164 | |
165 | // If the body ended with some spaces, remove them. Note how this loop |
166 | // doesn't have bound checks. We can do this because we know the header |
167 | // contains at least one tchar |
168 | while (s->src[s->cur] == ' ' || s->src[s->cur] == '\t') |
169 | s->cur--; |
170 | |
171 | // Make a slice for the value |
172 | header->value = (String) { s->src + value_off, s->cur - value_off }; |
173 | |
174 | // Consume any spaces that follow the value |
175 | while (s->cur < s->len && (s->src[s->cur] == ' ' || s->src[s->cur] == '\t')) |
176 | s->cur++; |
177 | |
178 | // Consume the CRLF following the header field |
179 | |
180 | if (1 >= s->len - s->cur |
181 | || s->src[s->cur+0] != '\r' |
182 | || s->src[s->cur+1] != '\n') |
183 | return -1; |
184 | s->cur += 2; |
185 | |
186 | return 0; |
187 | } |
188 | |
189 | int parse_header_list(Scanner *s, Header *headers, int max_headers) |
190 | { |
191 | int num_headers = 0; |
192 | while (!consume_str(s, S("\r\n"))) { |
193 | |
194 | if (num_headers == max_headers) |
195 | return -1; |
196 | |
197 | int ret = parse_header(s, &headers[num_headers]); |
198 | if (ret < 0) |
199 | return -1; |
200 | |
201 | num_headers++; |
202 | } |
203 | |
204 | return num_headers; |
205 | } |
206 | |
207 | int parse_request(String src, Request *req) |
208 | { |
209 | Scanner s = { src.ptr, src.len, 0 }; |
210 | |
211 | if (parse_method(&s, &req->method) < 0) |
212 | return -1; |
213 | |
214 | if (parse_target(&s, &req->target) < 0) |
215 | return -1; |
216 | |
217 | if (parse_version(&s, &req->minor) < 0) |
218 | return -1; |
219 | |
220 | int ret = parse_header_list(&s, req->headers, (int) COUNT(req->headers)); |
221 | if (ret < 0) |
222 | return -1; |
223 | req->num_headers = ret; |
224 | |
225 | return 0; |
226 | } |
227 | |
228 | int main(void) |
229 | { |
230 | String str = S("GET /index.html HTTP/1.1\r\nHost: coz.is\r\nUser-Agent: curl/7.81.0\r\n\r\n"); |
231 | |
232 | Request req; |
233 | if (parse_request(str, &req) < 0) { |
234 | printf("Parsing failed\n"); |
235 | return -1; |
236 | } |
237 | |
238 | printf("method: %.*s\n", UNPACK(req.method)); |
239 | printf("target: %.*s\n", UNPACK(req.target)); |
240 | printf("version: HTTP/1.%d\n", req.minor); |
241 | printf("headers:\n"); |
242 | for (int i = 0; i < req.num_headers; i++) |
243 | printf("name: %.*s, value: %.*s\n", UNPACK(req.headers[i].name), UNPACK(req.headers[i].value)); |
244 | |
245 | return 0; |
246 | } |
247 |