DEV Community

Arika O


How web browsers work - parsing the HTML (part 3, with illustrations)📜🔥

So far we have discussed navigation and data fetching. Today we're going to talk about parsing in general and HTML parsing in particular.

3. HTML PARSING

We saw how, after the initial request to the server, the browser receives a response containing the HTML of the webpage we are trying to access (the first chunk of data). The browser's job now is to start parsing that data.

Parsing means analyzing and converting a program into an internal format that a runtime environment can actually run.

In other words, parsing means taking the code we write as text (HTML, CSS) and transforming it into something that the browser can work with. The parsing is done by the browser engine (not to be confused with the browser's JavaScript engine).

The browser engine is a core component of every major browser, and its main role is to combine structure (HTML) and style (CSS) so it can draw the web page on our screens. It is also responsible for finding out which pieces of the page are interactive. We should not think of it as a separate piece of software but as part of a bigger piece of software (in our case, the browser).

There are many browser engines in the wild but the majority of the browsers use one of these three actively developed full engines:

Gecko
It is developed by Mozilla for Firefox. It used to power several other browsers; today, besides Firefox, it is used by a handful of others such as Tor Browser and Waterfox. It is written in C++ and JavaScript and, since 2016, additionally in Rust.

WebKit
It's primarily developed by Apple for Safari. It also powers GNOME Web (Epiphany) and Otter. Surprisingly enough, on iOS Apple has historically required all browsers, including Firefox and Chrome, to use WebKit as well. It is written in C++.

Blink, part of Chromium
Beginning as a fork of WebKit, it's primarily developed by Google for Chrome. It also powers Edge, Brave, Silk, Vivaldi, Opera, and most other browser projects (some via QtWebEngine). It is written in C++.

Now that we understand who's going to do the parsing, let's see what happens exactly after we receive the first HTML document from the server. Let's assume the document looks like this:

<!doctype HTML>
<html>
  <head>
    <title>This is my page</title>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
  </head>
  <body>
    <h1>This is my page</h1>
    <h3>This is a H3 header.</h3>
    <p>This is a paragraph.</p>
    <p>This is another paragraph,</p>
  </body>
</html>

Even if the requested page's HTML is larger than the initial 14KB packet, the browser will begin parsing and attempting to render an experience based on the data it has. HTML parsing involves two steps: tokenization and tree construction (building something called the DOM tree, where DOM stands for Document Object Model).

Tokenization

This is the lexical analysis step: it converts the input into tokens (the basic components of source code). Imagine taking an English text and breaking it down into words, where the words would be the tokens.

What results at the end of the tokenization process is a series of zero or more of the following tokens: DOCTYPE, start tag (<tag>), end tag (</tag>), comment, character and end-of-file. Start and end tag tokens carry the tag name, and a start tag also carries its attributes (names and values) and a self-closing flag (<tag/>). Plain text content within an element is emitted as character tokens.

[Image: illustration of the tokenization process]
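To make the idea of a token stream more concrete, here is a minimal sketch using Python's built-in html.parser module. It is not how a real browser tokenizer is implemented, and it does not follow the WHATWG algorithm exactly: for instance, it reports text as whole runs of characters instead of one token per character. The TokenLogger class and the tokenize function are names made up for this illustration.

```python
from html.parser import HTMLParser


class TokenLogger(HTMLParser):
    """Collects a simplified token stream from an HTML string."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_decl(self, decl):
        # Called for declarations such as <!doctype html>
        self.tokens.append(("DOCTYPE", decl))

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        self.tokens.append(("start tag", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.tokens.append(("end tag", tag))

    def handle_comment(self, data):
        self.tokens.append(("comment", data))

    def handle_data(self, data):
        # Skip whitespace-only text to keep the output short
        if data.strip():
            self.tokens.append(("characters", data))


def tokenize(html):
    logger = TokenLogger()
    logger.feed(html)
    logger.close()
    # html.parser has no end-of-file event, so we add the token ourselves
    logger.tokens.append(("end-of-file",))
    return logger.tokens


tokens = tokenize('<!doctype html><p class="intro">Hi</p>')
for token in tokens:
    print(token)
```

Running it prints one tuple per token: the DOCTYPE, the start tag for p together with its class attribute, the text, the end tag and finally the end-of-file marker.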

Building the DOM

After the first token gets created, tree building starts. This essentially means creating a tree-like structure (called the Document Object Model) based on the previously parsed tokens.

The DOM tree describes the content of the HTML document. The <html> element is the first tag and the root node of the document tree. The tree reflects the relationships and hierarchies between different tags: a tag that contains other tags is a parent node, and the tags nested within it are its child nodes. The greater the number of nodes, the longer it takes to build the DOM tree. Below is the DOM tree for the example HTML document we got from the server:

[Image: DOM tree for the example HTML document]
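The parent and child relationships can also be sketched in a few lines of Python. This is only a toy tree builder on top of html.parser, assuming well-formed input; a real browser follows the much more detailed tree construction rules of the HTML specification (error recovery, implied tags and so on). Node, TreeBuilder, build_tree and outline are hypothetical names used just for this example, and VOID_ELEMENTS lists only a few of the elements that never have children.

```python
from html.parser import HTMLParser

# A few elements that never have children or an end tag (not a complete list)
VOID_ELEMENTS = {"meta", "link", "img", "br", "hr", "input"}


class Node:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a simplified tree of nodes from start tags, end tags and text."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root  # the element we are currently inside

    def handle_starttag(self, tag, attrs):
        node = Node(tag, parent=self.current)
        self.current.children.append(node)
        if tag not in VOID_ELEMENTS:
            self.current = node  # following tokens become children of this node

    def handle_endtag(self, tag):
        # Close the nearest open element with this name, if there is one
        node = self.current
        while node is not self.root and node.name != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.current.children.append(Node(f"#text: {text}", parent=self.current))


def build_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def outline(node, depth=0):
    """Returns the tree as a list of indented lines, one per node."""
    lines = ["  " * depth + node.name]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


page = (
    "<html><head><title>This is my page</title><meta charset='UTF-8'></head>"
    "<body><h1>This is my page</h1><p>This is a paragraph.</p></body></html>"
)
tree = build_tree(page)
print("\n".join(outline(tree)))
```

The printed outline shows html at the top with head and body as its children, and the text of each element as a child of that element.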

In reality, the DOM is more complex than what we see in that schema, but I kept it simple for a better understanding (also, we'll talk in more detail about the DOM and its importance in a future article).

This building stage is reentrant, meaning that while one token is handled, the tokenizer might be resumed, causing further tokens to be emitted and processed before the first token's processing is complete. From bytes until the DOM is created, the complete process looks something like this:

[Image: the complete process, from bytes to the DOM]

The parser works through the document from top to bottom. When it encounters non-blocking resources (for example images), the browser requests them from the server and continues parsing. A <script> tag without the async or defer attribute, on the other hand, is parser-blocking: the parser stops until the script is downloaded and executed. CSS stylesheets (including font stylesheets loaded from a CDN) behave a little differently: they do not stop the parser, but they block rendering, and any script that follows them has to wait until they are downloaded. That's why, if you're working with JavaScript, it is recommended to add your <script> tags at the end of the HTML file or, if you want to keep them in the <head> tag, to add the defer or async attribute to them (async allows execution as soon as the script is downloaded, and defer allows execution only after the whole document has been parsed).
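As a rough way to check a page for scripts that would pause the parser, the sketch below walks over the script tags of a document and labels each external script by its loading mode. It only looks at the async, defer and type attributes and ignores inline scripts; it treats type="module" scripts as deferred, since that is their default behaviour. ScriptClassifier, classify_scripts and the file names are made up for the example.

```python
from html.parser import HTMLParser


class ScriptClassifier(HTMLParser):
    """Labels each external <script> as async, defer or parser-blocking."""

    def __init__(self):
        super().__init__()
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag != "script":
            return
        attributes = dict(attrs)  # boolean attributes get the value None
        src = attributes.get("src")
        if src is None:
            return  # inline scripts are ignored in this sketch
        if "async" in attributes:
            mode = "async"
        elif "defer" in attributes or attributes.get("type") == "module":
            mode = "defer"
        else:
            mode = "parser-blocking"
        self.scripts.append((src, mode))


def classify_scripts(html):
    classifier = ScriptClassifier()
    classifier.feed(html)
    classifier.close()
    return classifier.scripts


# Hypothetical page with three scripts in the <head>
page = """
<head>
  <script src="analytics.js" async></script>
  <script src="app.js" defer></script>
  <script src="legacy.js"></script>
</head>
"""
result = classify_scripts(page)
for src, mode in result:
    print(f"{src}: {mode}")
```

In this example only legacy.js would stop the parser, so it is the one worth moving to the end of the body or marking with defer.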

Pre-loaders and making the page faster

Internet Explorer, WebKit and Mozilla all implemented pre-loaders in 2008 as a way of dealing with blocking resources, especially scripts (as we said earlier, when the parser encounters a script tag, HTML parsing stops until the script is downloaded and executed).

With a pre-loader, when the browser is stuck on a script, a second, lighter parser scans the rest of the HTML for resources that need to be retrieved (stylesheets, scripts etc.). The pre-loader then starts retrieving these resources in the background, with the aim that by the time the main HTML parser reaches them they may have already been downloaded (if these resources were already cached, this step is skipped).
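The scanning part of a pre-loader can be imitated with a small sketch. It does not build a tree and does not run any script; it only collects the URLs it finds in script, link rel="stylesheet" and img tags, which is roughly the kind of information a pre-loader needs in order to start downloads early. Real pre-loaders are part of the browser engine and look at more than this. PreloadScanner, scan and the file names below are assumptions made for the example.

```python
from html.parser import HTMLParser


class PreloadScanner(HTMLParser):
    """Collects resource URLs without building a tree or running scripts."""

    def __init__(self):
        super().__init__()
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "script" and attributes.get("src"):
            self.resources.append(("script", attributes["src"]))
        elif tag == "link" and attributes.get("href"):
            # rel can hold several space-separated values
            rel = (attributes.get("rel") or "").lower().split()
            if "stylesheet" in rel:
                self.resources.append(("style", attributes["href"]))
        elif tag == "img" and attributes.get("src"):
            self.resources.append(("image", attributes["src"]))


def scan(html):
    scanner = PreloadScanner()
    scanner.feed(html)
    scanner.close()
    return scanner.resources


# Hypothetical page: app.js blocks the main parser in the <head>
page = """
<head>
  <link rel="stylesheet" href="main.css">
  <script src="app.js"></script>
</head>
<body>
  <img src="hero.png">
  <script src="late.js"></script>
</body>
"""
found = scan(page)
for kind, url in found:
    print(kind, url)
```

While the main parser waits for app.js, a scanner like this has already discovered hero.png and late.js further down the page, so their downloads could start in the background.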


Top comments (7)

grafeno30

Arika, amazing tutorial!!. Thank you

Arika O • Edited

I'm glad you find it useful.

grafeno30

I am IT teacher in a secondary school. My students are going to do homework of your article: translate to spanish, make a Google Slide and finally translate to English
Thank you!!!

Arika O

There will be more articles to this series. Wishing them good luck and lots of fun while doing the homework 🤖!

Nicholas Stimpson

Your diagram of the tokens and nodes is not quite right. The tokenizer doesn't form characters into words and the nodes are not broken into words. So the diagram should look like this:
Parser diagram showing the text content tokenized to individual characters and the 'Hello world!' text in a single node

Arika O

You are right, for each letter of the word, a token is emitted (I'll correct the diagram a bit later). The node representation was an oversight. Thank you for taking the time to modify the diagram :).

Maxim

With a pre-loader, when the browser is stuck on a script, a second ligher parser is scanning the HTML for resources that need to be retrieved (stylesheets, scripts etc).

Have you ever tested your words? If a script takes too long to run and blocks the rest of the page, the speculative parser (pre-loader) doesn't work. The speculative parser is stopped before the script is executed.

Why you didn't read the spec and didn't test it?
