
Understanding How Search Engines Work: Crawling, Indexing, and Ranking


As discussed in the previous chapter, search engines are essentially answering machines: they exist to answer users’ requests. The primary aim of any search engine is to organize online content systematically and retrieve the most relevant content whenever a user asks for it.

However, search engines can only show your content to users if they can see it in the first place. In this chapter, we’re going to explore the steps you must take to make your content visible to search engines. Let’s get started!

SEO: Significance of Different Search Engines

Many people who are new to SEO wonder about the significance of different search engines. Most are aware that Google dominates the market, but they still wonder how crucial it is to optimize for Bing, Yahoo, and other search engines as well. The truth is, even though there are more than 30 significant web search engines, the SEO industry mostly concentrates on Google.

Now you must be wondering why this is so. The simplest answer is that the majority of people use Google to search the internet. To quantify this, more than 90% of web searches take place on Google, roughly 20 times more than Bing and Yahoo combined.

Now that we’ve established which search engine you need to focus on primarily, let’s figure out how a search engine works.

How Do Search Engines Work?

To understand how search engines work, you need to understand three main functions:

  • Crawling: Scouring the internet for content and examining the code and content of each URL the crawler comes across.
  • Indexing: Storing and organizing the content found during crawling, so that an indexed page can be returned as a result for relevant searches.
  • Ranking: Ordering the results from most relevant to least relevant in order to deliver the content that best answers a searcher’s query.

1. Explaining Search Engine Crawling

Crawling is the discovery process in which search engines send out a team of robots (known as crawlers or spiders) to look for new and updated content. Content can vary in format (a webpage, a PDF, an image, a video), but regardless of the format, it is discovered through links.

Googlebot starts by fetching a few web pages and then follows the links on those pages to find new URLs. By hopping along this network of links, the crawler finds new content and adds it to its index, called Caffeine, a massive database of discovered URLs, so the content can later be retrieved when a searcher is looking for information it matches.

2. Explaining Search Engine Index

An index is a sizeable database of all the content that search engines have found and deemed suitable for serving the users. In short, all the relevant content is processed and stored in the index.

3. Explaining Search Engine Ranking

The main aim of search engines is to answer users’ queries. They accomplish this by searching their index for content relevant to a user’s query and ordering that content in the hope of answering the question. The term ranking refers to this process of ordering search results by their relevance to the query. In simple terms, the higher a page ranks, the more relevant the search engine believes that page is to the query.

You can block search engine crawlers from part or all of your site, or instruct search engines to keep certain pages out of their index, and there can be valid reasons for doing so. However, if you want your content to be visible on search engine results pages (SERPs), you need to ensure it is crawlable and indexable. Otherwise, your content is effectively invisible.

Our aim with this chapter is to help you understand how search engines work and how you can make crawling, indexing, and ranking work for your website instead of against it.

Crawling

How Your Pages Can Be Found by Search Engines

As you’ve just learned, your site must be crawled and indexed in order to appear in the SERPs. If you already have a website, it’s a good idea to check how many of your pages are in the index. This will tell you whether Google is crawling and finding all of the pages you want it to, and none of the pages you don’t.

One way to check your indexed pages is the advanced search operator “site:yourdomain.com”. Enter “site:yourdomain.com” into Google’s search bar, and Google will return the results it has for your site in its index. The number of results Google displays (shown near the top of the results page as “About XX results”) gives you a fair idea of which pages on your site are indexed and how they currently appear in search results.

For more accurate results, monitor the Index Coverage report in Google Search Console. You can sign up for a free Google Search Console account if you don’t already have one. With this tool, you can submit sitemaps for your website and track how many of the submitted pages have actually been indexed by Google.

You might not appear in any of the search results for a number of reasons:

  • Since your website is new, it hasn’t been crawled yet.
  • No external websites link to your website.
  • Your website’s navigation makes it difficult for a robot to successfully crawl it.
  • Your website contains basic code, known as crawler directives, that blocks search engines from crawling or indexing it.
  • Google penalized your website for employing spamming strategies.

How You Can Instruct Search Engines to Crawl Your Website

If you used Google Search Console or the “site:domain.com” advanced search operator and discovered that some of your important pages are missing from the index and some of your unimportant pages have been mistakenly indexed, there are some optimizations you can apply to better instruct Googlebot how you want your web content to be crawled. By telling search engines how to crawl your website, you might be able to exert more control over what shows up there.

Most people think about making sure Google can find their important pages, but it’s easy to forget that there are certainly some pages you don’t want Googlebot to find. Examples include old URLs with little information, duplicate URLs (like sort-and-filter criteria for e-commerce), specific promo code pages, staging or test pages, and so on.

To restrict Googlebot from accessing specific pages and regions of your website, use robots.txt.

Robots.txt

Robots.txt files are found in the root directory of websites (for example, yourdomain.com/robots.txt), and they provide instructions for the precise areas of your site that search engines should and shouldn’t crawl as well as the rate at which they should do so.
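For illustration, here is a minimal robots.txt sketch; the directory names below are hypothetical examples, not recommendations for any particular site:

    # Rules for all crawlers
    User-agent: *
    # Keep bots out of areas you don't want crawled, such as staging or internal search results
    Disallow: /staging/
    Disallow: /search/
    # Some engines (though not Google) honor a crawl-rate hint
    Crawl-delay: 10

    # Point crawlers to your sitemap
    Sitemap: https://yourdomain.com/sitemap.xml

Anything not covered by a Disallow rule remains crawlable by default.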

How robots.txt files are handled by Googlebot

  • In the absence of a robots.txt file, Googlebot continues to crawl the website.
  • If a robots.txt file is present, Googlebot will usually follow its instructions and crawl the site accordingly.
  • Googlebot won’t crawl a site at all if it encounters an error while trying to read the robots.txt file and cannot determine whether one exists.

Not every web robot adheres to robots.txt. People with bad intentions (such as email address scrapers) build bots that ignore this protocol, and some malicious actors even use robots.txt files to find your private content. Although it may seem logical to block crawlers from private pages, such as login and administration pages, so that they don’t appear in the index, listing those URLs in a publicly accessible robots.txt file also makes them easier for malicious actors to locate. Instead of adding these pages to robots.txt, it is better to noindex them and lock them behind a login form.

Defining URL Parameters in Google Search Console (GSC)

By adding specific parameters to URLs, some websites (most frequently e-commerce sites) make the same content available on multiple different URLs. If you’ve ever shopped online, it’s likely you have used filters to narrow your search. For instance, you might search for “shoes” on Amazon and then refine your results by style, color, and size. The URL changes slightly with each refinement.

How does Google decide which URL to show to users when they search? Google does a decent job of determining the representative URL on its own, but you can use Google Search Console’s URL Parameters feature to tell Google exactly how you want it to treat your pages. If you use this feature to tell Googlebot to “crawl no URLs with __ parameter,” you’re effectively asking it to hide this content, which could result in those pages being removed from search results. That’s what you want if those parameters create duplicate pages, but it’s not ideal if you want those pages to be indexed.

Can Crawlers Access All of Your Key Content?

Now that you know some tactics for keeping search engine crawlers away from your unimportant content, let’s learn about the optimizations that can help Googlebot find your important pages. Search engines may crawl parts of your website while other pages or sections remain hidden for a variety of reasons. It’s crucial to make sure search engines can discover all the content you want indexed, not just your homepage.

Ask yourself these questions:

  • Is your content hidden behind login forms?

If you require users to log in, fill out forms, or answer surveys before accessing certain content, search engines won’t see those protected pages. A crawler is simply not going to log in.

  •  Do you frequently use search forms?

Robots cannot use search forms. Some people believe that if they place a search box on their website, search engines will be able to find everything their visitors search for; that is not the case.

  • Is text hidden within non-text content?

Text that you want indexed should not be embedded in non-text media such as images, videos, or GIFs. Even though search engines are getting better at recognizing images, there is no guarantee that they will be able to read and understand that text just yet. It is always best to add text within the HTML markup of your webpage.
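As a rough illustration (the file name and copy below are hypothetical), keeping the message in the HTML and treating the image as a supporting asset makes the text readable to crawlers:

    <!-- The text lives in the HTML, so crawlers can read it -->
    <h2>Summer Sale: 20% Off All Running Shoes</h2>
    <p>Offer valid until the end of the month.</p>

    <!-- The image supports the message; its alt attribute describes it -->
    <img src="/images/summer-sale-banner.jpg"
         alt="Summer sale banner showing running shoes at 20% off">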

  • Can search engines navigate your website?

Just as a crawler discovers your site via links from other sites, it needs a path of links on your own site to lead it from page to page. If a page you want search engines to find has no links pointing to it from any other page, it is essentially invisible. A common mistake many websites make is structuring their navigation in ways that search engines cannot follow, which makes it harder for them to appear in search results.

Common navigational errors that may prevent crawlers from seeing your entire site include:

  • A mobile navigation that shows different links than your desktop navigation
  • Any navigation whose menu items are not in the HTML, such as JavaScript-enabled navigation. Google has gotten much better at crawling and understanding JavaScript, but it is still not a perfect process. The surest way to have something found, understood, and indexed by Google is to put it in the HTML (see the HTML sketch below this list).
  • Personalization, or showing different navigation to one type of visitor than to others, which can look like cloaking to a search engine crawler
  • Forgetting to include a link to a key page in your navigation – keep in mind that links are the routes web crawlers take to find new content!

Because of this, it’s crucial that your website features easy-to-navigate pages and useful URL folder structures.
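As promised above, here is a minimal sketch of navigation whose menu items are plain HTML links (the URLs are hypothetical); a crawler can follow these without executing any JavaScript:

    <nav>
      <ul>
        <!-- Plain anchor links are crawlable without JavaScript -->
        <li><a href="/products/">Products</a></li>
        <li><a href="/blog/">Blog</a></li>
        <li><a href="/contact/">Contact</a></li>
      </ul>
    </nav>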

  • Is your information architecture clear?

Information architecture is the process of organizing and categorizing content on a website to improve user efficiency and findability. Users can easily navigate your website and find the information they require with the best information architecture.

  • Do you make use of sitemaps?

A sitemap is exactly what it sounds like: a list of the URLs on your website that crawlers can use to discover and index your content. One of the simplest ways to make sure Google finds your most important pages is to create a file that meets Google’s requirements and submit it through Google Search Console. Submitting a sitemap doesn’t replace good site navigation, but it can certainly help crawlers find all of your important content.

Even if no other websites link to yours, you may still be able to get it indexed by submitting your XML sitemap in Google Search Console. There’s no guarantee Google will include every submitted URL in its index, but it’s worth a try.
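As a sketch, a minimal XML sitemap following the standard sitemap protocol might look like this (the URLs and dates are placeholders):

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>https://yourdomain.com/</loc>
        <lastmod>2023-01-15</lastmod>
      </url>
      <url>
        <loc>https://yourdomain.com/puppies/</loc>
        <lastmod>2023-01-10</lastmod>
      </url>
    </urlset>

You would then submit the file’s URL through Google Search Console (and, optionally, reference it in robots.txt).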

When Crawlers Attempt to Access Your URLs, Do They Encounter Errors?

Crawlers may run into problems when trying to crawl the URLs on your website. Check the “Crawl Errors” report in Google Search Console to identify the URLs where this might be happening; it shows server errors and not-found errors. Server log files can also show you this, along with a wealth of other information such as crawl frequency, but because accessing and analyzing server log files is a more advanced technique, we won’t cover it in detail in this Beginner’s Guide.

Before you can do anything meaningful with the crawl error report, it’s important to understand server errors and “not found” errors.

  • 4xx codes: search engine crawlers can’t access your content due to a client error.

4xx errors are client errors, meaning the requested URL has bad syntax or cannot be fulfilled. One of the most common is the “404 – not found” error, which can result from a typo in the URL, a deleted page, or a broken redirect, to name a few causes. When search engines hit a 404, they cannot access the URL, and users who land on a 404 page may lose patience and leave.

  • 5xx codes: a server error prevents search engine crawlers from accessing your content.

5xx errors are server errors, meaning the server hosting the web page failed to fulfill the request from the searcher or search engine to access the page. Google Search Console’s “Crawl Errors” report has a tab dedicated to these errors, which often occur because the request for the URL timed out and Googlebot abandoned it. See Google’s documentation for more details on fixing server connectivity issues.
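To spot-check how a given URL responds, you can request just its headers from the command line; a quick sketch with curl (the URL is a placeholder):

    # -I sends a HEAD request and prints only the response headers
    curl -I https://yourdomain.com/old-page/

    # A healthy page returns a status like "HTTP/1.1 200 OK",
    # a missing page returns "404 Not Found",
    # and a server problem returns a 5xx status such as "500 Internal Server Error".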

Thankfully, there is an effective way to tell both searchers and search engines that your page has moved: the 301 (permanent) redirect. Suppose, for example, you move a page from example.com/young-dogs/ to example.com/puppies/. Users and search engines alike need a bridge between the old URL and the new one, and that bridge is a 301 redirect.
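As a minimal sketch, assuming an Apache server where you can edit the .htaccess file, that redirect could look like the line below (Nginx and most CMSs have their own equivalents):

    # Permanently redirect the old URL to the new one
    Redirect 301 /young-dogs/ https://example.com/puppies/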

When you implement a 301:

  • Link Equity: transfers link value from the old URL of the page to the new one.
  • Indexing: aids Google in finding and indexing the updated page.
  • User Experience (UX): makes sure users can find the page they need.

When you don’t implement a 301:

  • Link Equity: The authority from the previous URL is not transferred to the new version of the URL without a 301.
  • Indexing: The mere presence of 404 errors on your site won’t by itself hurt your search performance, but letting pages that rank well and receive traffic return 404s can knock them out of the index, taking their rankings and traffic with them.
  • User Experience (UX): Letting visitors click dead links that land on error pages instead of the intended page is frustrating.

Avoid redirecting URLs to irrelevant pages, i.e., URLs where the old URL’s content doesn’t actually live; the 301 status code itself tells search engines the page has permanently moved to a new address. If a page that ranks for a query is 301-redirected to a URL with different content, it can lose that ranking, because the content that made it relevant to that query is no longer there. 301s are powerful; move URLs responsibly.

The 302 redirect is another option, but it should be reserved for temporary moves and cases where passing link equity isn’t a big concern. A 302 is something like a road detour: you’re temporarily routing traffic along a different path, but it won’t stay that way forever.

Once you’ve made sure your site is optimized for crawlability, the next order of business is making sure it can be indexed.

Indexing

How Are Your Pages Interpreted and Stored by Search Engines?

Once you’ve established that your site is being crawled, the next step is to make sure it can be indexed. That’s right: just because a search engine can discover and crawl your site doesn’t mean its pages will be stored in the index. The previous section on crawling covered how search engines discover your pages; the index is where those discovered pages are stored. After a crawler finds a page, the search engine renders it much like a browser would, analyzing the page’s contents as it does so. All of that information is stored in the search engine’s index.

Continue reading to find out more about indexing and how to ensure that your website appears in this crucial database.

Can I see how a Google crawler views my pages?

Yes. The cached version of your page reflects a snapshot of the last time Googlebot crawled it. Google crawls and caches web pages at varying frequencies. You can view the cached version of a page by clicking the drop-down arrow next to the URL in the SERP and choosing “Cached.”

You can also view the text-only version of your site to check whether your important content is being crawled and cached effectively.

Are Any Pages Ever Taken Out of the Index?

Yes, pages can be removed from the index. Some of the main reasons a URL might be removed include:

  • The URL is returning a “not found” (4xx) or “server error” (5xx) status. This could be accidental (the page was moved without a 301 redirect being set up) or deliberate (the page was deleted and 404ed in order to get it removed from the index).
  • The URL had a noindex meta tag added. Site owners can use this tag to tell search engines to exclude a page from their index.
  • The URL was manually deindexed after being penalized for going against the search engine’s Webmaster Guidelines.
  • The URL has been blocked from crawling because a password is now required before visitors can access the page.

You can use the URL Inspection tool to check whether a previously indexed page on your site is still in the index, or use Fetch as Google, which has a “Request Indexing” feature, to submit individual URLs to the index. (Bonus: GSC’s “fetch” tool also has a “render” option that lets you see whether there are any problems with how Google interprets your page.)

How to Tell Search Engines How to Index Your Website

Robots Meta Directives

Meta directives (often called “meta tags”) are instructions you can give search engines about how to treat your web page. They let you tell search engine crawlers things like “don’t index this page in search results” or “don’t pass any link equity to any on-page links.” These instructions are executed either via robots meta tags placed in the <head> of your HTML pages (the most common method) or via the X-Robots-Tag in the HTTP header.

1. Robots Meta Tag

The robots meta tag can be used within the HTML of your website. It can exclude all search engines or only specific ones. Below are the most common meta directives, along with situations in which you might apply them.

Index/Noindex: tells the engines whether the page should be crawled and kept in the search engine’s index for retrieval. If you use “noindex,” you’re telling crawlers that you want the page excluded from search results. By default, search engines assume they can index all pages, so using the “index” value is unnecessary.

  • When it may be used: If you want to remove thin pages from Google’s index of your site (such as user-generated profile pages) but still want them to be available to visitors, you may choose to mark a page as “noindex.”

Follow/Nofollow: tells search engines whether links on the page should be followed or not followed. “Follow” means bots will follow the links on your page and pass link equity to those URLs. If you use “nofollow,” search engines will not follow the links on the page or pass any link equity to them. By default, all pages are assumed to have the “follow” attribute.

  • When to use: When trying to prevent a page from being indexed as well as prevent the crawler from following links on the page, nofollow is frequently used in conjunction with noindex.

noarchive: The noarchive directive prevents search engines from saving a cached copy of the page. By default, the engines keep visible copies of every page they have indexed, accessible to searchers through the cached link in the search results.

  • When to use: The noarchive tag can be used to stop searchers from seeing out-of-date pricing if you run an e-commerce site and your prices fluctuate frequently.
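To illustrate, here is a sketch of a robots meta tag placed in a page’s <head>; this example pairs noindex with nofollow, a common combination for pages you want kept out of the index entirely:

    <head>
      <title>Thin user profile page</title>
      <!-- Ask all crawlers not to index this page or follow its links -->
      <meta name="robots" content="noindex, nofollow">
      <!-- Or address a single crawler, e.g. Googlebot only -->
      <!-- <meta name="googlebot" content="noindex"> -->
    </head>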

2. X-Robots-Tag

The X-Robots-Tag is used within the HTTP header of your URL and is useful when you want to block search engines at scale. It offers more flexibility than meta tags because you can use regular expressions, block non-HTML files, and apply noindex rules sitewide.
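For example, here is the raw header and, assuming an Apache server with mod_headers enabled, a rule that keeps every PDF on the site out of the index:

    # In the HTTP response, the directive is simply a header line:
    #   X-Robots-Tag: noindex, nofollow

    # Apache (.htaccess or server config): send the header for all PDF files
    <FilesMatch "\.pdf$">
      Header set X-Robots-Tag "noindex, nofollow"
    </FilesMatch>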

Understanding the different ways you can influence crawling and indexing will help you avoid the common pitfalls that can keep your important pages from being found.

Ranking

How Do URLs Rank in Search Engines?

How do search engines make sure that when someone types a query into the search bar, they get relevant results back? That process is known as ranking: the ordering of search results from most relevant to least relevant for a given query.

Search engines evaluate relevance using algorithms: processes or formulas by which stored information is retrieved and ordered in meaningful ways. These algorithms have gone through many revisions over the years to improve the quality of search results. Google, for example, adjusts its algorithms every day; some of these updates are minor quality tweaks, while others are core/broad algorithm updates deployed to tackle a specific issue, such as Penguin, which targets link spam. Check out our Google Algorithm Change History for a list of both confirmed and unconfirmed Google updates going back to the year 2000.

Why does the algorithm change so often? Although Google doesn’t always explain exactly why it does what it does, we do know that its goal with algorithm adjustments is to improve overall search quality. When asked about algorithm updates, Google will typically answer with something like, “We’re making quality updates all the time.” This means that if your site suffered after an algorithm change, compare it against Google’s Quality Guidelines or Search Quality Rater Guidelines, both of which are very telling about what search engines value.

The Goals of Search Engines

Search engines have always aimed to deliver relevant answers to users’ queries in the most helpful format. If that’s the case, why does SEO seem so different now than in years past?

Think of it as someone learning a new language. At first, their comprehension of the language is very basic: “See Spot Run.” Over time, their understanding deepens and they learn semantics, the study of meaning in language and the relationships between words and phrases. With enough practice, they eventually know the language well enough to answer even vague or incomplete questions.

When search engines were just beginning to learn our language, it was much easier to game the system with tricks and tactics that actually violate quality guidelines. Take keyword “stuffing” as an example: if you wanted to rank for a term like “funny jokes,” you might add the phrase “funny jokes” to your page over and over again and make it bold.

This tactic made for terrible user experiences: instead of laughing at funny jokes, people were bombarded with annoying, hard-to-read text. It may have worked in the past, but it was never what search engines wanted.

The Role Links Play in SEO

When we talk about links, we could mean two things. Backlinks, also referred to as “inbound links,” are links from other websites that point to your website, while internal links are links on your own site that point to your other pages (on the same site).

Early on, in order to select how to rank search results, search engines needed help figuring out which URLs were more trustworthy than others. The number of links pointing to each site was counted to achieve this.

Backlinks function very similarly to actual Word-of-Mouth (WoM) referrals.

  • Getting a referral from others: a good sign of authority.
  • Getting a referral from yourself: biased, and not a good sign of authority.
  • Getting referrals from low-quality or irrelevant sources: can get you flagged for spam, and not a good sign of authority.
  • Getting no referrals at all: not a good sign of authority.

This is why PageRank was created. PageRank, part of Google’s core algorithm, is a link analysis algorithm named after Larry Page, one of Google’s founders. It estimates the importance of a web page by measuring the quality and quantity of links pointing to it. The assumption is that the more important, relevant, and trustworthy a web page is, the more links it will have earned.
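As a rough illustration of the idea (the simplified textbook form of the formula, not a description of Google’s current system), the classic PageRank of a page u can be written in LaTeX as:

    % d is a damping factor (commonly around 0.85), N is the total number of pages,
    % B_u is the set of pages linking to u, and L(v) is the number of outbound links on page v.
    PR(u) = \frac{1 - d}{N} + d \sum_{v \in B_u} \frac{PR(v)}{L(v)}

In plain terms, a page inherits a share of the score of every page that links to it, so a few links from important pages can be worth more than many links from unimportant ones.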

The more natural backlinks you earn from high-authority (trusted) websites, the better your odds of ranking higher in search results.

The Function of Content in SEO

Links wouldn’t serve any purpose if they didn’t direct searchers to something. That something is content! Content is anything meant to be consumed by searchers: text, images, video, and more. If search engines are answer machines, content is the means by which the engines deliver those answers.

With hundreds of possible results for any given query, how do search engines decide which pages a searcher will find valuable? A big part of where your page ranks for a query is determined by how well its content matches the query’s intent. In other words, does the page match the words that were searched and help fulfill the task the searcher was trying to accomplish?

Because the focus is on user satisfaction and task accomplishment, there are no strict rules about how long your content should be, how many times it should contain a keyword, or what you put in your header tags. All of those factors can affect how a page performs in search, but the emphasis should be on the users who will actually be reading the content.

Even though there are now hundreds or even thousands of ranking factors, the top three have stayed fairly consistent: links to your website (which act as third-party credibility signals), on-page content (high-quality content that fulfills a searcher’s intent), and RankBrain.

What is RankBrain?

RankBrain is the machine learning component of Google’s core algorithm. Machine learning is a type of computer program that continually improves its predictions over time through new observations and training data. In other words, it’s always learning, so search results should be constantly improving.

For example, if RankBrain notices that a lower-ranking URL is giving users a better result than higher-ranking URLs, you can bet that RankBrain will adjust those results, moving the more relevant result up and demoting the less relevant pages as a byproduct.

In what ways does this affect SEOs?

We must concentrate more than ever on satisfying searcher intent because Google will continue to employ RankBrain to highlight the most pertinent, helpful content. You’ve made a significant first step toward succeeding in a RankBrain environment if you give searchers who might land on your page the best information and experience possible.

Metrics of Engagement: Correlation, Cause, or Both?

When it comes to Google rankings, engagement metrics are most likely part correlation and part causation. By engagement metrics, we mean data that shows how searchers who arrive at your site from the search results actually interact with it. This includes things like:

  • Clicks (visits from search)
  • Time on page (how long the visitor spent on a page before leaving it)
  • Bounce rate (the percentage of all website sessions where users viewed only one page)
  • Pogo-sticking (clicking on an organic result and then quickly returning to the SERP to choose another result)

Google’s Stance on This

Google has never used the term “direct ranking signal,” but it has been clear that it uses click data to adjust the SERP for particular queries.

Google appears to stop short of calling engagement metrics a “ranking signal” because those metrics are used to improve search quality, with the rank of individual URLs being just a byproduct of that. But given that Google needs to maintain and improve search quality, it seems inevitable that engagement metrics are more than mere correlation.

The Change in Search Results

The phrase “10 blue links” was coined to describe the SERP’s flat layout back when search engines lacked much of the sophistication they have today: every search returned a page of 10 identically formatted organic results, and in that environment the coveted #1 spot was the pinnacle of SEO. Then something happened. Google began adding new result formats, called SERP features, to its search result pages. These SERP features include, among others: paid ads, Knowledge Panels, the Local (map) Pack, Featured Snippets, sitelinks, and People Also Ask boxes.

And Google keeps adding new ones. It even experimented with “zero-result SERPs,” where a single Knowledge Graph result was shown with no results below it apart from a “see more results” option. The addition of these features initially raised some eyebrows for two main reasons. First, many of them pushed organic results further down the SERP. Second, because more queries are now answered directly on the SERP, fewer searchers click through to the organic results.

So why would Google do this? It all comes back to the search experience. User behavior shows that some queries are better satisfied by particular content formats, and each type of SERP feature tends to map to a different category of query intent. Many factors influence how your content ranks on SERPs, but if you want it to be crawled, indexed, and ranked, you need to pay special attention to its structure.

We’re going to discuss it more in detail in Chapter 3.
