What is a crawler? Everything you want to know


A crawler is a patrol program that acquires information such as text and images from websites on the Internet and automatically builds a search database.
This program is also called a “Bot”, “Spider”, or “Robot”.
In particular, Google’s search engine uses a crawler called “Googlebot”.
If you improve crawl ability by making your site easier for this crawler to discover and read, you can deliver your website’s information to users properly.
This is good for SEO.

On the other hand, if crawl ability is poor, it is difficult for search engines to recognize the information on your website.
In that case, pages are either not indexed or indexed with incomplete information.
As a result, the website’s information does not reach search users properly, and search rankings and search traffic do not improve.

To avoid such bad situations, understand how crawlers work and make your content crawlable.

With that in mind, this article explains the meaning and mechanism of crawlers, as well as crawler measures (ways to improve crawling), in terms that are easy for beginners to understand.

What is a crawler?

A crawler is a patrol program that acquires information such as text and images from websites on the Internet and automatically builds a search database.
This program is also called a “Bot”, “Spider”, or “Robot”.

For example, let’s say you acquired a domain and created a website, and uploaded it to your server.
In this case, the website is published on the Internet.
To obtain information such as text and images from this website, a dedicated program crawls it.
This program is called a crawler.

Read Also: What is a long tail keyword? How to choose effectively for SEO!

Crawler functionality is part of how search engines work

Crawler functionality is part of how search engines work.

For example, search engines such as Google, Yahoo, and Bing (Microsoft) are mostly robotic search engines.
This robotic search engine is mainly composed of three mechanisms.

  1. Website page information collected on the Internet is registered in the database
  2. Ranking of pages registered in the database
  3. Display ranked pages in search results

Among these mechanisms, the crawler plays the role of “(1) registering the page information of the website collected on the Internet in the database”.
Crawlers automatically navigate (crawl) the web by following links from websites that are already in the database. At each destination, the crawler parses the page and processes its information. The processed page information is converted into data that the search algorithm can handle easily and is registered in the database. The crawler is responsible for repeating this flow.

In this way, the crawler functions as part of the search engine’s mechanism.

Types of crawlers

There are several types of crawlers.

For example, Google’s search engine has a crawler called “Googlebot” for web search.
Each search engine has its own crawler.

  • Googlebot (Google)
  • Bingbot (Microsoft search engine Bing)
  • Yahoo Slurp (Yahoo outside Japan)
  • Baiduspider (Baidu)
  • Yetibot (Naver)
  • ManifoldCF (Apache)
  • Applebot (Apple)

In addition to these crawlers, Google has many crawlers such as Googlebot-Image for image search and Googlebot-Mobile for mobile search.

File types that the crawler retrieves

There are several types of files that the crawler retrieves.
For example, Google’s crawler can crawl file formats such as HTML, images, videos, JavaScript, CSS, and PDF.

  • HTML
  • Image
  • Video – any of the supported video formats
  • JavaScript
  • CSS
  • PDF
  • Other XML – XML files that do not include XML-based formats such as RSS and KML
  • JSON
  • Syndication – an RSS or Atom feed
  • Audio
  • Geographic data – KML or other geographic data
  • Other file formats – any other file format not listed here
  • Unknown (failure) – if the request fails, the file format is unknown
*Source: Crawl Statistics Report – Search Console Help, “Crawled File Formats”

Also, other search engine crawlers can basically crawl the same file formats as Google.

Crawler measures (improving crawl ability)

There are several crawler measures you can take, and they can be expected to improve crawl ability.
What is crawl ability? It is how easily a crawler such as Googlebot can find and recognize a web page (the ease of crawling) as it discovers pages by following links.
In other words, a crawler (a program such as Googlebot) follows links on the Internet to find web pages and reads the information on those pages.
This process of reading a web page’s information is called “crawling”, and the ease with which it happens is “crawl ability”.
The main measures are as follows.

  • Set up an XML sitemap
  • Create quality pages
  • Keep URLs simple
  • Get rid of duplicate pages
  • Optimize internal links
  • Increase backlinks
  • Reduce file size
  • Optimize your server
  • Eliminate soft 404 errors

Set up an XML sitemap

To improve crawl ability, set up an XML sitemap.
For example, first upload (install) an XML sitemap (sitemap.xml) to your server.
The sitemap then becomes accessible at a URL such as “xyz.com/sitemap.xml”; submit that URL in the Sitemaps section of Search Console.
That way, you can tell Google to prioritize crawling the URLs listed in your sitemap.xml file.
As a result, crawlers are invited to each page on the site, which promotes crawl ability.
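As a rough sketch, a minimal sitemap.xml might look like the following (the URLs and dates are hypothetical placeholders):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- One <url> entry per page you want crawled -->
  <url>
    <loc>https://xyz.com/</loc>
    <lastmod>2023-01-15</lastmod>
  </url>
  <url>
    <loc>https://xyz.com/what-is-a-crawler/</loc>
    <lastmod>2023-01-20</lastmod>
  </url>
</urlset>
```

Once this file is uploaded, submitting “xyz.com/sitemap.xml” in Search Console lets Google discover every URL listed in it.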

In this way, setting up an XML sitemap is a key crawling measure.

By the way, if you have a blog built with WordPress, install the “Google XML Sitemaps” plugin.
It is convenient because it automates everything from creating sitemap.xml to notifying search engines of updated URLs. (Only the first sitemap submission in Search Console is manual.)

On the other hand, if you do not set up an XML sitemap, crawl ability will not improve.
To put it a little further, pages with no internal links or backlinks basically have low crawl ability because they are difficult for crawlers to reach.
Without an XML sitemap to assist the crawling of such pages, crawl ability does not improve.
As a result, indexing will not be promoted, so no SEO effect can be expected.

Create quality pages

To improve crawl ability, create quality pages.
For example, search for your SEO keywords (the keywords you want to rank for), study the titles and content of the top competing pages, and use them as a reference when creating your own web page.
Include your SEO keywords in the title.
That way, the information users want (their search intent) will be covered, resulting in a high-quality page.
Furthermore, if you add the amount of information users expect (comprehensiveness), ease of access (convenience), credibility, and originality, the page will be of even higher quality.
If the page is of high quality, it will basically be easier for the crawler to crawl, so crawl ability will improve.

If your site contains very useful information, it may get crawled more than you expected.

*Source: Crawl Statistics Report (website) – Search Console Help

In this way, creating high-quality pages is a key crawling measure.

Also, a high-quality page is more likely to be referenced and to attract backlinks (gaining popularity), which makes it easier for crawlers to crawl.
In addition, quality pages with useful information are generally crawled more readily.
This reflects the fundamental mechanism by which Google Search keeps URLs (useful information) fresh in its index.

Popularity: URLs that are more popular on the Internet tend to be crawled more frequently to keep the information fresh in Google’s index.
Freshness: Our system keeps URLs fresh in our index.
*Source: Official Google Webmasters blog [EN]: What is Googlebot’s crawl budget?

In addition, create pages with an eye to “adding useful information and deleting (updating) unnecessary information”, “creating content centered on the body text”, and “marking up with appropriate HTML tags and structured data”. This, too, improves quality, which leads to better crawl ability.

On the other hand, low-quality pages mean lower crawl ability.
More to the point, low-quality pages that violate the webmaster guidelines, such as spam or thin content, waste crawls that could have gone to valuable pages.
If that happens, the crawl ability of valuable pages decreases, so no SEO effect can be expected.

Keep URLs simple

To improve crawl ability, keep your URLs simple.
For example, if the title is “What is SEO?”, you might be tempted to use the string “search-engine-optimization” in the URL, but use the short abbreviation “seo” instead.
By doing so, the overall string will be shorter and the URL will be more concise, which will improve crawl ability.
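As a hypothetical illustration (the domain and slugs below are made up), the difference might look like this:

```
# Harder to crawl: long slug plus auto-generated parameters
https://example.com/blog/what-is-search-engine-optimization-guide?id=91744&ref=top&sort=new

# Simpler and more concise
https://example.com/seo/
```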

In this way, keeping URLs simple is a key crawling measure.

Also, like our blog “Infowikiz”, you can use a short URL based on the “post ID” (URL example: infowikiz.com/91744/), or reconsider whether category names can be shortened and whether categories are needed in the URL at all.
By doing so, you shorten the overall string and make the URL simpler, which leads to improved crawl ability.

On the other hand, complex URLs are less crawlable.

In particular, crawls are wasted on complex URLs generated automatically.
If that happens, the crawler may not get to other valuable page URLs, so crawling can be hindered (crawl ability deteriorates).
As a result, the content is not properly read (not crawled), so it is not indexed and no SEO effect can be expected.

Get rid of duplicate pages

To improve crawl ability, eliminate duplicate pages (duplicate content).
For example, suppose the URLs “www.infowikiz.com” and “infowikiz.com” (with and without www) are both accessible and display the same content.
In this case, they are duplicate pages. Therefore, implement a 301 redirect from “www.infowikiz.com” to “infowikiz.com” to unify the site on one version.
By doing so, Google will recognize the “infowikiz.com” URL as the canonical URL, which helps avoid duplicate pages. As a result, wasted crawling of the “www.infowikiz.com” URL is reduced, and crawl ability improves accordingly.
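As a rough sketch of such a redirect, assuming an Apache server with mod_rewrite enabled and the site served over https (the domain is the example used above), the .htaccess rules might look like this:

```apache
# Permanently (301) redirect all www.infowikiz.com requests to infowikiz.com
RewriteEngine On
RewriteCond %{HTTP_HOST} ^www\.infowikiz\.com$ [NC]
RewriteRule ^(.*)$ https://infowikiz.com/$1 [R=301,L]
```

On other servers (for example nginx), the equivalent is a server-level permanent redirect rather than an .htaccess file.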

In this way, eliminating duplicate pages is a key crawling measure.

You can also avoid duplicate pages by using canonical tags, the rel="canonical" HTTP header, and sitemaps to tell Google the canonical URL.
In addition, normalizing URLs by marking up AMP pages with canonical tags and separate smartphone pages with alternate tags is also an important measure for avoiding duplicate pages.
In addition, if the site is already in operation, take the number of backlinks into account when deciding which URL to normalize to.
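As a minimal sketch (the URL is the example domain used above), a canonical tag is placed in the <head> of a duplicate page and points to the URL you want treated as canonical:

```html
<head>
  <!-- Tells Google that https://infowikiz.com/91744/ is the canonical version of this page -->
  <link rel="canonical" href="https://infowikiz.com/91744/">
</head>
```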

On the other hand, if you have duplicate pages, crawl ability will suffer.
To put it a little further, cases such as “multiple URLs that access the same page (for example, parameter URLs that display the same content)” and “separate URLs for the PC and smartphone versions of a page” produce duplicate pages.
All of these duplicate pages get crawled, resulting in slower or fewer crawls of other valuable pages (such as newly updated pages).

As a result, crawl ability deteriorates and valuable pages are not indexed, so SEO effects cannot be expected.

Optimize internal links

To improve crawl ability, optimize internal links.
For example, set up links (internal links) from the top page’s content to lower-level pages (category list pages, article pages, etc.).
Also, create a dedicated navigation page (an HTML sitemap) and place a link to it in the sub-content areas.
Crawlers then follow those internal links to each page on your site.
As a result, the range of crawling spreads and crawl ability improves.
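A minimal sketch of such internal links in HTML (the page names and paths below are hypothetical):

```html
<!-- Internal links from the top page to lower-level pages -->
<nav>
  <ul>
    <li><a href="/category/seo/">SEO articles</a></li>
    <li><a href="/what-is-a-crawler/">What is a crawler?</a></li>
    <li><a href="/sitemap/">HTML sitemap</a></li>
  </ul>
</nav>
```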

In this way, optimizing internal links is a key crawling measure.

By the way, the top page typically has particularly high “popularity” and “freshness” (frequency of updates).
The same goes for pages with many backlinks and frequently updated pages.
These pages are basically easier for crawlers to crawl.

On the other hand, without optimizing internal links, crawl ability will not increase.
To put it a little more, if you don’t set up internal links, the number of routes that the crawler can patrol will not increase, so the crawl ability will not improve accordingly.
Conversely, blindly adding internal links to low-quality pages increases the number of crawl paths, but not the need for crawling.
In short, the more low-quality internal links you have, the more crawl time is wasted and the more likely it is that your valuable pages will not get crawled.
As a result, the indexing of valuable pages will not be facilitated, leading to loss of SEO effectiveness.

Increase backlinks

To improve crawl ability, increase backlinks.
For example, on social networks with many active users, such as Twitter and Facebook, regularly share the URLs of high-quality pages and useful related information from your own website.
This will encourage sharing and help you get more referrals on external sites.
As a result, the number of backlinks increases, making it easier for crawlers to crawl from external sites, so crawl ability improves.

In this way, increasing backlinks is a key crawling measure.

Also, if the number of backlinks increases and the web page becomes popular, the need for crawling the page will fundamentally increase (crawl ability will improve).

On the other hand, if the number of backlinks does not increase, crawl ability will not increase.
To say a little more, if there are few backlinks, the crawler has fewer paths to reach the page, so crawl frequency does not increase.
If that happens, the crawl ability of the web page will not improve and indexing will not be promoted, so no SEO effect can be expected.

By the way, increasing “links in advertisements on dedicated services”, “paid links”, “links in comments”, and other “links that do not comply with Google’s webmaster guidelines (quality guidelines)” basically does not lead to improved crawl ability.

Reduce file size

To improve crawl ability, reduce file sizes.
For example, compress files such as images, CSS, and JavaScript, and remove unnecessary source code.
Smaller files consume fewer crawling resources.
As a result, crawling becomes faster and the spare crawling resources can be spent on other content, so crawl ability improves.
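As one rough sketch of compression, assuming an Apache server with mod_deflate enabled, text-based files such as HTML, CSS, and JavaScript can be compressed before they are sent:

```apache
# Compress text-based responses (HTML, CSS, JavaScript) to reduce transfer size
<IfModule mod_deflate.c>
  AddOutputFilterByType DEFLATE text/html text/css application/javascript
</IfModule>
```

Image files, on the other hand, are usually better reduced by resizing them or converting them to a more efficient format before upload.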

In this way, reducing file size is a key crawling measure.

You can also reduce file size by using caching and by converting pages to AMP.
Reducing file size in this way not only improves crawl ability but also speeds up your site.
If display speed increases and the site continues to respond quickly, the crawl limit rises, so crawl ability can be expected to improve in this respect as well.

On the other hand, the larger the file size, the worse the crawl ability.
In particular, if there are a large number of large image files, the capacity of the server is likely to be squeezed.
If this happens, your site may become slow, or you may not be able to connect to your site due to a server error (5xx).
For such sites, Google will reduce the crawl frequency.

As a result, crawling becomes incomplete and crawl ability deteriorates, so the web page’s content is not fully recognized (only part of it is recognized).
In other words, indexing is not promoted, so SEO effects cannot be expected.

Optimize your server

To improve crawl ability, optimize your server.
For example, increase disk (storage) and memory capacity, or improve CPU performance to raise server specs.
This will reduce the load on your server and make it easier for your server to respond faster.
As a result, it becomes easier for the crawler to patrol the web page normally, so the crawl ability improves.

In this way, optimizing the server is a key crawling measure.

In addition, upgrading the PHP version, distributing load across servers (load balancers, etc.), using a CDN, and improving the database also contribute to server optimization.

On the other hand, if you don’t optimize your server, crawl ability will suffer.
For example, when running a site, files such as pages and images keep increasing.
Eventually the total file size strains the server’s capacity, and the server’s response speed slows down (the server becomes heavy).
In the worst case, you will be unable to connect to the site due to a server error (5xx).
As a result, files such as web pages cannot be crawled normally, or crawl frequency decreases, resulting in poor crawl ability.

In this way, if the crawl ability deteriorates, the index will not be promoted, so the SEO effect will not be expected.

Eliminate soft 404 errors

To improve crawl ability, eliminate soft 404 errors.
For example, write the directive “ErrorDocument 404 /404.html” in the .htaccess file and upload the file to the server.
That way, when a URL that does not exist is accessed, 404.html (the 404 error page) is displayed and the HTTP status code 404 is returned (telling Google “Not Found”), so soft 404 errors are eliminated.
As a result, the crawler stops revisiting non-existent pages and crawling resources are saved, so crawl ability improves.

In this way, eliminating soft 404 errors is a key crawling measure.

You can also keep crawlers away from such pages by returning HTTP status code 410 instead, or by adding noindex to them.

By the way, create 404.html (the 404 error page) itself separately.
With WordPress, you can return the HTTP status code 404 simply by creating and uploading 404.php (no .htaccess settings required).

On the other hand, soft 404 errors make crawl ability worse.
In a nutshell, a soft 404 error returns the HTTP status code 200 even though the page looks like a 404 error page in the browser.
If so, it will be processed by the crawler as a normal page, leading to wasted crawling.
As a result, crawl ability will deteriorate and indexing will not be promoted, so you will not be able to expect an SEO effect.

To avoid this situation, first check the HTTP status code of the 404 error page.
You can check this in the Google Chrome browser by opening the menu, selecting “More tools” ⇒ “Developer tools”, opening the “Network” tab, and checking the “Status” column.
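As an illustration (the URL is hypothetical), what you want to see for a non-existent URL is a 404 or 410 status rather than 200:

```
Request: https://example.com/this-page-does-not-exist/
Status:  404   ← a real 404 (Not Found)
Status:  200   ← a soft 404 (looks like an error page but reports success)
```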

How to check crawler movement

To check crawler movement, take advantage of Search Console.

For example, after logging into Search Console, click Settings in the menu, then click “Open report” under Crawl statistics.

Then you can see the host’s status over a 90-day period in terms of “Total number of crawl requests”, “Total download size (bytes)”, “Average response time (ms)”, etc. Furthermore, at the bottom, the details of the crawl request are displayed in “By Response”, “By Purpose”, “By File Format”, and “By Googlebot Type”.

Since crawling statistics are displayed in this way, you can check the movement of the crawler.

Conclusion: Understand how crawlers work and make your content crawlable

Understand how crawlers work and make your content crawlable.

On the other hand, if you don’t understand how crawlers work, you don’t understand the need to improve crawl ability.
If that happens, the site will remain difficult for crawlers to search.
As a result, you will continue to run a site that is not indexed, or is indexed with incomplete content.
In that state, it is difficult to improve search rankings and increase search traffic, so the number of visits to the site does not grow as expected.
If visits do not increase, you cannot attract potential customers, and you cannot expect the site to increase sales.

To avoid such a bad situation, understand how crawlers work and make your content easy to crawl.