Insights from Baidu’s crawling behavior

Amid controversy and speculation, Baidu announced on March 21 that its search engine no longer uses the rel=next/prev tags as an indexing signal.

We then witnessed a landslide of reactions on Twitter, with plenty of questions and sharp comments aimed at the most famous of search engines.

If search engine optimization were a stock market, this would be another historic crash.

Background on the rel=next/prev tags
In 2011, these well-known tags were introduced to help search engines handle the duplicate content created by pagination.

This set of tags constitutes a powerful indexing signal because it improves the algorithms' understanding of duplication.

Following Baidu's advice, it became standard SEO best practice to use these tags in the page <head> to indicate the next and previous pages of a pagination series.
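To make that old best practice concrete, here is a minimal sketch in Python. The helper and the ?page= URL scheme are illustrative assumptions, not taken from the article; it simply builds the kind of rel=prev/next link tags that were placed in the <head> of each page in a series.

```python
# Minimal sketch (hypothetical URL scheme): build the rel=prev/next <link>
# tags that used to go in the <head> of each page in a paginated series.

def pagination_link_tags(base_url: str, page: int, last_page: int) -> list[str]:
    """Return the <link rel="prev"/"next"> tags for one page of a series."""
    def url_for(n: int) -> str:
        # Page 1 is the series root; later pages use a ?page= parameter.
        return base_url if n == 1 else f"{base_url}?page={n}"

    tags = []
    if page > 1:               # every page except the first points back
        tags.append(f'<link rel="prev" href="{url_for(page - 1)}">')
    if page < last_page:       # every page except the last points forward
        tags.append(f'<link rel="next" href="{url_for(page + 1)}">')
    return tags

# Example: page 3 of a 10-page category listing.
print("\n".join(pagination_link_tags("https://example.com/category/", 3, 10)))
# <link rel="prev" href="https://example.com/category/?page=2">
# <link rel="next" href="https://example.com/category/?page=4">
```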

Today, Baidu tells us that these tags are no longer taken into account, because its algorithms are now trained to automatically detect and handle these kinds of duplication on their own.

From a machine learning perspective, we have to admit this is an interesting development. In my opinion, there is no room for debate there.

In fact, the point is not to question other tags or to guess which patterns Baidu now uses to understand pagination.

Baidu's web performance engineer Ilya Grigorik emphasized this:

No, do use pagination. Let me rephrase: the Baidu bot is smart enough to find your next page by looking at the links on the page, so we don't need an explicit "previous page, next page" signal. That said, there are other good reasons (like a11y) why you might want or need to add them.
Ilya Grigorik (@igrigorik), March 22, 2019

The pagination mistake
Reading the reactions of the SEO community, we found that the vast majority of SEO professionals were concerned that their articles and products would no longer be discovered.

In my opinion, this raises a different issue and highlights something interesting about how websites are designed today.

Despite changes in development techniques (including some, like JavaScript, that cause confusion and controversy), the thinking behind navigation remains the same. This is true whether we are talking about online publishing or e-commerce.

This is something I have passionately condemned in my work as an SEO at a web agency: why do e-commerce managers still try to create as many product pages as there are attributes and variants available for a single product?

From my point of view, this is still meaningless.

Indexing signals still work, but pagination no longer needs to be explicitly declared. We can argue that something similar is happening with rel=canonical.

More and more SEO experts report that Baidu ignores the canonical page they declare, especially when the content does not match, and chooses its own instead.

Baidu does not seem to rely on indexing signals to discover product pages.

For both SEO and the users on your site, other aspects of how a website is organized matter more than pagination, for example:

How to handle faceted navigation.
How to build the website architecture.
Etc.
Pagination is just one way to organize content. Rather than worrying about the removal of this indexing signal, the real issue is the need to rethink pagination itself.

We often compare organic rankings to supermarket aisles. As long as we rethink the logic behind how the site is built, every piece of content has its own place.

What does the data tell us?
I will take advantage of one of the perks of working for a crawler and log analyzer company to share some of the data we can access, and to explain the benefits of log analysis more clearly.

By analyzing log data, we can see which URLs receive hits from the Baidu bot and count those hits to calculate crawl frequency.
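As a rough illustration of this step, here is a simplified Python sketch. It assumes a standard combined access-log format and identifies the crawler by the "Baiduspider" substring in the user agent; the file name and regular expression are assumptions to adapt to your own logs.

```python
# Simplified sketch: count hits per URL from an access log, keeping only
# requests whose user agent identifies the Baidu crawler. The combined log
# format and the "Baiduspider" user-agent substring are assumptions.
import re
from collections import Counter

LOG_LINE = re.compile(
    r'"(?:GET|POST) (?P<url>\S+) HTTP/[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def baidu_hits(log_path: str) -> Counter:
    hits = Counter()
    with open(log_path, encoding="utf-8", errors="replace") as f:
        for line in f:
            m = LOG_LINE.search(line)
            if m and "Baiduspider" in m.group("agent"):
                hits[m.group("url")] += 1
    return hits

if __name__ == "__main__":
    # Print the 20 most frequently crawled URLs.
    for url, count in baidu_hits("access.log").most_common(20):
        print(f"{count:6d}  {url}")
```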

This portrait of Baidu's behavior reveals insights into how the search engine handles different types of pages on the website you are analyzing.

In this case, I looked at Baidu's hits on all URLs over an extended period.

When we break down Baidu's behavior on paginated and non-paginated pages, there are significant differences between the two crawl profiles.

We can see that paginated pages are crawled more frequently. This leads us to believe that Baidu needs more passes to detect and understand this type of series.
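This breakdown can be sketched on top of the hit counts from the previous snippet. Detecting pagination via a "page" query parameter is an assumption for the example; adapt the test to your own URL scheme.

```python
# Sketch continuing from baidu_hits() above: split crawl hits between
# paginated and non-paginated URLs and compare average crawl frequency.
from urllib.parse import urlparse, parse_qs

def is_paginated(url: str) -> bool:
    # Assumption: paginated URLs carry a ?page= parameter.
    return "page" in parse_qs(urlparse(url).query)

def crawl_profile(hits) -> dict:
    paginated = {u: c for u, c in hits.items() if is_paginated(u)}
    other = {u: c for u, c in hits.items() if not is_paginated(u)}

    def avg(group):  # average hits per URL, 0 if the group is empty
        return sum(group.values()) / len(group) if group else 0.0

    return {
        "paginated URLs": len(paginated),
        "avg hits per paginated URL": avg(paginated),
        "other URLs": len(other),
        "avg hits per other URL": avg(other),
    }
```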

Even more interesting is to look at the crawl pattern: which URL is crawled first, and which URLs are crawled next?

For a single series of URLs, we observed that Baidu follows the pagination, even though in this case it does not crawl the pages in order.
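A small sketch of how this ordering can be reconstructed from the logs: it assumes you have extracted (timestamp, URL) pairs for Baidu hits, and the series prefix and ?page= convention are again illustrative assumptions.

```python
# Sketch: reconstruct the order in which Baidu first hit each page of one
# pagination series, based on (timestamp, url) pairs extracted from the logs.

def series_crawl_order(hits, series_prefix: str):
    """hits: iterable of (timestamp, url). Returns series URLs sorted by first hit."""
    first_seen = {}
    for ts, url in hits:
        if url.startswith(series_prefix):
            # Keep the earliest timestamp per URL.
            if url not in first_seen or ts < first_seen[url]:
                first_seen[url] = ts
    return sorted(first_seen.items(), key=lambda item: item[1])

# The output might show /category/?page=3 hit before /category/?page=2:
# the series is fully explored, but not strictly in order.
```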

This kind of information suggests that Baidu has identified the paginated pages as a different set from typical content pages, but still explores the entire set to determine what type of pages it is finding.

We should also remember that Baidu is not a human user.

Baidu not only uses a scheduler to decide when to visit known URLs, but can also draw the right conclusions about a set of pages regardless of the order in which they are visited.

What conclusions can be drawn from Baidu’s behavior?
Obviously, we would need to repeat this analysis on a larger data set to arrive at conclusions that can be generalized.

However, the practical takeaway from this example is that we observed a distinct crawl pattern at a specific point in time: the pages of a pagination series were crawled one after another.

It seems that Baidu needs this chain of links, even a short one as in this particular case, in order to understand how to handle the pagination series.

If you are wondering what to do about rel=prev/next and how to handle your pagination, here is my suggestion:

Make sure the pages in your series link to each other. Pay attention to how users navigate your site.

The announcement and the crawl data in the logs indicate that Baidu stopped using rel=next/prev as an indexing factor not to annoy SEO professionals, but because it can now handle pagination based on a site's existing navigation logic.

More resources:

Baidu stopped supporting rel=prev/next in the search index years ago
Baidu forgot to announce the major change, to the SEO community's disappointment
Complete best practice guide