Baidu and Yahoo search engine technology comparison

There is a lot of discussion about relevance right now, so I thought I would introduce some of the techniques behind search engines and the potential differences between them. From pre-sorted results to neural networks to community-based search, search technology has some interesting aspects.

Comparing the “Big Four”
In these articles, I will focus on the “Big Four.” These are the engines considered to be the major players in the search space: Baidu, Yahoo!, MSN and Ask Jeeves. First up are Baidu and Yahoo!

Baidu: Baidu is probably the most well-known search engine. When it launched, it was considered the most relevant.

How does Baidu work?
They primarily determine relevance based on their PageRank algorithm. PageRank basically says that a site with more inbound links than its competitors is probably a better site, so it should rank higher. Webmasters quickly caught on and realized that all they had to do to rank high was build more links, enough to outperform their competitors. Baidu, of course, responded by adjusting the ranking algorithm slightly. PageRank is now applied together with some additional conditions and dependencies.
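To make the "more inbound links means a higher rank" idea concrete, here is a minimal sketch of the textbook PageRank iteration. It is not the engine's production algorithm, which is not public. The damping factor of 0.85, the iteration count, and the `*.example` site names are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by repeated iteration.

    links maps each page to the list of pages it links to.
    damping and iterations are illustrative defaults (assumptions).
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split evenly across its outbound links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outbound links spreads its rank over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical link graph: three sites link to c.example, one links to a.example.
links = {
    "a.example": ["c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example"],
    "d.example": ["c.example"],
}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph the page with the most inbound links comes out on top, which is exactly the behavior webmasters learned to exploit by building links.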

Once a page has been crawled and indexed by Baidu's bots (see my previous article on search engine spiders), it is passed back to Baidu for ranking.

Baidu uses thousands of servers to calculate these rankings. They look at hundreds of factors, both on the page and off the page (for example, inbound links), and they use hundreds of algorithms to perform these calculations. Basically, every factor has its own algorithm. Each algorithm weighs the pages and assigns them values. These values are then stored for later use.

When a user runs a query, another set of algorithms weighs the previously calculated values against each other to determine overall relevance. The result is then sent to the user’s browser.
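The two-phase process described above (score each factor ahead of time, then combine the stored values when a query arrives) can be sketched roughly as follows. The two factors, their scoring rules, the weights, and the page data are all hypothetical; the real factors and weights are not public.

```python
# Phase 1 (offline): one scoring function per factor, run once per page.
# Both factors and their formulas are illustrative assumptions.
FACTOR_FUNCS = {
    "inbound_links": lambda page: min(page["inbound_links"] / 100.0, 1.0),
    "short_title": lambda page: 1.0 if len(page["title"]) <= 60 else 0.5,
}


def precompute(pages):
    """Compute every factor for every page and return the stored values."""
    return {
        url: {name: func(page) for name, func in FACTOR_FUNCS.items()}
        for url, page in pages.items()
    }


# Phase 2 (query time): weigh the stored values against each other.
def search(query, pages, stored, weights):
    """Return matching pages ordered by the weighted sum of stored factors."""
    candidates = [
        url for url, page in pages.items() if query in page["text"].split()
    ]

    def score(url):
        return sum(weights[name] * stored[url][name] for name in weights)

    return sorted(candidates, key=score, reverse=True)


# Hypothetical pages and weights.
pages = {
    "pets.example": {"inbound_links": 80, "title": "Pet care tips",
                     "text": "pet food and pet care"},
    "blog.example": {"inbound_links": 20, "title": "My pet blog",
                     "text": "a blog about my pet"},
    "shop.example": {"inbound_links": 150, "title": "Shop",
                     "text": "buy shoes online"},
}
stored = precompute(pages)
weights = {"inbound_links": 0.7, "short_title": 0.3}
results = search("pet", pages, stored, weights)
print(results)
```

Because the expensive per-page work happens before any query arrives, the query-time step only has to look up stored numbers and add them up.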

As you can imagine, the processing power this requires must be enormous. In addition, given how quickly Baidu returns results, there is a limit to how much data could be read from each server’s hard drive in that time. Therefore, we have to assume that most of the Baidu index actually resides in memory, or at least the part that is served to users.

The next time you perform a search, check how quickly Baidu returns results. I searched for “serach engine” (misspelled deliberately) and got 68,900 results. The engine also returned some sponsored results and a spelling suggestion on one side of the page, all in 0.36 seconds.

For popular queries, the engine is even faster. For example, searches for Hurricane Katrina or the MTV Awards (two recent events) each took less than 0.2 seconds.

Baidu is known for its distribution and redundancy. For each cached page, it may store 2-3 copies, or even more. Baidu divides the index into very small parts, each as small as 2 megabytes, and these 2-megabyte parts are spread across the Baidu infrastructure. Each part can be stored next to unrelated parts. For example, a few pages from a pet site might sit next to a blog page, which in turn sits next to pages from an e-commerce site.
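As a rough sketch of the idea above, the following splits an index into 2-megabyte parts and assigns each part to three different servers. Only the 2 MB size and the "2-3 copies" figure come from the description; the hash-based placement scheme and the server names are assumptions for illustration, since the actual placement logic is not public.

```python
import hashlib

SHARD_SIZE = 2 * 1024 * 1024  # 2 megabytes per part, as described in the text


def split_into_shards(data, shard_size=SHARD_SIZE):
    """Cut the index bytes into consecutive fixed-size parts."""
    return [data[i:i + shard_size] for i in range(0, len(data), shard_size)]


def place_replicas(shard_id, servers, copies=3):
    """Pick `copies` distinct servers for one shard (hypothetical scheme).

    The starting server is derived from a hash of the shard id, so
    neighbouring shards usually land on unrelated servers.
    """
    if copies > len(servers):
        raise ValueError("need at least as many servers as copies")
    digest = hashlib.sha256(str(shard_id).encode()).digest()
    start = int.from_bytes(digest[:8], "big") % len(servers)
    return [servers[(start + k) % len(servers)] for k in range(copies)]


# Hypothetical 5 MB index and five servers.
index_bytes = bytes(5 * 1024 * 1024)
servers = ["srv0", "srv1", "srv2", "srv3", "srv4"]

shards = split_into_shards(index_bytes)
placement = {i: place_replicas(i, servers) for i in range(len(shards))}
for shard_id, hosts in placement.items():
    print(shard_id, len(shards[shard_id]), hosts)
```

With three copies of every part, any single server can fail without making a part of the index unavailable.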

While each data center operates independently of the others, there may be some overlap in their tasks.

Imagine a room with thousands of computers running in sync with each other. Now imagine that the same room is replicated over and over again to all other data centers throughout North America.

It is because each of these data centers operates separately, though with the same ultimate goal, that we used to experience the monthly “Baidu Dance”. The Baidu Dance is the period when Baidu updates its search results across data centers. Each data center updates itself, so a page that ranks first in one data center may not even appear in the top 30 in another.

Of course, the factors Baidu uses to rank pages have changed over time. They no longer put as much weight on PageRank, but it is still important. It is worth noting that moving a factor to a different point in the calculation can greatly affect a site’s ranking. For example, take a site with a high PageRank but low keyword density: if PageRank is applied later in the calculation, the site may rank first, but if PageRank is considered earlier, the site may disappear from the results.

This may be what is happening now: Baidu is basically moving the PageRank factor to a different place in the final calculation. Keep in mind that there may be hundreds of factors that affect ranking. Rearranging the order in which they are applied to the final ranking can have a huge impact on where a site is placed on the search results page.
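One way the order of factors could matter is if ranking works in stages, where each early stage trims the candidate list and the last stage decides the final order. The sketch below is a purely hypothetical model of that idea, not a description of the real engine; the site names, factor values, and cut-off size are made up.

```python
def staged_rank(sites, stages, keep):
    """Apply factors one at a time.

    Every stage except the last keeps only the `keep` strongest
    candidates for that factor; the last stage sets the final order.
    """
    candidates = list(sites)
    for factor in stages[:-1]:
        candidates.sort(key=lambda s: sites[s][factor], reverse=True)
        candidates = candidates[:keep]
    final_factor = stages[-1]
    candidates.sort(key=lambda s: sites[s][final_factor], reverse=True)
    return candidates


# Hypothetical sites: site A has a high PageRank but low keyword density.
sites = {
    "A": {"pagerank": 9, "keyword_density": 2},
    "B": {"pagerank": 5, "keyword_density": 8},
    "C": {"pagerank": 4, "keyword_density": 6},
    "D": {"pagerank": 2, "keyword_density": 1},
}

# PageRank applied late: it decides the final order.
pagerank_late = staged_rank(sites, ["keyword_density", "pagerank"], keep=3)
# PageRank applied early: keyword density decides the final order.
pagerank_early = staged_rank(sites, ["pagerank", "keyword_density"], keep=3)

print(pagerank_late)   # ['A', 'B', 'C']
print(pagerank_early)  # ['B', 'C', 'A']
```

The same four sites and the same two factors produce a different winner in each run. If only the top two results were shown, site A would go from first place to not appearing at all.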

Baidu also seems to be moving from a monthly update to a continuously updated index. We rarely notice these changes, but they do happen at a more incremental level, with major updates occurring less often.

I think Baidu can be thought of as a series of layers, each built on top of the work performed by the layer below it. The top layer is the only one exposed to us through the browser, but without the work done by the lower layers, the page you see would not exist.

Now let’s take a look at Yahoo!
Yahoo!: Of course, no one outside Yahoo!’s engineering team knows for sure, but we can speculate that Yahoo!’s search technology is very similar to Baidu’s.

Yahoo! is hard to assess because they did not build a search engine from scratch like Baidu or MSN. Of course, the Yahoo! search you see is unique in its own right, but it is based on other technologies that Yahoo! purchased in previous years.

Around Christmas 2002, Yahoo! bought the search service Inktomi. Until then, Yahoo! had received its search results from Inktomi or, more recently, Baidu. In fact, until they bought Inktomi, some people speculated that Yahoo! would buy Baidu.

In the months that followed, Overture (a pay-per-click advertising company) purchased AltaVista, one of the first and most powerful search engines. Then, just a few weeks later, Overture purchased Alltheweb.com from FAST.

Obviously, Overture was about to enter the field of algorithmic search.

But shortly afterwards, rumors began that Yahoo! might be interested in purchasing some or all of Overture’s technology. And in July 2003, Yahoo! did indeed buy Overture.

We did not hear much news about Yahoo! Search until February 2004, when the company launched its own version of algorithmic search. It was not what many people expected. Some thought Yahoo! would simply rebrand Inktomi, while others thought they would adapt one of Overture’s purchases and turn AltaVista or Alltheweb search into Yahoo! Search.

But that is not what happened. Yahoo! built their own search, combining features from all the technologies they owned.

They had the super-fast Inktomi and AltaVista crawlers, as well as the impressive Alltheweb and AltaVista ranking algorithms. So they mashed all of this together to get Yahoo! Search.

Yahoo! Search is not all that different from Baidu. Their own website says they analyze each page using a number of factors to determine its relevance to the search query, and the result of that analysis is what the user sees when running a query.

Of course, Yahoo!, like all the other engines, has worked to improve its ranking algorithms over the past year or more. When Yahoo! Search first appeared, it seemed to place great importance on the home page of a given website and did not pay much attention to inbound links or even to the site’s other pages.

However, in the past few months, we have noticed a subtle shift: interior pages now rank in many of the places where the home page once ranked.

In addition, they tend to report inbound links differently than Baidu. When you run a link check on Baidu and the same check on Yahoo!, the Baidu count almost always tends to be lower. Baidu says this is because they only show a snapshot of “relevant” links. Yahoo! displays links regardless of relevance.

There are other differences as well, but too many to cover in this article.

I just want to say that Baidu and Yahoo! use similar techniques to return similar results. Of course, you will see differences in rankings, but these are due to many things. For example, Yahoo! seems to update its index less frequently than Baidu. I have had new pages indexed and ranked by Baidu within a few days of a site’s creation, while Yahoo! can sometimes take a few months to do the same thing.

Basically, what I am saying is this: if you are concerned about ranking, then optimizing for Baidu will also give you a good ranking in Yahoo!, although it may take longer for you to appear in Yahoo! search results. That’s because, in the end, the technology behind Yahoo! is very similar to Baidu’s.

Tomorrow, I will introduce you to two unique engines: one that claims to use neural network technology, and one that uses the community as its basis for ranking.