Portions of Google's search algorithm were leaked:
An Anonymous Source Shared Thousands of Leaked Google Search API Documents with Me; Everyone in SEO Should See Them
By Rand Fishkin
On Sunday, May 5th, I received an email from a person claiming to have access to a massive leak of API documentation from inside Google’s Search division. The email further claimed that these leaked documents were confirmed as authentic by ex-Google employees, and that those ex-employees and others had shared additional, private information about Google’s search operations.
Many of the documents' claims directly contradict public statements made by Googlers over the years, in particular the company's repeated denials that click-centric user signals are employed, that subdomains are considered separately in rankings, that a sandbox exists for newer websites, that a domain's age is collected or considered, and more.
Naturally, I was skeptical. The claims made by this source (who asked to remain anonymous) seemed extraordinary. Claims like:
- In their early years, Google's search team recognized a need for full clickstream data (every URL visited by a browser) from a large percentage of web users to improve their search engine's result quality.
- A system called "NavBoost" (cited by VP of Search, Pandu Nayak, in his DOJ case testimony) initially gathered data from Google's Toolbar PageRank, and the desire for more clickstream data served as a key motivation for the creation of the Chrome browser (launched in 2008).
- NavBoost uses the number of searches for a given keyword to identify trending search demand, the number of clicks on a search result (I ran several experiments on this from 2013-2015), and the ratio of long clicks to short clicks (which I presented theories about in this 2015 video).
- Google utilizes cookie history, logged-in Chrome data, and pattern detection (referred to in the leak as “unsquashed” clicks versus “squashed” clicks) as effective means for fighting manual & automated click spam.
- NavBoost also scores queries for user intent. For example, certain thresholds of attention and clicks on videos or images will trigger video or image features for that query and related, NavBoost-associated queries.
- Google examines clicks and engagement on searches both during and after the main query (referred to as a “NavBoost query”). For instance, if many users search for “Rand Fishkin,” don’t find SparkToro, and immediately change their query to “SparkToro” and click SparkToro.com in the search result, SparkToro.com (and websites mentioning “SparkToro”) will receive a boost in the search results for the “Rand Fishkin” keyword.
- NavBoost’s data is used at the host level for evaluating a site’s overall quality (my anonymous source speculated that this could be what Google and SEOs called “Panda”). This evaluation can result in a boost or a demotion.
- Other minor factors such as penalties for domain names that exactly match unbranded search queries (e.g. mens-luxury-watches.com or milwaukee-homes-for-sale.net), a newer “BabyPanda” score, and spam signals are also considered during the quality evaluation process.
- NavBoost geo-fences click data, taking into account country and state/province levels, as well as mobile versus desktop usage. However, if Google lacks data for certain regions or user-agents, they may apply the process universally to the query results.
- During the Covid-19 pandemic, Google employed whitelists for websites that could appear high in the results for Covid-related searches.
- Similarly, during democratic elections, Google employed whitelists for sites that should be shown (or demoted) for election-related information.
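To make the query-refinement claim above concrete, here is a speculative toy model of how a system could translate "users searched X, didn't find what they wanted, refined to Y, and clicked domain Z" into a boost for Z on query X. This is purely illustrative: the function name `refinement_boost`, the session format, and the 20% threshold are invented for this sketch and are not from the leaked documents.

```python
from collections import Counter

def refinement_boost(sessions, threshold=0.2):
    """Toy model of a query-refinement click signal.

    sessions: iterable of (initial_query, refined_query, clicked_domain)
    tuples, each representing one user who searched, refined their query,
    and clicked a result.
    Returns {initial_query: {domain: share}} where a domain earns a boost
    on the *initial* query when at least `threshold` of users who started
    there ended up clicking it after refining.
    """
    totals = Counter()           # sessions observed per initial query
    counts = {}                  # per-query click counts by domain
    for initial, refined, domain in sessions:
        totals[initial] += 1
        counts.setdefault(initial, Counter())[domain] += 1

    boosts = {}
    for initial, domain_counts in counts.items():
        boosts[initial] = {
            domain: round(n / totals[initial], 2)
            for domain, n in domain_counts.items()
            if n / totals[initial] >= threshold
        }
    return boosts
```

For example, if three of four users who searched "rand fishkin" refined to "sparktoro" and clicked sparktoro.com, the sketch assigns sparktoro.com a 0.75 share on the original "rand fishkin" query, mirroring the boost behavior the source described.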
[continued ...]
sparktoro.com
Google confirms the leaked Search documents are real
theverge.com