Michał Suski
August 14, 2019

TF-IDF for SEO compared with Prominent Words and Phrases

The relationship between words can be confusing at times, even though we have a number of ways to try to decipher it. One such method is known as TF-IDF, and it lets you determine the strength of the connection (relevance) between specific words within a chosen subject. What I aim to do here is teach you more about what TF-IDF is and how it is calculated, as well as why it’s worth paying attention to in the first place when it comes to SEO article optimization and digital marketing.

What is the TF-IDF algorithm (term frequency-inverse document frequency)?

Put simply, this is a method for calculating the weight of words based on how frequently they occur. Getting more technical, it belongs to a specific group of algorithms that calculate the statistical weight of a certain term. It might sound a little confusing, but don’t let that stop you from reading on: it sounds harder than it actually is.

The analysis itself is based on how regularly the word occurs within the document, as well as the inverse frequency of the word within a specified collection of documents. This means that TF-IDF can show you which words have the most importance within the text provided. We also have the ability to refer to the top ten websites within your niche, and because of this, we can fully optimize your website using the high-frequency words.

TF – term frequency

TF = (number of times a word appears) / (article length)

Term frequency checks how often the term appears in the document in relation to the amount of content within it.
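To make the definition concrete, here is a minimal Python sketch of term frequency. The `tokenize` helper and its simple regular expression are our own assumptions for illustration; real tools usually apply more careful tokenization, stemming, or stop-word handling.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (simplified assumption)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def term_frequency(term, text):
    """Occurrences of `term` divided by the total number of words in `text`."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return Counter(tokens)[term.lower()] / len(tokens)


article = "SEO tools help with SEO audits and content planning"
# "seo" appears 2 times in a 9-word article, so TF is 2/9.
print(term_frequency("seo", article))
```

Dividing by the article length is what keeps a long article from automatically looking more relevant than a short one.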

IDF – inverse document frequency

IDF = log((number of documents in the corpus) / (number of documents in the collection where the term occurs))

It takes the number of all documents within the set, divides it by the number of documents in which the word occurs, and applies a logarithm, so the rarer a word is across the set, the higher its weight. The best part is, it can determine popular words within topics as well, so the results are fitting for the area you are working in. A similar corpus is used for natural language processing.
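A small Python sketch of inverse document frequency follows. The natural logarithm and the choice to return 0.0 for a term found nowhere are assumptions made for this example; other implementations use a different log base or add smoothing.

```python
import math


def inverse_document_frequency(term, documents):
    """log(total documents / documents containing `term`).

    `documents` is a list of token lists. Returns 0.0 when the term
    appears nowhere, to avoid dividing by zero (an assumption here).
    """
    containing = sum(1 for tokens in documents if term in tokens)
    if containing == 0:
        return 0.0
    return math.log(len(documents) / containing)


# Toy corpus (hypothetical data): "seo" is common, "audit" is rare.
corpus = [
    ["seo", "services", "agency"],
    ["seo", "audit", "checklist"],
    ["seo", "link", "building"],
    ["local", "news", "today"],
]
idf_seo = inverse_document_frequency("seo", corpus)      # log(4 / 3)
idf_audit = inverse_document_frequency("audit", corpus)  # log(4 / 1)
print(idf_seo, idf_audit)
```

The rarer word gets the higher IDF, which is exactly the effect the formula is meant to have.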

TF*IDF = (number of times a word appears / article length) multiplied by the inverse document frequency
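Putting the two parts together, the sketch below scores every word of one document against a tiny corpus and ranks the words by weight. The function name `tf_idf_scores` and the toy corpus are hypothetical, and the natural logarithm is an assumption.

```python
import math
from collections import Counter


def tf_idf_scores(document, corpus):
    """Return {term: tf * idf} for every term in `document`.

    `document` is a list of tokens and `corpus` is a list of such lists
    (the document itself is expected to be part of the corpus).
    """
    counts = Counter(document)
    scores = {}
    for term, count in counts.items():
        tf = count / len(document)
        containing = sum(1 for other in corpus if term in other)
        idf = math.log(len(corpus) / containing) if containing else 0.0
        scores[term] = tf * idf
    return scores


corpus = [
    "seo services for small business seo".split(),
    "seo tips for content writers".split(),
    "local news for small towns".split(),
]
scores = tf_idf_scores(corpus[0], corpus)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Note that "for" occurs in every document, so its IDF is log(1) = 0 and it drops to the bottom of the ranking no matter how often it is used.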

Why is it interesting for SEO?

“TF-IDF analysis allows you to optimise the balance of terms in your content according to what is already being shown by the algorithm.” – Matt Diggity

It is thanks to TF-IDF that we are able to discover words that are relevant in terms of the context of a particular expression. This is perfect for fully optimizing pages, as well as building up relevant topics to garner better SEO results. It also allows you to rank the words in order from most important to least, which means you can get a clear idea of the scope of the words for the selected topic. Understanding your competition for the blog, local news, or niche article will guide you through target keyword density and other ranking factors.

TF-IDF efficiency

The collection or corpus of documents that you are using is the basic variable that affects the final weight of the individual words in question when calculating the IDF. The problem with these collections is that the IDF needs to be recalculated for each of the words appearing in the documents, even though this remains the most effective and efficient way to do things.

The larger the set of data in question, the more data you will need to process. This can cause problems with the infrastructure as well as issues in deciding how large the collection should be. In the end, however, the larger the set, the more accurate the results you will receive.

Some of the problems that affect the efficiency of TF-IDF calculation are as follows:

  • Specifying the size of a set of documents.
  • Taking care of the impartiality of the collection.
  • Cyclic updates resulting from the creation of new words.
  • The need to have separate collections for different languages.
  • Extremely expensive TF-IDF calculation for expressions consisting of two or more words.
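The last point is easier to see with a quick count. The sketch below (our own illustration, with a hypothetical `ngrams` helper) compares the number of distinct single words with the number of distinct two-word expressions in the same short text. Every distinct expression needs its own IDF value, and the gap widens quickly as the corpus grows.

```python
def ngrams(tokens, n):
    """Return the list of n-word expressions in `tokens`, in order."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "seo services help seo agencies sell seo services".split()
unigrams = set(ngrams(tokens, 1))
bigrams = set(ngrams(tokens, 2))
# 5 distinct words, but already 6 distinct two-word expressions.
print(len(unigrams), len(bigrams))
```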

Despite all of this, TF-IDF (term weighting) remains an exceptionally effective and useful algorithm when creating and optimizing articles. No SEO tool is perfect, and this one works excellently to make things clearer for you. You can read more about ranking results in the SEO guides to TF-IDF by Matt Diggity: https://diggitymarketing.com/tfidf-for-seo/ and Authority Hacker: https://www.authorityhacker.com/tf-idf/

What don’t we know about TF-IDF?

Of course, there are things we don’t know about TF-IDF as well, and that is part of what makes it so interesting. We don’t know if Google or other search engines use TF-IDF, and even if they do, we have no idea in what form. This is one of the most basic issues with it, because it depends so heavily on the analysis of the collection or corpus of documents.

If the set has been poorly matched or is incomplete, it will also skew the judgment of the overall term weight. For example, using Wikipedia as a set of documents for an IDF tool will not necessarily work well, because any such collection will be biased to some degree.

However, if Google or other search engines do, in fact, use TF-IDF, they are in a better position than any other tool on the market because their corpus is made up of every piece of content on the web. The results would, therefore, be impartial and useful for comparison. There are some private data correlation studies that have suggested this is highly likely, according to Matt Diggity.

Effective TF-IDF analysis requirements

In order to get the best and most effective results from TF-IDF analysis, the following is required:

  • A large set of documents in a collection or corpus for valuable analysis. 
  • A database with precalculated IDF for each term in order to get valuable results fast.
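The second requirement can be sketched in a few lines of Python: compute the IDF of every term once, store it, and then score any new article with simple lookups. The function names, the in-memory dictionary standing in for a database, and the choice to score unknown terms as 0.0 are all assumptions for illustration.

```python
import math
from collections import Counter


def build_idf_table(corpus):
    """Compute IDF once for every term in `corpus` (a list of token lists)."""
    document_counts = Counter()
    for tokens in corpus:
        document_counts.update(set(tokens))  # count each term once per document
    total = len(corpus)
    return {term: math.log(total / n) for term, n in document_counts.items()}


def score(document, idf_table):
    """TF-IDF for each term of `document`, using only table lookups for IDF."""
    counts = Counter(document)
    return {
        term: (count / len(document)) * idf_table.get(term, 0.0)
        for term, count in counts.items()
    }


corpus = [
    "seo audit checklist".split(),
    "seo link building".split(),
    "local news today".split(),
]
idf_table = build_idf_table(corpus)                    # done once, then stored
result = score("seo audit audit".split(), idf_table)   # no corpus scan needed
print(result)
```

The expensive pass over the corpus happens only when the table is built, which is why the table has to be refreshed whenever the collection changes.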

At Surfer, we realized how the use of incorrect datasets could easily blur the entire picture, and that it is also very hard to define the best datasets. We decided that it would be best to leave this decision to the search engine’s algorithm, so we analyze the semantic features of the top ten ranking pages.

We believe that the top ten competitors are representative of the most relevant websites, according to Google. Surfer works to find the common words and phrases for these websites, and in many cases, the results are very similar to the TF-IDF results produced by other tools. After this, Surfer collects a second set of keywords: the most popular words and phrases that occur on each of the top ten websites.

It will then cross-examine both of the data sets (common and popular) before selecting the most meaningful keywords. Those that are less important (think privacy policies and terms and conditions) are rejected and set aside. Thanks to these operations, we get the most prominent words, just like TF-IDF, but we are also more resistant to potential errors.

It is also worth noting that TF-IDF can be a ranking factor even though there are a number of different algorithms that search engines are able to use. The prominent words and phrases function will work regardless of the method, while TF-IDF only works when we assume that Google uses this (or a similar) technique.

How Prominent words and phrases are computed

The process of determining Prominent words and phrases begins in the same way as standard TF-IDF, that is to say, with the calculation of the frequency of keyword occurrence. The results of this algorithm are then available in the form of tables: one for popular words, and one for popular phrases. A word or phrase will appear here if it is one of the top 30, and it must appear in the article at least twice.

The second part of this process is the calculation of common words and expressions for the competitors appearing in the top ten results. We rely directly on the current set of pages provided by the algorithm of Google (or another search engine), and a word or phrase is listed if it appears on at least four pages from the top ten.

The two sets are then intersected, so the results contain only the words and phrases that are found in both. The phrases that have been obtained in this manner paint an accurate and clear picture of the articles that are currently being promoted by the search engine. This is regardless of the way in which said content was analyzed.
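The two steps above can be sketched roughly as follows. This is not Surfer's actual implementation: the function names are hypothetical, the sketch handles single words only, and treating "popular" as the union of each page's top terms is our own reading of the description. The thresholds (top 30, at least two occurrences, at least four pages) are the ones mentioned in the text.

```python
from collections import Counter


def popular_terms(pages, top_n=30, min_count=2):
    """Union of each page's most frequent terms (assumed reading of 'popular')."""
    popular = set()
    for tokens in pages:
        for term, count in Counter(tokens).most_common(top_n):
            if count >= min_count:
                popular.add(term)
    return popular


def common_terms(pages, min_pages=4):
    """Terms that appear on at least `min_pages` of the pages."""
    page_counts = Counter()
    for tokens in pages:
        page_counts.update(set(tokens))  # count each term once per page
    return {term for term, n in page_counts.items() if n >= min_pages}


def prominent_terms(pages, top_n=30, min_count=2, min_pages=4):
    """Terms present in both the popular set and the common set."""
    return popular_terms(pages, top_n, min_count) & common_terms(pages, min_pages)


# Toy stand-in for tokenized top-ranking pages (hypothetical data).
pages = [
    "seo agency seo pricing".split(),
    "seo audit seo tools".split(),
    "seo company reviews".split(),
    "seo services privacy privacy".split(),
    "cooking recipes privacy".split(),
]
print(prominent_terms(pages))  # {'seo'}
```

In this toy data "privacy" is popular on one page but shows up on only two pages, so it fails the common-terms threshold and is dropped, while "seo" passes both checks.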

Why we decided to go with Prominent instead of TF-IDF analysis

The results from TF-IDF are incredibly valuable, but they only contain words. Expressions are what differentiate the documents, and the TF-IDF analysis for expressions from a large database is almost impossible due to the number of calculations involved. There are two reasons that we rely heavily on Prominent analysis:

  • We can analyze expressions that have a greater differentiating value for SEO.
  • We are not trying to recreate the Google algorithm; we analyze its results. Thanks to this, Prominent words and phrases are independent of how (and if) Google uses TF-IDF.

SEO tools results comparison

We conducted an experiment comparing TF-IDF analysis with Prominent words and phrases. The phrase that we used was “SEO services in the USA”. When we ran both and examined the final results, we reached the following conclusions:

  • More words and expressions were found in Prominent.
  • There was greater accuracy, and instead of the word SEO, we received a dozen variants.

[Image: insights about relevancy and keywords in use]

As a side note, the TF-IDF tool we used was SEObility. There are quite a lot of tools that provide natural language analysis.

Conclusion

TF-IDF is both an effective and valuable way to optimize your article for a target keyword. However, we decided that our own solution would be the best way forward. Prominent words and phrases are able to provide more data and are based directly on the results that have been provided by the Google algorithm. You would find TF-IDF in our software if it provided better guidance like keyword density for each term.

True Density

All those calculations now have a great outcome: the True Density feature. It provides keyword density for all prominent terms and gives you simple suggestions for getting rid of over- and under-optimization. There is no other keyword density tool on the market that takes into account both your document length and your competitors’ document length.

[Image: improve your ranking in search results with true keyword density]

Read more about True Density here!

In many ways, Prominent is very similar to TF-IDF, and the first part of the process remains the same. The second part is the one that is different (as we have explained), but even so, it is not particularly complicated and works just as quickly as standard TF-IDF. You will find that the results are much clearer and more detailed than usual, because the requirements for the Prominent list are based on an analysis of each competitor’s article. It’s a system well worth using in SEO to create and optimize documents that target specific queries in English and any other language. Save time and add prominent words and phrases to your process.

This is how content optimization becomes technical SEO.

Did you like it? Comment, share on social and make sure that we have your email to keep you posted about new features in Surfer tool and search engine marketing tips!
