I don’t simply mean looking at where our websites rank for a specific keyword or set of keywords; I mean the actual content of the SERPs.
For every keyword you search in Google with the SERP expanded to show 100 results, you’ll find, on average, around 3,000 words.
That’s a lot of content, and the reason it has the potential to be so valuable to an SEO is that a lot of it has been algorithmically rewritten or cherry-picked from a page by Google to best address what it thinks the needs of the searcher are.
Ask yourself: why would Google want to do that?
It must take a fair amount of resources, when it would be much easier to simply display the custom meta description assigned to a page.
The answer, in my opinion, is that Google only cares about the searcher, not the poor soul charged with writing a new meta description for a page.
Google cares about creating the best search experience today, so people come back and search again tomorrow.
One way it does that is by selecting the parts of a page it wants to appear in a SERP feature or in SERP-displayed metadata that it thinks best match the context or query intent a person has when they use the search engine.
With that in mind, the ability to analyze the language of the SERPs at scale has the potential to be an incredibly valuable tactic for an SEO, and not just for improving ranking performance.
This kind of approach can help you better understand the needs and desires of potential customers, the vocabulary likely to resonate with them, and the related topics they want to engage with.
In this article, you’ll learn some strategies you can use to do this at scale.
Be warned, these techniques depend on Python, but I want to show that this is nothing to be afraid of. In fact, it’s the perfect opportunity to try to learn it.
Don’t Fear Python
I am not a developer, and have no coding background beyond some basic HTML and CSS. I picked Python up relatively recently, and for that, I have Robin Lord from Distilled to thank.
I can’t recommend enough that you take a look at his slides on Python and his very useful, easily accessible guide on using Jupyter Notebooks, all contained in this handy Dropbox.
For me, Python was something that always seemed difficult to grasp: I didn’t know where the scripts I was trying to use were going, what was working, what wasn’t, and what output I should expect.
If you’re in that situation, read Lord’s guide. It will help you realize that it doesn’t need to be that way, and that working with Python in a Jupyter Notebook is actually more straightforward than you might think.
It will also put each technique referenced in this article easily within reach, and give you a platform to conduct your own research and set up some powerful Python automation of your own.
Getting Your SERP Data
As an employee, I’m lucky to have access to Conductor, where we can run SERP reports that use an external API to pull SERP-displayed metadata for a set of keywords.
This is a straightforward way of getting the data we need in a nice, clean format we can work with.
It looks like this:
Another way to get this information at scale is to use a custom extraction on the SERPs with a tool like Screaming Frog or DeepCrawl.
I have written about how to do this, but be warned: it is maybe just a tiny, insignificant bit in violation of Google’s terms of service, so do it at your own peril (but remember, proxies are the perfect antidote to this risk).
Alternatively, if you are a fan of irony and think it’s a touch rich that Google says you can’t scrape its content to offer your users a better service, then please, by all means, deploy this technique with glee.
If you aren’t comfortable with this approach, there are also numerous APIs that are quite cost-effective, easy to use, and provide the SERP data you need to run this kind of analysis.
The final method of getting the SERP data in a clean format is slightly more time-consuming: you use the Scraper Chrome extension and do it manually for each keyword.
If you’re really going to scale this up and want to work with a reasonably large corpus (a term I’m going to use a lot; it’s just a fancy way of saying a lot of words) to perform your analysis, this final option probably isn’t going to work.
However, if you’re interested in the concept and want to run some smaller tests to make sure the output is valuable and applicable to your own campaigns, I’d say it’s perfectly fine.
Hopefully, at this stage, you’re ready and willing to take the plunge with Python using a Jupyter Notebook, and you’ve got some nicely formatted SERP data to work with.
Let’s get to the interesting stuff.
SERP Data & Linguistic Analysis
As I’ve mentioned above, I’m not a developer, coding expert, or computer scientist.
What I am is someone interested in words, language, and linguistic analysis (the cynics out there might call me a failed journalist trying to scratch out a living in SEO and digital marketing).
That’s why I’ve become fascinated with how real data scientists are using Python, NLP, and NLU to do this type of research.
Put simply, all I’m doing here is leveraging tried and tested methods for linguistic analysis and finding a way to apply them that is relevant to SEO.
For most of this article, I’ll be talking about the SERPs, but as I’ll explain at the end, this is just scratching the surface of what is possible (and that’s what makes this so exciting!).
Cleaning Text for Analysis
At this point, I should point out that a very important prerequisite of this type of analysis is ‘clean text’. This type of ‘pre-processing’ is essential in ensuring you get a good quality set of results.
While there are lots of great resources out there about preparing text for analysis, for the sake of brevity, you can assume that my text has been through most or all of the processes below:
- Lower case: The methods I mention below are case sensitive, so making all the copy we use lower case will avoid duplication (if you didn’t do this, ‘yoga’ and ‘Yoga’ would be treated as two different words).
- Remove punctuation: Punctuation doesn’t add any extra information for this type of analysis, so we’ll need to remove it from our corpus.
- Remove stop words: ‘Stop words’ are commonly occurring words within a corpus that add no value to our analysis. In the examples below, I’ll be using the predefined stop word lists from the excellent NLTK or spaCy packages to remove them.
- Spelling correction: If you’re worried about incorrect spellings skewing your data, you can use a Python library like TextBlob that offers spelling correction.
- Tokenization: This process will convert our corpus into a series of words. For example, this:
(['This is a sentence'])
will become:
(['this', 'is', 'a', 'sentence'])
- Stemming: This refers to removing suffixes like ‘-ing’, ‘-ly’, etc. from words, and is totally optional.
- Lemmatization: Similar to stemming, but rather than just removing the suffix from a word, lemmatization converts a word to its root (e.g., “playing” becomes “play”). Lemmatization is often preferred to stemming.
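As a rough illustration of the first few steps in the list above (lower-casing, punctuation removal, tokenization, and stop word removal), here is a minimal sketch that uses only the Python standard library. The `preprocess` function and the tiny `STOP_WORDS` set are hypothetical stand-ins for illustration; the analysis in this article relies on the stop word lists that ship with NLTK or spaCy.

```python
import string

# Hypothetical, deliberately tiny stop word list for illustration only.
# In practice you would use the predefined lists from NLTK or spaCy.
STOP_WORDS = {"a", "an", "and", "for", "is", "of", "the", "this", "to"}


def preprocess(text, stop_words=STOP_WORDS):
    """Lower-case, strip ASCII punctuation, split on whitespace, drop stop words."""
    lowered = text.lower()
    # string.punctuation only covers ASCII characters, so curly quotes
    # and similar symbols would need separate handling.
    no_punctuation = lowered.translate(str.maketrans("", "", string.punctuation))
    tokens = no_punctuation.split()
    return [token for token in tokens if token not in stop_words]


print(preprocess("This is a sentence about Yoga, and yoga poses!"))
# ['sentence', 'about', 'yoga', 'yoga', 'poses']
```

Spelling correction, stemming, and lemmatization are left out here because they depend on third-party libraries and language data.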
This might all sound a bit complicated, but don’t let it dissuade you from pursuing this type of research.
I’ll be linking out to resources throughout this article which break down exactly how you apply these processes to your corpus.
NGram Analysis & Co-Occurrence
The first and simplest approach we can apply to our SERP content is an analysis of nGram co-occurrence. This means we’re counting the number of times a word or combination of words appears within our corpus.
Why is this useful?
Analyzing our SERPs for co-occurring sequences of words can provide a snapshot of which words or phrases Google deems most relevant to the set of keywords we are analyzing.
For example, to create the corpus I’ll be using throughout this post, I have pulled the top 100 results for 100 keywords around yoga.
This is just for illustrative purposes; if I were doing this exercise with more quality control, the structure of this corpus might look slightly different.
All I’m going to use now is the Python Counter function, which will look for the most commonly occurring combinations of two- and three-word phrases in my corpus.
The output looks like this:
You can already start to see some interesting trends appearing around topics that searchers might be interested in. I could also collect MSV (monthly search volume) for some of these phrases that I could target as additional campaign keywords.
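To make the counting step concrete, here is a small sketch of two- and three-word phrase counting. It assumes the counter mentioned above is `collections.Counter` from the standard library. The `ngrams` and `count_ngrams` helpers and the sample snippets are made up for illustration and are not taken from the notebook the article refers to.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return list(zip(*(tokens[i:] for i in range(n))))


def count_ngrams(documents, sizes=(2, 3)):
    """Count n-grams document by document so phrases never span two snippets."""
    counts = Counter()
    for tokens in documents:
        for n in sizes:
            counts.update(ngrams(tokens, n))
    return counts


# Hypothetical pre-processed SERP snippets (already tokenized).
serp_snippets = [
    ["yoga", "poses", "beginners"],
    ["best", "yoga", "poses", "beginners"],
    ["yoga", "mat", "reviews"],
]

counts = count_ngrams(serp_snippets)
for phrase, count in counts.most_common(3):
    print(" ".join(phrase), count)
```

Counting per snippet, rather than over one long merged list, avoids inventing phrases that straddle the end of one result and the start of the next.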
At this point, you might think it’s obvious that all these co-occurring phrases contain the word yoga, as that is the main focus of my dataset.
This would be an astute observation. It’s known as a ‘corpus-specific stop word’, and because I’m working with Python, it’s simple to create either a filter or a function that removes those words.
My output then becomes this:
These two examples can help provide a snapshot of the topics that competitors are covering on their landing pages.
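One way such a corpus-specific stop word filter might look is sketched below. It is an assumption about the approach rather than the exact function used for the outputs above: the word is stripped from each snippet before the phrases are counted. The names `CORPUS_STOP_WORDS`, `remove_corpus_stop_words`, and `count_bigrams`, and the sample data, are hypothetical.

```python
from collections import Counter

# Assumption: "yoga" is the only corpus-specific stop word in this example.
CORPUS_STOP_WORDS = {"yoga"}


def remove_corpus_stop_words(documents, stop_words=CORPUS_STOP_WORDS):
    """Drop corpus-specific stop words from every tokenized document."""
    return [[t for t in tokens if t not in stop_words] for tokens in documents]


def count_bigrams(documents):
    """Count two-word phrases within each document."""
    counts = Counter()
    for tokens in documents:
        counts.update(zip(tokens, tokens[1:]))
    return counts


# Hypothetical pre-processed SERP snippets (already tokenized).
serp_snippets = [
    ["yoga", "poses", "beginners"],
    ["best", "yoga", "poses", "beginners"],
    ["yoga", "mat", "reviews"],
]

filtered = remove_corpus_stop_words(serp_snippets)
print(count_bigrams(filtered).most_common(2))
# [(('poses', 'beginners'), 2), (('best', 'poses'), 1)]
```

Removing the word first makes its neighbours adjacent (“best yoga poses” becomes “best poses”). An alternative is to count phrases on the full text and then discard any phrase that contains the stop word; which one suits you depends on whether those merged phrases are meaningful in your vertical.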
For example, if you wanted to demonstrate content gaps in your landing pages against your top performing competitors, you could use a table like this to illustrate these recurring themes.
Incorporating them is going to make your landing pages more comprehensive, and will create a better user experience.
The best tutorial I’ve found for creating a counter like the one I’ve used above is in the example Jupyter Notebook that Robin Lord has put together (the same one linked to above). It will take you through exactly what you need to do, with examples, to create a table like the one you can see above.
That’s pretty basic though, and isn’t always going to give you results that are actionable.
So what other kinds of useful analysis can we run?
Part of Speech (PoS) Tagging & Analysis
PoS tagging is defined as:
“In corpus linguistics, Part-Of-Speech Tagging (POS tagging or POST), also called grammatical tagging, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context, i.e., its relationship with adjacent and related words in a phrase, sentence, or paragraph.”
What this means is that we can assign every word in our SERP corpus a PoS tag based not only on the definition of the word, but also on the context in which it appears in a SERP-displayed meta description or page title.
This is powerful, because it means we can drill down into specific PoS categories (verbs, nouns, adjectives, etc.), which can provide valuable insights into how the language of the SERPs is constructed.
Side note: In this example, I am using the NLTK package for PoS tagging. Unfortunately, PoS tagging in NLTK isn’t available in many languages.
If you are interested in pursuing this technique for languages other than English, I recommend looking at TreeTagger, which offers this functionality across a number of different languages.
Using our SERP content (remembering it has been ‘pre-processed’ using some of the methods mentioned earlier in the post) for PoS tagging, we can expect an output like this in our Jupyter Notebook:
You can see each word now has a PoS tag assigned to it. Click here for a glossary of what each of the PoS tags you’ll see means.
In isolation, this isn’t particularly useful, so let’s create some visualizations (don’t worry if it seems like I’m jumping ahead here; I’ll link to a guide at the end of this section which shows exactly how to do this) and drill into the results:
I can quickly and easily identify the linguistic trends across my SERPs, and I can start to factor that into the approach I take when I optimize landing pages for those terms.
This means that I’m not only going to optimize for the query term by including it a certain number of times on a page (thinking beyond that old-school keyword density mindset).
Instead, I’m going to target the context and intent that Google seems to favor, based on the clues it’s giving me through the language used in the SERPs.
In this case, those clues are the most commonly occurring nouns, verbs, and adjectives across the results pages.
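The sketch below shows how the most common nouns, verbs, and adjectives could be pulled out once the corpus has been tagged. It assumes the tagged tokens are `(word, tag)` pairs with Penn Treebank tags, which is the shape NLTK’s `pos_tag` returns for English. The sample list is hand-written for illustration rather than real tagger output, and `top_words_by_category` is a hypothetical helper.

```python
from collections import Counter, defaultdict

# Penn Treebank tags share a prefix within each broad category:
# NN, NNS, NNP... are nouns; VB, VBD, VBG... are verbs; JJ, JJR, JJS are adjectives.
CATEGORY_PREFIXES = {"noun": "NN", "verb": "VB", "adjective": "JJ"}


def top_words_by_category(tagged_tokens, top_n=5):
    """Return the most frequent words for each broad PoS category."""
    counts = defaultdict(Counter)
    for word, tag in tagged_tokens:
        for category, prefix in CATEGORY_PREFIXES.items():
            if tag.startswith(prefix):
                counts[category][word] += 1
    return {category: counts[category].most_common(top_n) for category in CATEGORY_PREFIXES}


# Hand-written (word, tag) pairs standing in for tagged SERP text.
tagged_serp_tokens = [
    ("yoga", "NN"), ("poses", "NNS"), ("improve", "VB"), ("best", "JJS"),
    ("yoga", "NN"), ("learn", "VB"), ("improve", "VB"), ("gentle", "JJ"),
]

for category, words in top_words_by_category(tagged_serp_tokens).items():
    print(category, words)
```

The resulting per-category counts are the kind of data you could then feed into a bar chart for each part of speech.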
We know, based on patents Google has around phrase-based indexing, that it has the potential to use “related phrases” as a factor when it is ranking pages.
These are likely to consist of semantically relevant phrases that co-occur on top performing landing pages and help crystallize the meaning of those pages for the search engines.
This type of research might give us some insight into what those related phrases could be, so factoring them into landing pages has the potential to be valuable.
Now, to make all this SERP content really actionable, your analysis needs to be more targeted.
Well, the great thing about developing your own script for this analysis is that it’s really easy to apply filters and segment your data.
For example, with a few keystrokes I can generate an output that compares Page 1 trends vs. Page 2:
If there are any obvious differences between what I see on Page 1 of the results versus Page 2 (for instance, “beginning” being the most common verb on Page 1 vs. “training” on Page 2), then I will drill into this further.
These could be the types of words that I place more emphasis on during on-page optimization, to give the search engines clearer signals about the context of my landing page and how it matches query intent.
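A segmented comparison like this could be sketched as follows. The sketch assumes each result is stored with its ranking position and its PoS-tagged text, and that a results page holds ten organic results. The data layout, the `RESULTS_PER_PAGE` constant, and the function names are hypothetical.

```python
from collections import Counter

# Assumption: ten organic results per page, so positions 1-10 are Page 1.
RESULTS_PER_PAGE = 10


def page_of(position):
    """Map a ranking position (1-based) to its results page number."""
    return (position - 1) // RESULTS_PER_PAGE + 1


def top_verbs_by_page(results, pages=(1, 2), top_n=3):
    """Count verbs (tags starting with 'VB') separately for each results page."""
    counts = {page: Counter() for page in pages}
    for position, tagged_tokens in results:
        page = page_of(position)
        if page in counts:
            counts[page].update(word for word, tag in tagged_tokens if tag.startswith("VB"))
    return {page: counter.most_common(top_n) for page, counter in counts.items()}


# Hypothetical (position, tagged tokens) rows for illustration.
results = [
    (1, [("beginning", "VBG"), ("yoga", "NN")]),
    (4, [("beginning", "VBG"), ("practice", "NN")]),
    (12, [("training", "VBG"), ("teacher", "NN")]),
    (15, [("training", "VBG"), ("beginning", "VBG")]),
]

print(top_verbs_by_page(results))
# {1: [('beginning', 2)], 2: [('training', 2), ('beginning', 1)]}
```

Swapping the `"VB"` prefix for `"NN"` or `"JJ"` would give the same comparison for nouns or adjectives.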
I can now start to build a picture of the type of language Google chooses to display in the SERPs for the top ranking results across my target vertical.
I can also use this as a hint as to the type of vocabulary that will resonate with searchers looking for my products or services, and incorporate some of those terms into my landing pages accordingly.
I can also categorize my keywords based on structure, intent, or stage in the buying journey, and run the same analysis to compare trends and make my actions more specific to the results I want to achieve.
For example, I could compare trends between yoga keywords modified with the word “novice” and those modified with the word “advanced”.
This will give me more clues about what Google thinks is important to searchers looking for those types of terms, and how I might be able to better optimize for those terms.
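Grouping keywords by modifier before running the analysis could be as simple as the sketch below. The `MODIFIERS` tuple, the `group_by_modifier` helper, and the sample keywords are hypothetical; a real keyword set would likely need more modifiers and some handling for keywords that match several of them.

```python
from collections import defaultdict

# Assumption: the two modifiers mentioned above are the only segments of interest.
MODIFIERS = ("novice", "advanced")


def group_by_modifier(keywords, modifiers=MODIFIERS):
    """Bucket keywords by the first modifier they contain as a whole word."""
    groups = defaultdict(list)
    for keyword in keywords:
        words = keyword.lower().split()
        matched = [modifier for modifier in modifiers if modifier in words]
        label = matched[0] if matched else "unmodified"
        groups[label].append(keyword)
    return dict(groups)


# Hypothetical keyword list for illustration.
keywords = [
    "novice yoga poses",
    "advanced yoga poses",
    "yoga mat",
    "yoga for advanced students",
]

print(group_by_modifier(keywords))
```

Each group’s SERP text can then be run through the same PoS counts, and the results compared side by side.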
If you want to run this kind of analysis on your SERP data, follow this simple walkthrough on Kaggle, which applies PoS tagging to movie titles. It walks you through the process I’ve gone through to create the visuals used in the screenshots above.
Topic Modeling Based on SERP Data
Topic modeling is another really useful technique that can be deployed for our SERP analysis. It refers to a process of extracting topics hidden in a corpus of text; in our case, the SERPs for our set of target keywords.
While there are a number of different techniques for topic modeling, the one that seems favored by data scientists is LDA (Latent Dirichlet Allocation), so that is the one I chose to work with.
A great explanation of how LDA for topic modeling works comes from the Analytics Vidhya blog:
“LDA assumes documents are produced from a mixture of topics. Those topics then generate words based on their probability distribution. Given a dataset of documents, LDA backtracks and tries to figure out what topics would create those documents in the first place.”
Although our keywords are all about ‘yoga’, the LDA mechanism we use assumes that within that corpus there will be a set of other topics.
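To give a feel for what “backtracking” from documents to topics involves, here is a toy collapsed Gibbs sampler for LDA written with only the standard library. It is a teaching sketch, not the implementation behind the visuals in this article; for real work you would normally reach for an established library implementation. The function names, the hyperparameter defaults (`alpha`, `beta`, `iterations`), and the sample documents are all assumptions made for illustration.

```python
import random
from collections import Counter


def fit_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA over tokenized documents."""
    rng = random.Random(seed)
    vocab_size = len({word for doc in docs for word in doc})
    doc_topic = [[0] * num_topics for _ in docs]         # topic counts per document
    topic_word = [Counter() for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics                       # tokens assigned to each topic
    assignments = []

    # Start by giving every token a random topic.
    for d, doc in enumerate(docs):
        doc_assignments = []
        for word in doc:
            topic = rng.randrange(num_topics)
            doc_assignments.append(topic)
            doc_topic[d][topic] += 1
            topic_word[topic][word] += 1
            topic_total[topic] += 1
        assignments.append(doc_assignments)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                # Take the token out of the counts...
                old = assignments[d][i]
                doc_topic[d][old] -= 1
                topic_word[old][word] -= 1
                topic_total[old] -= 1
                # ...then resample its topic: topics that are common in this
                # document and that already favor this word get more weight.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][word] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                new = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = new
                doc_topic[d][new] += 1
                topic_word[new][word] += 1
                topic_total[new] += 1
    return topic_word, doc_topic


def top_terms(topic_word, n=3):
    """List the n most frequent words currently assigned to each topic."""
    return [[w for w, c in counts.most_common(n) if c > 0] for counts in topic_word]


# Hypothetical pre-processed SERP snippets (already tokenized).
docs = [
    ["poses", "beginners", "stretch", "poses"],
    ["mat", "reviews", "price", "mat"],
    ["beginners", "stretch", "poses", "class"],
    ["price", "mat", "reviews", "shop"],
]

topic_word, doc_topic = fit_lda(docs, num_topics=2)
print(top_terms(topic_word))
```

With a corpus this small the topics are not meaningful; the point is only to show the mechanics of assigning words to topics and refining those assignments over many passes.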
We can also use the Jupyter Notebook interface to create interactive visuals of these topics and the “keywords” they are built from.
The reason topic modeling from our SERP corpus can be so valuable to an SEO, content marketer, or digital marketer is that the topics are constructed based on what Google thinks is most relevant to a searcher in our target vertical (remember, Google algorithmically rewrites the SERPs).
With our SERP content corpus, let’s take a look at the output for our yoga keywords (visualized using the pyLDAvis package):
You can find a thorough definition of how this visualization is computed here.
To summarize, in my own painfully unscientific way, the circles represent the different topics found within the corpus (based on clever machine learning voodoo). The further apart the circles are, the more distinct those topics are from one another.
The terms listed on the right of the visualization are the words that make up these topics. These words are what I use to understand the main topic, and they are the part of the visualization that has real value.
In the video below, I’ll show you how I can interact with this visual:
At a glance, we’ll be able to see what subtopics Google thinks searchers are most interested in. This can become another important data point for content ideation, and the list of terms the topics are built from can be used for topical on-page optimization.
The data here can also have applications in optimizing content recommendations across a site and internal linking.
For example, if we are creating content around ‘topic cluster 4’ and we have an article about the best beginner yoga poses, we know that someone reading that article might also be interested in a guide to improving posture with yoga.
This is because ‘topic cluster 4’ is made up of words like this:
I can also export the list of associated terms for my topics in an Excel format, so it’s easy to share with other teams that might find the insights helpful (your content team, for example):
Ultimately, topics are characteristic of the corpus we’re analyzing. Although there’s some debate around the practical application of topic modeling, building a better understanding of the characteristics of the SERPs we’re targeting will help us better optimize for them. That is valuable.
One last point on this: LDA doesn’t label the topics it creates (that’s down to us), so how applicable this research is to our SEO or content campaigns depends on how distinct and clear our topics are.
The screenshot above is what a good topic cluster map will look like, but what you want to avoid is something that looks like the next screenshot. The overlapping circles tell us the topics aren’t distinct enough:
You can avoid this by making sure the quality of your corpus is good (i.e., remove stop words, apply lemmatization, etc.), and by researching how to train your LDA model to identify the ‘cleanest’ topic clusters based on your corpus.
Interested in applying topic modeling to your research? Here is a great tutorial taking you through the entire process.
What Else Can You Do With This Analysis?
While there are some tools already out there that use these kinds of techniques to improve on-page SEO performance, support content teams, and provide user insights, I’m an advocate for developing your own scripts/tools.
Why? Because you have more control over the input and output (i.e., you aren’t just popping a keyword into a search bar and taking the results at face value).
With scripts like this, you can be more selective with the corpus you use and the results it produces, by applying filters to your PoS analysis or refining your topic modeling approach, for example.
The more important reason is that it allows you to create something that has more than one useful application.
For example, I can create a new corpus out of subreddit comments for the topic or vertical I’m researching.
Doing PoS analysis or topic modeling on a dataset like that can be really insightful for understanding the language of potential customers or what is likely to resonate with them.
The most obvious alternative use case for this kind of analysis is to create your corpus from content on the top ranking pages, rather than the SERPs themselves.
Again, the likes of Screaming Frog and DeepCrawl make it relatively simple to extract copy from a landing page.
This content can be merged and used as your corpus to gather insights on co-occurring terms and the on-page content structure of top performing landing pages.
If you start to work with some of these techniques yourself, I’d also suggest you research how to apply a layer of sentiment analysis. This would allow you to look for trends in words with a positive sentiment versus those with a negative sentiment, which can be a useful filter.
I hope this article has given you some inspiration for analyzing the language of the SERPs.
You can get some great insights into:
- What kinds of content might resonate with your target audience.
- How you can better structure your on-page optimization to account for more than just the query term, but also context and intent.
Featured Image: Unsplash
All screenshots taken by author, June 2019