How to Remove Unwanted Web Pages to Improve Your Ranking in Search Results

SEO Audit: Scrape and Remove Your Indexed Web Pages

There are a lot of reasons to scrape Google search results. Here, I will show you how to scrape them to improve your website’s visibility in Google.

Everyone wants to rank well and get more traffic with a lower bounce rate. On-page and off-page optimization alone won’t give you the best results; you always need to provide a good experience, not only for users but also for search bots.

1) How to scrape the list of URLs in Google search results?

2) How to check the HTTP status code (200, 301, 404) of the indexed URLs?

3) How to remove the URLs from the search results?

Why do you need to remove indexed web pages?

This method helps you improve search rankings and traffic by removing duplicate pages, thin-content pages, and unwanted dynamic URLs from the search index.

Britney Muller of Moz deindexed 75% of Moz’s website and found huge success.

Even Matt Cutts mentioned this long ago, though most people never bother to act on it.

How to Scrape Google Search Results?

There are a lot of methods, premium tools, freemium tools, and Python scripts available to scrape the indexed web pages in Google search results, but I’ll show you one that is totally free and easy.

First, download the Chrome extension Linkclump. Go to the extension’s settings and configure it to copy only links. (You can even configure it to scrape both titles and URLs.)

[Screenshot: Linkclump settings for copying the list of URLs]

Now head back to Google Search, click Settings > Search settings, change the results per page to 100, and click Save. You may be prompted with a CAPTCHA verification.

Now we will make use of the Google search operator “site:”
Ex. site:digitalgasm.com
In case you need only the HTTPS pages, you can fine-tune the search with the protocol, or you can even narrow it down to a specific path.
Ex. site:https://digitalgasm.com or Ex. site:https://digitalgasm.com/seo
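If your site recently moved to HTTPS, one handy combination (using Google’s standard minus and inurl: operators) is to exclude HTTPS URLs so any leftover HTTP pages in the index stand out:
Ex. site:digitalgasm.com -inurl:https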

[Screenshot: the Google site: search operator in action]

You can try multiple combinations of search operators according to your requirements, but to get most of your indexed pages in the search results I recommend going with site:yourdomain.com. Hit search and, before going to the next step, visit the last page of the results and click “repeat the search with the omitted results included”.

If you don’t see that link on the last page, it’s fine; come back to the first page. Then hold the “Z” key, left-click, and drag to the bottom of the search results (Linkclump’s default shortcut key is “Z”). Then open Google Sheets and paste the links. Repeat the same step on the other pages until you reach the last search result.

Now you have the list of URLs that have been indexed by Google. The next step is to find the redirected, unresponsive, and broken pages.
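Since Linkclump can grab the same link more than once across result pages, it’s worth deduplicating the list first. Assuming your URLs are pasted into column A, a built-in Sheets formula does the job; put it at the top of an empty column and work from the deduplicated list:

=UNIQUE(A2:A)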

Check the HTTP status code for the list of URLs

You have already copied the list of URLs into Google Sheets. Now head to Google Apps Script, where you can create your own custom scripts that interact with G Suite products.

Click Tools > Script editor in your Google Sheets.

[Screenshot: the script editor menu in Google Sheets]

Copy the script below, paste it into the script editor, and save the project as “getStatusCode”. The script work is done; now head back to your Google Sheet.

function getStatusCode(url) {
  var options = {
    // Return error codes (4xx/5xx) instead of throwing an exception.
    'muteHttpExceptions': true,
    // Don't follow redirects, so 301/302 pages report their redirect code.
    'followRedirects': false
  };
  var url_trimmed = url.trim();
  var response = UrlFetchApp.fetch(url_trimmed, options);
  return response.getResponseCode();
}

Now, to run your script, type =getStatusCode(A2) in the cell next to your first URL and hit Enter. If you see a result in the cell, your function is working perfectly; just drag the same formula down to the last URL in your list. It will take some time depending on the number of URLs, and once it’s done you will see the status code next to each URL.
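If your list runs into the hundreds, one fetch per cell gets slow, and the formulas recalculate every time the sheet reloads. Here is a minimal batch sketch, assuming your URLs start in cell A2 with a header row above them and you want the codes written into column B; it uses UrlFetchApp.fetchAll to issue all the requests in a single batch:

function checkAllStatusCodes() {
  var sheet = SpreadsheetApp.getActiveSheet();
  // Read every URL from column A, starting below the header row.
  var numRows = sheet.getLastRow() - 1;
  var urls = sheet.getRange(2, 1, numRows, 1).getValues();
  var requests = urls.map(function(row) {
    return {
      url: row[0].toString().trim(),
      muteHttpExceptions: true,  // report 4xx/5xx codes instead of throwing
      followRedirects: false     // report 301/302 instead of the final page
    };
  });
  // Issue all the requests in one batch and collect the status codes.
  var responses = UrlFetchApp.fetchAll(requests);
  var codes = responses.map(function(r) { return [r.getResponseCode()]; });
  // Write the codes into column B, next to their URLs.
  sheet.getRange(2, 2, codes.length, 1).setValues(codes);
}

Run checkAllStatusCodes once from the script editor instead of dragging the formula; the results land as plain values, so they won’t recalculate later.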

If you want to know more about this method, refer to the article “How to use google spreadsheets to check for broken links”, which is where I came across this specific hack. Before that, I was using SeoTools for Excel, which is a premium tool, and even Xenu’s Link Sleuth, which is free. But I prefer this method because it’s easy, free, and works without installing anything on your laptop.

Then handpick or sort the web pages by status code. Take out the 404 pages, identify the thin-content and duplicate pages, and redirect them to relevant pages or to the homepage.
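If you’d rather not filter by hand, here is a small sketch, again assuming your URLs are in column A and their status codes in column B, that copies every URL with a given status code onto its own sheet, ready to paste into a text file for the removal step below:

function listPagesByStatus() {
  var statusCode = 404; // change this to 301, 500, etc. as needed
  var spreadsheet = SpreadsheetApp.getActiveSpreadsheet();
  var rows = spreadsheet.getActiveSheet().getDataRange().getValues();
  // Keep only the rows whose status code (column B) matches,
  // and reduce each one to just its URL (column A).
  var matches = rows.filter(function(row) { return row[1] === statusCode; })
                    .map(function(row) { return [row[0]]; });
  // Write the matching URLs to a new sheet named after the status code.
  var out = spreadsheet.insertSheet('status-' + statusCode);
  if (matches.length > 0) {
    out.getRange(1, 1, matches.length, 1).setValues(matches);
  }
}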

How to remove indexed web pages from Google’s search list?

Once you create the redirects, Google will automatically remove those pages from the search list, but if you have many pages and just want to speed up the process, you can continue with this step.

Download the Chrome extension Google webmaster bulk removal tool and extract the ZIP file. Then go to the Extensions page in Chrome, switch on Developer mode (in the top-right corner), click “Load unpacked”, and choose the folder of the extracted ZIP file.

[Screenshot: Developer mode on the Chrome extensions page]

Head to Google Search Console, then Google Index > Remove URLs. There will be an option to choose a text file.

[Screenshot: the URL removal tool in Search Console]

Copy all the links that you want to remove from the search index, paste them into a text file, and upload it there. The extension will automatically submit all the URLs in the text file. Note that this is only a temporary hide: if you haven’t redirected the page or excluded it in robots.txt or a meta robots tag, it will get indexed again in Google search.

That’s it, you are done. You will see the pages removed from the search results in a few days, and if you have submitted many duplicate and thin-content pages, there should be a significant increase in rankings over time (not within a few days).

Let me know your technique and how helpful this was in the comment box.
