robots.txt - why did it fail to stop Google from indexing folders?

tomaszjot

Dec 22, 2009
So I was checking the number of indexed pages in Google, only to see that instead of 200-plus pages I've got 700-plus pages indexed. I checked them. Most of them are hop links to aff offers, function folders with functions, etc. That is obviously not fucking good.

I protect my folders with robots.txt, I nofollowed aff links and any unnecessary shit, and put NOFOLLOW, NOINDEX on unwanted pages, yet Google indexed all that shit. How is that possible? Any ideas?

I assume my robots.txt "code" was correct as I never had any issues before.

What could have happened? It happened on 3 sites that were on a shared host (I know they shouldn't be there anyway; I was about to move them). Is it possible that the robots.txt files were skipped for some reason?

Did something similar happen to anyone here?

I assume that these pages can disappear from the index after a while...
 


It's actually something I've noticed the last couple of months as well. Folders and files that have been nofollow/disallowed/noindex'd - the works - have been starting to pop up as indexed.

Same as you, hop links and other shit.
 

Thanks for the answer. It's a bit worrying, definitely not something you want to see. I wonder how it's affecting the site SEO-wise.

I release one site after another based on a similar setup, which doesn't look like a good idea now. I need to look for a different approach.
 
You are assuming Google needs to crawl your actual page to index it. Not true.

Your page can & will show up in SERPs without Google actually crawling the page itself. G will crawl the rest of your site + other sites to harvest enough data about your "hidden" url to create a snippet for the SERPs.

You'll need to use the URL removal tool in GWT to remove it.
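
To make the point above concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt contents, the `/go/` and `/functions/` folders, the `example.com` URLs, and the `may_fetch` helper are all assumptions made up for illustration. It only shows what a Disallow rule answers: whether a crawler may download a URL. It says nothing about whether that URL can be listed in the index.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; the folder names are assumptions, not the OP's real file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /go/
Disallow: /functions/
"""


def may_fetch(robots_txt: str, url: str, agent: str = "Googlebot") -> bool:
    """Return True if the given robots.txt allows `agent` to download `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # A hop link inside a disallowed folder: the crawler may not download it.
    print(may_fetch(ROBOTS_TXT, "https://example.com/go/offer1"))
    # A normal page outside the disallowed folders: the crawler may download it.
    print(may_fetch(ROBOTS_TXT, "https://example.com/about.html"))
```

A `False` here only means "do not download". If the blocked URL is linked from a crawlable page, the bare URL can still be picked up and listed.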
 

Thanks omgyams

Well, this is something I'm not too happy about. I have not added this site to GWT yet.

There is some discussion here --> "Google has ignored my robots.txt file" (Webmaster Central Help), in case someone has a similar problem.

Especially interesting:

That said, robots.txt does not impede indexing of an URL restricted from crawling by robots.txt that's linked on a crawlable page. In these cases, the blank URL gets indexed, without any content shown on the SERPs because content remains invisible as crawling is forbidden.

and

(Notice that these are listed in the index as just a url without description or cache).


They are listed like this because you have blocked Googlebot from finding out what to do with them. These links appear somewhere on your site or another site.
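
One consequence of the quotes above is that a NOINDEX meta tag on a page that is also disallowed in robots.txt is never read, because the crawler is not allowed to download the page that carries it. The sketch below is a simplified model of that interaction, not Google's actual logic. The `MetaRobots` class, the `noindex_is_visible` helper, the sample robots.txt, and the sample HTML are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class MetaRobots(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        for directive in (attributes.get("content") or "").split(","):
            if directive.strip():
                self.directives.add(directive.strip().lower())


def noindex_is_visible(robots_txt, url, html, agent="Googlebot"):
    """Return True only if the crawler may fetch `url` AND the page says noindex."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    if not rules.can_fetch(agent, url):
        # Blocked from crawling: the page is never downloaded, so the tag is never seen.
        return False
    page = MetaRobots()
    page.feed(html)
    return "noindex" in page.directives


# Hypothetical inputs for illustration only.
BLOCKING = "User-agent: *\nDisallow: /go/\n"
OPEN = "User-agent: *\nDisallow:\n"
PAGE = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'

if __name__ == "__main__":
    print(noindex_is_visible(BLOCKING, "https://example.com/go/offer1", PAGE))
    print(noindex_is_visible(OPEN, "https://example.com/go/offer1", PAGE))
```

Under this model, the noindex tag only takes effect when the folder is left crawlable, which would explain why combining both protections did not keep the URLs out of the index.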

thanks