Indexing Obstacles for Dynamic Web Sites
URL parameters, session IDs, reserved characters and deeply nested pages can all make it harder for search engines to fully index your website. If your site uses any of the following, you may find it hard to get indexed by the search engines:
Parameters
If your URLs include parameters (that is, they end with something like ?a=1&b=2) then the search engines may not index these pages. This is because a spider can get caught in an infinite loop, indexing the same page hundreds of times under different URLs with exactly the same content.
It used to be that no search engine would index pages with parameters. The situation is now much improved; however, to ensure your site is indexed by all the search engine spiders, limit each URL to a maximum of two parameters, and if possible use none.
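As a rough way to audit a list of your own URLs against the two-parameter guideline, here is a minimal Python sketch. The function names and the example.com addresses are made up for illustration, and the limit of two is simply the rule of thumb given above.

```python
from urllib.parse import urlsplit, parse_qsl

MAX_PARAMS = 2  # rule-of-thumb limit from the text, not a search engine setting


def count_params(url: str) -> int:
    """Return the number of query-string parameters in a URL."""
    query = urlsplit(url).query
    # keep_blank_values=True so "?a=&b=2" still counts as two parameters
    return len(parse_qsl(query, keep_blank_values=True))


def over_limit(urls, limit=MAX_PARAMS):
    """Return the URLs that carry more than `limit` parameters."""
    return [u for u in urls if count_params(u) > limit]


if __name__ == "__main__":
    # Hypothetical URLs, for illustration only
    pages = [
        "http://example.com/widgets",
        "http://example.com/page?a=1&b=2",
        "http://example.com/page?a=1&b=2&c=3",
    ]
    for url in over_limit(pages):
        print("Too many parameters:", url)
```

Run against an export of your site's URLs, this gives you a short list of pages to rewrite first.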
Session IDs
Question – What’s worse than a search engine not indexing a URL with a session ID?
Answer – A search engine that does index a URL with a session ID.
When a search engine spider meets pages with session IDs, any of the following could happen:
- Visitors coming from that search engine all use the same session ID and so share a shopping cart, exposing order and shipping history, and potentially credit card information
- The search engines index the same page under different session IDs, resulting in a duplicate content penalty
- The search engines ignore the page altogether, so your pages are not indexed
Sometimes this problem can be extremely hard to see. The site may only put the session ID in the URL when a visitor doesn’t accept cookies, and since spiders generally don’t accept cookies they would always see the session ID. To check this properly, turn off cookies in your browser, clear any stored cookies, and then access your site.
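If you would rather check a batch of links than click through the site by hand, a small script can flag URLs that look like they carry a session ID. The sketch below is only a heuristic: the list of parameter names is an assumption based on common defaults (PHPSESSID for PHP, jsessionid for Java servlet containers, plus a few generic guesses), so adjust it to whatever your own platform uses.

```python
import re
from urllib.parse import urlsplit, parse_qsl

# Assumed parameter names; replace with the ones your platform really uses
SESSION_PARAM_NAMES = {"phpsessid", "jsessionid", "sid", "sessionid", "session_id"}

# Java servlet containers often append ";jsessionid=..." to the path itself
PATH_SESSION_RE = re.compile(r";jsessionid=[^/?#;]*", re.IGNORECASE)


def has_session_id(url: str) -> bool:
    """Return True if the URL appears to carry a session ID."""
    parts = urlsplit(url)
    if PATH_SESSION_RE.search(parts.path):
        return True
    params = parse_qsl(parts.query, keep_blank_values=True)
    return any(name.lower() in SESSION_PARAM_NAMES for name, _ in params)
```

Feed it the links found on a page you fetched with cookies switched off; any URL it flags is one a spider would probably see too.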
Reserved Characters
Moving away from standard alphanumeric characters in a URL can cause issues. HTML and URLs reserve certain characters for special uses. An example of this is the & character, which in URLs is used to separate parameters. If your site uses these non-alphanumeric characters you need to encode them wherever they appear. Even then, someone linking to you from their own site could mistype the URL, causing the link to point to the wrong address.
The # character (pound sign or hash sign) is used for accessing anchors on pages. It can be kept when used for this purpose, but avoid file names that contain it.
Spaces cause similar trouble: they are not valid in URLs and have to be encoded as %20. To represent a space in a URL, use a hyphen (‘-’) instead.
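To make this concrete, here is a short Python sketch of both approaches: building a hyphenated, alphanumeric-only file name from a page title, and percent-encoding a name when its special characters have to stay. The function names are invented for this example; quote comes from Python's standard urllib.parse module.

```python
import re
from urllib.parse import quote


def slugify(title: str) -> str:
    """Turn a page title into a lowercase, hyphen-separated file name."""
    # Replace every run of non-alphanumeric characters with a single hyphen
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower())
    return slug.strip("-")


def encode_path_segment(segment: str) -> str:
    """Percent-encode a path segment that must keep its special characters."""
    # safe="" means "/" is encoded too, so the segment stays a single segment
    return quote(segment, safe="")


if __name__ == "__main__":
    title = "Blue Widgets & Gadgets #1"
    print(slugify(title))              # blue-widgets-gadgets-1
    print(encode_path_segment(title))  # Blue%20Widgets%20%26%20Gadgets%20%231
```

The slug is the friendlier of the two for visitors and for anyone typing a link to you by hand, which is the reason to prefer hyphens over encoded spaces.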
Nested Too Deep
While not really about the syntax for URLs, if your page is more than three levels deep on your site the search engines may deem it irrelevant and not index it.
Depth should be measured from the home page of your site: count the number of clicks it takes to get to the destination page, add one (so you include the home page) and that’s your level. Keep pages at most three levels deep; if you can’t, consider adding a search engine friendly site map.
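If you already have a map of which pages link to which, the level of every page can be worked out with a breadth-first walk from the home page. The sketch below assumes the links are held in a plain dictionary (page to list of linked pages); that structure and the function names are made up for illustration, and how you collect the links is up to you.

```python
from collections import deque


def page_levels(links: dict, home: str) -> dict:
    """Map each reachable page to its level: clicks from home, plus one."""
    levels = {home: 1}  # the home page itself is level 1
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in levels:  # first visit is the shortest route
                levels[target] = levels[page] + 1
                queue.append(target)
    return levels


def too_deep(links: dict, home: str, max_level: int = 3) -> list:
    """Return the pages that sit deeper than `max_level`, sorted by name."""
    levels = page_levels(links, home)
    return sorted(page for page, level in levels.items() if level > max_level)
```

For a hypothetical site where "/" links to "/products", which links to "/products/widgets", which in turn links to "/products/widgets/blue", the last page comes out at level four and would be the one to surface through a site map or a shorter route.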
Links
If you’re still having problems getting your site indexed, it might just be a lack of incoming links. Arrange both non-reciprocated and reciprocated links from good quality sites. Instead of arranging for all your links to go to your home page, try to arrange deep links to your internal pages; some directories allow deep linking to internal pages as well as listing your home page.