Archive for the 'googlebot' Category

9 ways Google is discovering the invisible web

Tuesday, July 1st, 2008

There are many parts of the web that Googlebot has not been able to access, but Google has been working to shrink that gap. Google wants to find content, and while many webmasters do not make it easy, Googlebot finds a way.

1. Crawling Flash!
Adobe announced today that it has released technology and information to Google and Yahoo that enables them to crawl Flash files. It may take the search engines some time to integrate and implement these abilities, but a time is coming when rich media is less of a liability. I wonder if MSN/Live was left out to prevent them from reverse engineering Flash for their new Silverlight competitor? At any rate, MSN is still working on accessing text links, so let’s not swamp them.

2. Crawling forms
Googlebot recently started filling out forms on the web in an attempt to discover content hidden behind jump menus and other forms. See our previous article if you’d like to keep Google out of your forms.

3. Working with Government entities to make information more accessible
A year or so ago, Google started providing training to government agencies to assist them in getting their information onto the web. I’m assuming much of the information has been hidden behind URLs with large numbers of parameters.

4. Crawling JavaScript
Many menus and other dynamic navigation features have been created in JavaScript, and Googlebot has started crawling those as well. Instead of relying on webmasters to provide search-friendly navigation, Google is finally getting access to sites created by neophyte webmasters who haven’t been paying attention.

5. Google’s patent to read text in images
Google also knows many newbie webmasters use image buttons containing text for navigation. By attempting to read text in images, Googlebot will once again be able to open up previously inaccessible areas of a site.

6. Inbound links
Of course, Googlebot has always been great at following inbound links to new content. Much of the invisible web has been discovered just through humans linking to a previously unknown resource.

7. Submission
Of course, you can always submit a page location of currently invisible content to Google. This is usually the slowest way, especially compared to inbound links.

8. Google toolbar visits, analytics
Recently, many SEO professionals have noticed URLs being indexed that were never submitted or linked to. The only plausible explanation is that Google has been mining its toolbar and analytics data for information about new URLs. Be careful - Google is watching and sees all!

9. Sitemap.xml files
The somewhat new sitemap.xml protocol is very helpful for webmasters and Googlebot alike in getting formerly invisible content into Google’s hands.
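
As a rough illustration of what a sitemap.xml file contains, here is a minimal Python sketch that builds one from a list of page addresses. The function name build_sitemap and the example.com URLs are made up for this example; only the urlset/url/loc element names and the namespace come from the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a sitemap.xml document (as a string) listing the given URLs.

    build_sitemap is a hypothetical helper for this sketch, not part of
    any Google tool.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for address in urls:
        url = ET.SubElement(urlset, "url")
        # ElementTree escapes characters such as & in the address for us.
        ET.SubElement(url, "loc").text = address
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    # Placeholder addresses standing in for otherwise hard-to-find pages.
    print(build_sitemap([
        "https://example.com/catalog/item?id=1",
        "https://example.com/catalog/item?id=2",
    ]))
```

Saving the output as sitemap.xml in the site root is the usual convention, though where you place and submit it is up to you.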


5 web development techniques to prevent Google from crawling your HTML forms

Friday, April 18th, 2008

Google has recently decided to let its Googlebot crawl through forms in an effort to index the “Deep Web”. There are numerous stories about wayward crawlers deleting and changing content by submitting forms, and it’s about to get worse now that Googlebot will be submitting forms to get to your website’s deeper data. So what’s a web developer to do?

1. Use GET and POST requests correctly
Use GET requests in forms to look up information, and use POST requests to make changes. Google will only be crawling forms via GET requests, so following this “Best Practice” for forms is vital.

2. Make sure your POST forms do not respond to GET requests
It sounds so simple, but many sites are being exploited for XSS (Cross Site Scripting) vulnerabilities because they respond (and return HTML) to both GET and POST requests. Be sure to check your form input carefully on the backend, and for heaven’s sake - do not use globals!
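
Here is a minimal sketch of the first two points, written as a bare WSGI application in Python. Everything in it is hypothetical (the /lookup and /delete paths, the ARTICLES store, the _reply helper); the only point is that the lookup answers GET, the change answers POST, and the wrong method gets a 405 instead of being processed.

```python
from urllib.parse import parse_qs

# Hypothetical in-memory data store, standing in for a real database.
ARTICLES = {"1": "Hello world"}


def _reply(start_response, status, text, allow=None):
    """Send a plain-text response; 'allow' fills the Allow header on a 405."""
    headers = [("Content-Type", "text/plain; charset=utf-8")]
    if allow:
        headers.append(("Allow", allow))
    start_response(status, headers)
    return [text.encode("utf-8")]


def app(environ, start_response):
    method = environ["REQUEST_METHOD"]
    path = environ.get("PATH_INFO", "")

    if path == "/lookup":
        # Read-only lookup: safe for a crawler, so it answers GET only.
        if method != "GET":
            return _reply(start_response, "405 Method Not Allowed", "Use GET", allow="GET")
        query = parse_qs(environ.get("QUERY_STRING", ""))
        article_id = query.get("id", [""])[0]
        return _reply(start_response, "200 OK", ARTICLES.get(article_id, "not found"))

    if path == "/delete":
        # Changes data: refuse anything that is not a POST.
        if method != "POST":
            return _reply(start_response, "405 Method Not Allowed", "Use POST", allow="POST")
        size = int(environ.get("CONTENT_LENGTH") or 0)
        form = parse_qs(environ["wsgi.input"].read(size).decode("utf-8"))
        ARTICLES.pop(form.get("id", [""])[0], None)
        return _reply(start_response, "200 OK", "deleted")

    return _reply(start_response, "404 Not Found", "no such page")
```

To try it locally, the standard library's wsgiref.simple_server.make_server("", 8000, app).serve_forever() will serve it. A real application would also need authentication and input validation, which this sketch leaves out.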

3. Use robots.txt to keep robots OUT
A robots.txt file keeps Googlebot out of where it doesn’t belong. Luckily, Googlebot will continue its excellent support of robots.txt directives when it goes crawling through forms. Be sure not to accidentally restrict your website too much, however. Keep the directives simple, excluding by directory if possible. And test, test, test in Google’s Webmaster Tools!
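
Besides Webmaster Tools, you can sanity-check a set of directives locally. The sketch below uses Python's standard urllib.robotparser module; the /search/ and /forms/ directories and the example.com URLs are placeholders, and Python's parser is not guaranteed to interpret every rule exactly the way Googlebot does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: simple, directory-level exclusions.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search/
Disallow: /forms/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "https://example.com/blog/post",
        "https://example.com/search/results",
        "https://example.com/forms/contact",
    ):
        verdict = "allowed" if is_allowed(ROBOTS_TXT, "Googlebot", url) else "blocked"
        print(url, verdict)
```

Checking a few URLs you do want indexed is as important as checking the ones you want blocked, since an overly broad Disallow line is the easy mistake to make.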

4. Use robots metatag directives
Use the robots meta tag directives for more refined control. We recommend “nofollow” and “noindex” directives for both the form submission page and any search results pages you want Google to stay out of, even though Google says disallowing the form submission page is enough. Consider using tags and category pages that are Google friendly instead.
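
The tag in question normally looks like <meta name="robots" content="noindex, nofollow"> inside the page's head. As a quick way to confirm a page really carries it, here is a small Python sketch built on the standard html.parser module; the class and function names are invented for this example.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Collects directives from every <meta name="robots"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        # Directives are comma separated, e.g. "noindex, nofollow".
        for part in content.split(","):
            if part.strip():
                self.directives.add(part.strip().lower())


def robots_directives(html):
    """Return the set of robots meta directives found in an HTML string."""
    finder = RobotsMetaFinder()
    finder.feed(html)
    return finder.directives


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
    print(sorted(robots_directives(page)))
```

Running this over your form and search results templates is a cheap check that the directives actually made it into the rendered HTML.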

5. Use a CAPTCHA where possible
Googlebot isn’t going to fill out a CAPTCHA, so it’s an easy way to make sure some bot isn’t filling out your form.

Googlebot is, of course, the nicest bot you can hope to have visit your website. Its arrival gives you a chance to secure your forms and take the necessary precautions before other - not so polite - bots come calling.


7 untimely ways for an SEO to die

Friday, May 11th, 2007

In ancient Rome, the ghosts of the ancestors were appeased during Lemuria on May 9. Not many people know that, and even fewer care. But in the spirit of Lemuria, we offer seven untimely ways an SEO can die (it’s a dangerous world out there, and also I’m low on blog posting ideas):

- Bitten by search engine crawlers.

- Trampled by googlebots. (This is actually the best way to go, if you have to.)

- Trip over an HTML tag someone forgot to close. (This was funnier last night when I thought of it - go figure.)

- You get (google)whacked while visiting a bad link neighborhood.

- You’re doing the googledance, slip on a banana peel and hit your head. Certainly I’m not the only one who knows the googledance? Please submit your videos if you know it: googledance@hyperdogmedia.com.

- You receive a suspicious package in the mail, and it turns out to be a googlebomb.

- You’re setting linkbait traps and get an arm caught.

Please submit any other ideas you might have via email: lemuria@hyperdogmedia.com. So strike up that pun machine, it’s Friday!

Update: Debra just suggested you could “overdose on link juice” - if only!
