
No 4, September 2009
Funnelback PI: Digging up the Dirt on Your Organisation’s Website
Author: Dan Nitsche, Technical Consultant, Pre and Post Sales
When introducing a new application to your organisation, it doesn’t take much digging to awaken the business monsters that have been hiding in the dark corners of your office. Be it the business process that’s “always been like that”, or the files that “are on this floppy disc”, be prepared for some shocks at what’s been going on behind the scenes.
Funnelback is no exception, and is likely to find all kinds of things you didn’t know about your website and Intranet. Here are some interesting example queries you can try on your own Funnelback installation:
t:untitled – all results with “Untitled” in the title
t:"index of" "parent directory" – automatic listings of files generated by Apache
t:"pdf" f:"pdf" – PDFs with bad titles
t:"doc" f:"doc" – DOCs with bad titles
t:"404" – 404 not found pages that aren’t using a proper HTTP 404 response code
v:old v:archive v:backup v:log – old, outdated and unnecessary content
v:login – a good place to start looking when trying to gain access to your systems (also try t:login password)
Search for any previous names your organisation has gone under
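The queries above can be kept as a checklist and turned into search links to rerun after each crawl. This is a minimal sketch in Python, not an official Funnelback tool: the search URL `https://search.example.org/s/search.html`, the collection name `intranet`, and the parameter names `collection` and `query` are all assumptions, so substitute whatever your own installation uses.

```python
from urllib.parse import urlencode

# Hypothetical values: replace with your own installation's search URL,
# collection name and parameter names.
SEARCH_URL = "https://search.example.org/s/search.html"
COLLECTION = "intranet"

# Audit queries taken from the list above, each with a note on what it finds.
AUDIT_QUERIES = {
    't:untitled': "pages with 'Untitled' in the title",
    't:"index of" "parent directory"': "automatic directory listings",
    't:"pdf" f:"pdf"': "PDFs with bad titles",
    't:"doc" f:"doc"': "DOCs with bad titles",
    't:"404"': "not-found pages served without a 404 status",
    'v:old v:archive v:backup v:log': "old, outdated or unnecessary content",
    'v:login': "login pages",
}


def audit_links(search_url=SEARCH_URL, collection=COLLECTION):
    """Return (description, url) pairs, one per audit query."""
    links = []
    for query, description in AUDIT_QUERIES.items():
        # urlencode escapes the colons, quotes and spaces in each query.
        params = urlencode({"collection": collection, "query": query})
        links.append((description, f"{search_url}?{params}"))
    return links


if __name__ == "__main__":
    for description, url in audit_links():
        print(f"{description}: {url}")
```

Previous names of your organisation can be added to `AUDIT_QUERIES` in the same way.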
You may also find some interesting results for terms such as “Viagra”, “Online pharmacy”, “Get out of debt” or “Online degree”. In most cases you would not expect these terms to return any results, but on closer investigation (try viewing the HTML source) these searches can reveal comments, pages or entire websites that have been compromised by spammers.
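Viewing the HTML source by hand becomes tedious once a search returns more than a few suspicious pages. The sketch below, using only the Python standard library, reports whether a spam term appears in a page's text or inside an HTML comment. The class and function names are illustrative, the term list is just the examples above, and attribute values (such as link targets) are not inspected.

```python
from html.parser import HTMLParser

# Example terms from the paragraph above; extend to suit your organisation.
SPAM_TERMS = ["viagra", "online pharmacy", "get out of debt", "online degree"]


class SpamTermFinder(HTMLParser):
    """Record spam terms found in page text and in HTML comments."""

    def __init__(self, terms=SPAM_TERMS):
        super().__init__()
        self.terms = [term.lower() for term in terms]
        self.hits = []  # list of (where, term) tuples

    def _scan(self, where, text):
        lowered = text.lower()
        for term in self.terms:
            if term in lowered:
                self.hits.append((where, term))

    def handle_data(self, data):
        # Called for text between tags, including text hidden by styling.
        self._scan("text", data)

    def handle_comment(self, data):
        # Called for the contents of <!-- ... --> comments.
        self._scan("comment", data)


def find_spam_terms(html_source):
    """Return a list of (where, term) tuples for one page's HTML source."""
    finder = SpamTermFinder()
    finder.feed(html_source)
    finder.close()
    return finder.hits
```

A hit reported as `"comment"`, or a `"text"` hit that you cannot see when the page is rendered in a browser, is a strong hint that the page deserves a closer look.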
Whatever the Funnelback crawler may find, it’s important to allocate time within your organisation to:
Create and maintain an accurate robots.txt
Review your server configuration for proper HTTP error codes and date stamps
Remove restricted, inappropriate or old content from your website and Intranet
Review outdated, inappropriate or sensitive content on your website and/or Intranet
This will not only ensure that your Funnelback search results are appropriate, but also help external search engines avoid any content you would rather keep hidden.
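One way to keep a robots.txt accurate is to test it against the paths you expect crawlers to skip. The sketch below uses Python's standard `urllib.robotparser` module. The robots.txt rules, the paths and the `ExampleBot` user agent are hypothetical examples, not recommendations for any particular site.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt; the disallowed paths are hypothetical.
ROBOTS_TXT = """\
User-agent: *
Disallow: /backup/
Disallow: /archive/
Disallow: /logs/
"""


def blocked_paths(robots_txt, paths, user_agent="ExampleBot"):
    """Return the paths that the given robots.txt tells user_agent not to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [path for path in paths if not parser.can_fetch(user_agent, path)]


if __name__ == "__main__":
    to_check = ["/index.html", "/backup/site-2008.zip", "/archive/news.html"]
    for path in blocked_paths(ROBOTS_TXT, to_check):
        print(f"disallowed: {path}")
```

Keep in mind that robots.txt only asks well-behaved crawlers to stay away; it does not restrict access, so sensitive content still needs to be removed or protected on the server.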