Factor/To do/Spider

  • make filters compile somehow
  • random sleep between requests (see the sketch after this list)
  • redirects
  • https
  • cookies
  • connect timeout, page timeout, data timeout, and overall timeout; stop the spider if the overall timeout is reached
  • parse robots.txt and make filters from it (see the sketch after this list)
  • flag to disable robots.txt ;)
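
Sketching the random-sleep item: one word that pauses for a random number of milliseconds before the next request. This is only a sketch; the word name and the 2000 ms upper bound are made up for illustration.

    USING: calendar random threads ;

    ! Pause for a random duration below 2000 ms between requests.
    : random-sleep ( -- ) 2000 random milliseconds sleep ;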
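
A robots.txt filter could reduce to prefix tests on URL paths. The sketch below assumes the Disallow lines have already been collected into a sequence of path prefixes; the word name and the idea of a filter as a quotation over paths are assumptions, not the spider's actual API.

    USING: kernel sequences ;

    ! Reject any path that starts with one of the disallowed prefixes.
    : allowed-path? ( path disallowed -- ? ) [ head? ] with any? not ;

    ! Example: "/private/page" { "/private" "/tmp" } allowed-path? .  => f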

Not immediately needed

  • parallel version
  • retry framework
  • retry when the connection fails
  • option to turn off dns caching
  • proxies
  • option to check whether pages exist without downloading them
  • custom user agent string
  • custom http headers
  • spidering the results of another spider run
  • save to database
  • save to directories/files
  • follow relative links only
  • support ftp spidering
  • bytes-per-second download rate limit (see the sketch after this list)
  • download quota
  • quiet mode
  • prefer ipv4/ipv6
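
One crude way to get a bytes-per-second limit: after each downloaded chunk, sleep for size/limit seconds, which keeps the average rate under the limit (it ignores time already spent transferring, so it over-throttles). The word name and stack effect are assumptions for the sketch.

    USING: calendar math threads ;

    ! Sleep long enough that bytes per second stays below the limit.
    : throttle ( bytes limit -- ) / seconds sleep ;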
