Download of entire web site

dugan at passwall.com
Sun Oct 6 08:37:40 PDT 2002


On Sun, Oct 06, 2002 at 12:38:49AM -0800, David White wrote:
> Has anyone on this list heard of web crawlers that download an entire
> site - including 100s of megabytes of images and thousands of
> dynamically generated pages?  It's almost like someone was trying to set
> up a duplicate site.
> 
> The download was apparently by one computer - but it used 4 IP
> addresses, each with a different useragent: Win95, Mac_PowerPC, Win2000,
> and Konqueror.  The session cookie was the same for all 4 IP addresses,
> suggesting that it was a single computer - but the different useragent
> strings suggest that it was trying to make itself less conspicuous.
> I've blocked the 4 addresses.
> 
> This doesn't seem like legitimate web crawler behavior.  Has anyone
> encountered this before?  I'm worried that someone is trying to do
> something bad - but so far I can't figure out what.

I publish a lot of technical content. It is *very* common for people to
download entire trees of docs from my site to their local machine using
wget or other web-site archivers. Most users are kind and only grab a
single tree rather than my whole site. However, I have had users grab
the whole site.
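For example, a reader mirroring one documentation tree for off-line
reading might run something like this (the URL is only a placeholder):

$ wget --mirror --no-parent --convert-links --page-requisites \
      http://www.example.com/docs/

The --no-parent option keeps the download confined to that tree; without
it, or when pointed at a site's top-level URL, wget will happily walk the
whole site.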

Some reasons for this behavior:
1) They want to search for certain content and find command-line
searching of local files to be more effective.
2) They don't have Internet access at home, so they download to a Zip
disk and take the disk home to browse off-line.
3) They are attempting a 4-way DDoS when spread across multiple
machines. (stupid)
4) They are trying to be sneaky and make the connections appear to come
from different clients while they grab all of your published web content,
hoping to find security holes in the source so they can attack your
site. This may include attempts to grab published password files and other
"hidden" content. (User-agent strings are trivial to fake; see the wget
example after this list.)
5) You have porn, MP3s, or other desirable content on your web site (or at
least the client *thinks* you have such content) and they are going to
download your whole site looking for it.
6-X) ... (other ideas)
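On the user-agent point: any client can claim to be any browser, so the
four different strings in your logs say nothing about the actual software.
With wget, for instance, spoofing the header is a single option (the
string below is just an illustration):

$ wget --user-agent="Mozilla/4.0 (compatible; MSIE 5.0; Windows 95)" \
      http://www.example.com/

The shared session cookie is much stronger evidence that it was one
client than the differing user agents are evidence that it was four.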

I have been known to download a tree of a site when I want to grab
content to read off-line, and when I am "looking for holes" in a site.
This has included movie sites that have games to unlock special content
but require "passwords" to gain access. Sometimes this can be effective
when Java/JavaScript are used and they expect the client to determine its
own authentication for content. (heh) (This is not computer trespass, as
the public is permitted to have access to the content if they have a
valid password, which can be gained by playing their games. This is not
access to private, unpublished data.)

Try doing a reverse DNS lookup on the IP addresses, and a whois lookup,
to see who owns them. I have been using jwhois lately, as it seems to
handle many US, European, and Asian registries well, unlike the regular
whois client, which can sometimes have problems with overseas lookups.
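Using the same example address as the dig command below, the lookups
would look something like this:

$ jwhois 172.31.254.254
$ whois 172.31.254.254

(172.31.x.x is RFC 1918 private space, so it is only a placeholder here;
substitute the real addresses from your logs.)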

Another way is to use dig to do the reverse lookup by IP:
$ dig -x 172.31.254.254
(example address)

If an address resolves to a known search robot, then you have a good idea
who was doing the crawling, and you can use /robots.txt to tell it which
content it may fetch.
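A minimal /robots.txt asking well-behaved crawlers to stay out of certain
directories might look like this (the paths are only examples):

User-agent: *
Disallow: /images/
Disallow: /cgi-bin/

Keep in mind that robots.txt is purely advisory; the kind of client
described above is unlikely to honor it.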

> (I've been monitoring this list for years, but I can't remember whether
> I've posted before.)

If this is your first post, welcome to our group and list! :-)
else, welcome back! :-)

-ME

-- 
-----BEGIN GEEK CODE BLOCK-----
Version: 3.12
GCS/CM$/IT$/LS$/S/O$ !d--(++) !s !a+++(-----) C++$(++++) U++++$(+$) P+$>+++ 
L+++$(++) E W+++$(+) N+ o K w+$>++>+++ O-@ M+$ V-$>- !PS !PE Y+ PGP++
t at -(++) 5+@ X@ R- tv- b++ DI+++ D+ G--@ e+>++>++++ h(++)>+ r*>? z?
------END GEEK CODE BLOCK------
decode: http://www.ebb.org/ungeek/ about: http://www.geekcode.com/geek.html


