Discussion: [Bug-wget] Check external reference, but don't process further
Fernando Gont
2018-11-27 11:20:45 UTC
Folks,

I'm using wget in "--spider" mode in a script to check for broken links
on a web site.

I'd like wget to operate in recursive mode for pages in the target
domain, but not for pages in other hosts/sites.

That is, if I'm crawling www.example.com, I'd like wget to process all
pages in that domain recursively. However, if there's a link to an
external site, I just want wget to check that URL, but not process that
external reference recursively.

"-D" would seem to prevent checking external references, so I cannot use
it. And "--level" would mean that pages on external sites my still be
processed recursively.
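
For reference, the two variants I mean are roughly the following, with
www.example.com standing in for the real site:

$ wget --spider -r -H -D example.com http://www.example.com/   # stays on the domain, but never even checks external URLs
$ wget --spider -r -H --level=2 http://www.example.com/        # caps depth everywhere; external hosts may still be crawled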

Any advice on how to implement this?

Thanks!

Cheers,
Fernando
--
Fernando Gont
SI6 Networks
e-mail: ***@si6networks.com
PGP Fingerprint: 6666 31C6 D484 63B2 8FB1 E3C4 AE25 0D55 1D4E 7492
Darshit Shah
2018-11-27 12:30:25 UTC
Hi Fernando,

As far as I'm aware, there is no way to limit the recursion depth only on
foreign hosts. Something like this would definitely be a lot easier to do
using Wget2, which offers a few more powerful tools than Wget does. Wget2's
alpha is currently available in the Debian repositories and in Arch Linux's
AUR.

If you'd still like to continue using Wget, one way to pull this off would be
to have Wget print its debug output and then parse that to extract all the
URIs on foreign hosts. You can then use a second invocation of Wget to test
for their existence. An example of doing this would be:

$ wget -r --spider -d example.com 2>&1 | grep -B1 "This is not the same hostname as the parent's" | grep "Deciding whether to enqueue" | sed 's/.*\"\(.*\)\"\./\1/g' | wget --spider -i-

(Note that Wget writes its debug output to stderr, hence the 2>&1.)

Of course, you may want to modify this to meet your own needs, but the
general idea should work for you.
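
If you'd rather keep the intermediate list around, the same idea can be split
into two steps, along these lines (this is only a sketch, and
foreign-urls.txt is just a placeholder name):

# Step 1: crawl the target site in spider mode with debug output enabled
# (debug messages go to stderr, hence the 2>&1) and collect every URL that
# Wget refused to enqueue because it lives on a foreign host.
wget -r --spider -d http://www.example.com/ 2>&1 \
  | grep -B1 "This is not the same hostname as the parent's" \
  | grep "Deciding whether to enqueue" \
  | sed 's/.*"\(.*\)"\./\1/' \
  | sort -u > foreign-urls.txt

# Step 2: send one HEAD request per foreign URL, with no recursion at all.
wget --spider --no-verbose -i foreign-urls.txt
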
--
Thanking You,
Darshit Shah
PGP Fingerprint: 7845 120B 07CB D8D6 ECE5 FF2B 2A17 43ED A91A 35B6