How do I prevent wget from creating index.html?C=M;O=A ?

Post by Evert Meulie
I'm using wget to mirror (part of) a site. This site contains a couple
of directories which do not have an index.html in them, just a bunch of
index.html
index.html?C=M;O=A
index.html?C=M;O=D
index.html?C=N;O=A
index.html?C=N;O=D
index.html?C=S;O=A
index.html?C=S;O=D

It seems your server is configured to send a directory listing if no
index.html is found.
By the looks of it the listing is sortable (Modified/Name/Size
Ascending/Descending).

Post by Evert Meulie
wget -np -nH --cut-dirs=3 --mirror
http://some.domain.com/folder/folder/folder/folder

If you want to get the files in this directory, I think you have to live
with them.
Otherwise it should suffice to use --exclude-directories (-X) to exclude
the directory.

Regards TT

Evert Meulie

2005-11-07 09:47:45 UTC

Hi!

Thanks for the reply. Since I have no control over the server from which I'm pulling the mirror AND I do not want to live with these files ( 8-) ), I was wondering whether there's a way to exclude
certain file names, so that I can exclude the index.html?* wildcard...?

Regards,
Evert

Tobias Tiederle

2005-11-07 10:42:55 UTC

AFAIK there's no way (with official releases) to do this.
I have a regex patch for 1.9.1 lying around on my system, but it's not
included in current wget releases (because it used PCRE instead of the
GNU/C library regex).
Last I heard, regex support is planned for 1.11.
(If you mirror this site often, why not use a script and delete them
afterwards?)
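
For readers on a newer wget: as far as I know, releases from 1.14 onward ship a --reject-regex option that is matched against the complete URL, so the sort links can be skipped without a cleanup pass. A minimal sketch, assuming such a version is installed; the function name mirror_without_sort_links and the regex are illustrative and not from this thread:

```shell
#!/bin/sh
# Hypothetical helper: mirror a URL but skip the directory-listing sort
# links (index.html?C=N;O=D and friends).
# Assumption: wget >= 1.14, where --reject-regex is available.
mirror_without_sort_links() {
    # The regex is tested against the full URL; \? is a literal '?'.
    # [NMSD] covers the Name/Modified/Size/Description sort columns.
    wget -np -nH --cut-dirs=3 --mirror \
         --reject-regex '\?C=[NMSD];O=[AD]' \
         "$1"
}

# Usage (URL taken from the original post):
# mirror_without_sort_links http://some.domain.com/folder/folder/folder/folder
```

The other options are the ones from the original command line; only the reject pattern is new.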

Regards TT

Evert Meulie

2005-11-07 11:00:21 UTC

That is not a bad idea either! :-)

Does anyone here happen to have a script that does a recursive delete of all of these index.html?* and index.html (but ONLY if there is an index.html?* file in the same directory)? Writing a script
like this exceeds my scripting capabilities... :-/

Regards,
Evert

Evert Meulie

2005-11-08 07:58:06 UTC

The Gentoo forum provided me with the following script, that seems to do the job:

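# For every directory: remove the listing files; only if that succeeded,
# also remove the index.html that was saved alongside them.
find /path/to/downloads -type d | while IFS= read -r dir; do
    rm -- "$dir"/index.html?* 2>/dev/null && rm -- "$dir/index.html"
done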

(from http://forums.gentoo.org/viewtopic-t-399594.html )
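
A more defensive variant of the same idea, as a sketch only: it checks each match explicitly instead of relying on rm's exit status, and it leaves a directory untouched when it holds nothing but a plain index.html. The function name clean_listings is made up for illustration, and directory names containing newlines are not handled.

```shell
#!/bin/sh
# Sketch: remove wget's saved directory-listing pages below a root directory.
# In each directory, delete the index.html?... files; delete index.html too,
# but only when at least one index.html?... file was found next to it.
# The function name clean_listings is hypothetical.
clean_listings() {
    find "$1" -type d | while IFS= read -r dir; do
        found=no
        for f in "$dir"/index.html\?*; do
            # An unmatched glob stays literal, so confirm the file exists.
            [ -f "$f" ] || continue
            rm -f -- "$f"
            found=yes
        done
        if [ "$found" = yes ]; then
            rm -f -- "$dir/index.html"
        fi
    done
}

# Usage: clean_listings /path/to/downloads
```

Running it on a copy of the mirror first is a cheap way to confirm that it only touches the files you expect.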

Regards,
Evert

Alan.Hall

2005-11-08 13:42:02 UTC

A less resource intensive solution might be:

find . -name "index.html*" | xargs rm

Alan.

Evert Meulie

2005-11-08 13:51:24 UTC

But that would also wipe out all legitimate index.html files, right? ;-)

Evert

Alan.Hall

2005-11-08 14:08:59 UTC

So would the one in the for loop. I just used the ".", but you could use
the same path as in the for loop:

find /path/to/downloads -name "index.html*" | xargs rm

The problem with specifying /path/to/downloads is that if the
contents are very large, some systems will fail with a "The parameter
list is too long" error. You might have to experiment a bit on your
system to figure out which works best.
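
If the length of the argument list is the worry, one option is to let find do the batching itself. A minimal sketch, assuming a POSIX find; this variant targets only the index.html?... files, so every plain index.html survives. The function name remove_sort_pages is made up for illustration.

```shell
#!/bin/sh
# Sketch: delete only the sort-link pages (index.html?...), leaving every
# plain index.html alone. The function name is hypothetical.
remove_sort_pages() {
    # '\?' is a literal question mark in find's -name pattern.
    # '-exec ... {} +' passes the names to rm in batches that fit within
    # the system's argument-length limit.
    find "$1" -type f -name 'index.html\?*' -exec rm -f -- {} +
}

# Usage: remove_sort_pages /path/to/downloads
```

Because find hands the names to rm directly, spaces and semicolons in file names need no extra quoting.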

Alan.

Oliver Schulze L.

2005-11-08 15:17:13 UTC

This line:
    rm $dir/index.html?* && rm $dir/index.html
says:
    if (the files $dir/index.html?* exist) then
        delete $dir/index.html?*
        delete $dir/index.html
    end if
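
The behaviour can be checked on a scratch directory. A small sketch, assuming a POSIX sh or bash with default globbing; the directory and file names are made up for the demo:

```shell
#!/bin/sh
# Demonstration: the command after && runs only when the command before it
# exits with status 0.
demo_dir=$(mktemp -d)
: > "$demo_dir/index.html"    # a lone index.html with no listing files

# Nothing matches index.html?*, so the first rm fails and the second rm
# is never run.
rm "$demo_dir"/index.html?* 2>/dev/null && rm "$demo_dir/index.html"

if [ -e "$demo_dir/index.html" ]; then
    echo "index.html kept"
fi
```

Strictly speaking, the second rm depends on the first rm succeeding, not on a separate existence test, but for this purpose the effect is the same.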

HTH
Oliver

--
Oliver Schulze L.
<***@samera.com.py>

Anton J. Gamel

2007-02-12 06:41:19 UTC

Hi Evert