Discussion:
How do I prevent wget from creating index.html?C=M;O=A ?
Evert Meulie
2005-11-03 10:11:52 UTC
Hi all!

Since I couldn't find this in the FAQ, I'm hoping someone here can help me:

I'm using wget to mirror (part of) a site. This site contains a couple of directories which do not have an index.html in them, just a bunch of various files. When wget hits such a directory, it creates:
index.html
index.html?C=M;O=A
index.html?C=M;O=D
index.html?C=N;O=A
index.html?C=N;O=D
index.html?C=S;O=A
index.html?C=S;O=D

How do I prevent wget from doing so? I'm currently using the following:
wget -np -nH --cut-dirs=3 --mirror http://some.domain.com/folder/folder/folder/folder



Regards,
Evert
Tobias Tiederle
2005-11-07 09:31:50 UTC
Post by Evert Meulie
I'm using wget to mirror (part of) a site. This site contains a couple
of directories which do not have a index.html in them, just a bunch of
index.html
index.html?C=M;O=A
index.html?C=M;O=D
index.html?C=N;O=A
index.html?C=N;O=D
index.html?C=S;O=A
index.html?C=S;O=D
It seems your server is configured to send a directory listing if no
index.html is found.
By the looks of it the listing is sortable (Modified/Name/Size
Ascending/Descending).
Post by Evert Meulie
wget -np -nH --cut-dirs=3 --mirror http://some.domain.com/folder/folder/folder/folder
If you want to get the files in this directory, I think you have to live
with them.
Otherwise it should suffice to use --exclude-directories (-X) to exclude the directory.

Regards TT
Evert Meulie
2005-11-07 09:47:45 UTC
Hi!

Thanks for the reply. Since I have no control over the server from which I'm pulling the mirror AND I do not want to live with these files ( 8-) ), I was wondering whether there's a way to exclude
certain file names, so that I can exclude the index.html?* wildcard...?

Regards,
Evert
Tobias Tiederle
2005-11-07 10:42:55 UTC
Post by Evert Meulie
Hi!
Thanks for the reply. Since I have no control over the server from
which I'm pulling the mirror AND I do not want to live with these
files ( 8-) ), I was wondering whether there's a way to exclude
certain file names, so that I can exclude the index.html?* wildcard...?
afaik there's no way (with official releases) to do this.
I have a regex patch for 1.9.1 lying around on my system, but it's not
included in current wget releases (because it used pcre instead of gnu
regex/c library regex).
Last I heard, regex support is planned for 1.11.
(If you mirror this site often, why not use a script and delete them
afterwards?)

Regards TT
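
For readers on a newer wget: as far as I know, releases from 1.14 onward have a --reject-regex option that is matched against the complete URL, which would cover the case discussed here. The sketch below is not from this thread. The pattern, the variable and function names, and the sample URLs are assumptions for illustration. It only classifies sample URLs locally; the wget line at the end is left commented out.

```shell
#!/bin/sh
# Assumed pattern for the sort links of a server-generated listing
# (C = column, O = order), anchored at the end of the URL.
listing_re='[?]C=[NMSD];O=[AD]$'

# Hypothetical helper: prints "reject" or "keep" for one URL.
classify_url() {
    if printf '%s\n' "$1" | grep -Eq "$listing_re"; then
        echo reject
    else
        echo keep
    fi
}

# Dry run over a few sample URLs, no network access involved.
for url in \
    'http://some.domain.com/folder/' \
    'http://some.domain.com/folder/?C=M;O=A' \
    'http://some.domain.com/folder/file.tar.gz' \
    'http://some.domain.com/folder/?C=N;O=D'
do
    printf '%-6s %s\n' "$(classify_url "$url")" "$url"
done

# With a wget that supports the option (assumption: 1.14 or newer):
# wget -np -nH --cut-dirs=3 --mirror --reject-regex "$listing_re" \
#     http://some.domain.com/folder/folder/folder/folder
```

The plain directory URL is kept, so wget can still read the listing and follow the links to the real files; only the re-sorted copies of the listing are skipped.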
Evert Meulie
2005-11-07 11:00:21 UTC
Post by Tobias Tiederle
(If you mirror this site often, why not use a script and delete them
afterwards?)
That is not a bad idea either! :-)

Does anyone here happen to have a script that does a recursive delete of all of these index.html?* and index.html (but ONLY if there is an index.html?* file in the same directory)? Writing a script like this exceeds my scripting capabilities... :-/

Regards,
Evert
Evert Meulie
2005-11-08 07:58:06 UTC
The Gentoo forum provided me with the following script, which seems to do the job:

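# rm fails when no index.html?* file matches, so index.html is only removed where listings existed.
find /path/to/downloads -type d | while IFS= read -r dir; do
    rm -- "$dir"/index.html\?* && rm -- "$dir/index.html"
done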

(from http://forums.gentoo.org/viewtopic-t-399594.html )


Regards,
Evert
Alan.Hall
2005-11-08 13:42:02 UTC
A less resource-intensive solution might be:

find . -name "index*.html" |xargs rm

Alan.
Evert Meulie
2005-11-08 13:51:24 UTC
But that would also wipe out all legitimate index.html files, right? ;-)

Evert
Alan.Hall
2005-11-08 14:08:59 UTC
So would the one in the for loop. I just used ".", but you could use
the same path as in the for loop:

find /path/to/downloads -name "index*.html" |xargs rm

The problem with specifying the /path/to/downloads is that if the
contents are very large, some systems will error with the "The parameter
list is too long" error. You might have to tweak it a bit on your
system to figure out which works best.

Alan.
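
One detail worth checking before running the find commands in this post: -name "index*.html" only matches names that end in .html, so it selects index.html itself but not index.html?C=M;O=A. A minimal sketch of a variant that targets only the sort-order files follows. The function name and the demo tree are made up for illustration, and -print0 / xargs -0 are extensions found in GNU and BSD tools, so treat their availability as an assumption about your system. xargs splits long lists into several rm calls, which is the usual way around "parameter list too long" errors.

```shell
#!/bin/sh
# Hypothetical helper: delete only the sort-order listings
# (index.html?C=...;O=...) below $1 and leave every index.html alone.
remove_listings() {
    find "$1" -type f -name 'index.html[?]C=[DMNS];O=[AD]' -print0 |
        xargs -0 rm -f --
}

# Demo on a throwaway tree.
demo=$(mktemp -d)
mkdir -p "$demo/files" "$demo/site"
touch "$demo/files/index.html" \
      "$demo/files/index.html?C=M;O=A" \
      "$demo/files/index.html?C=S;O=D" \
      "$demo/site/index.html"

remove_listings "$demo"

# Lists what survived: both index.html files, none of the ?C=...;O=... copies.
find "$demo" -type f | sort
rm -rf "$demo"
```

The [?] in the pattern makes the question mark literal; a bare ? would match any single character.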
Oliver Schulze L.
2005-11-08 15:17:13 UTC
This line:
rm $dir/index.html?* && rm $dir/index.html
says:
if (files matching $dir/index.html?* exist) then
    delete $dir/index.html?*
    delete $dir/index.html
end if
So a directory that only contains a legitimate index.html is left alone.

HTH
Oliver
--
Oliver Schulze L.
<***@samera.com.py>
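
The behaviour described in this post can be checked on a scratch directory before pointing the loop at a real mirror. The sketch below is only an illustration: the function name clean_listing_dirs and the demo layout are assumptions, and the loop is a quoted, whitespace-safe variant of the Gentoo script written with if instead of && to make the condition explicit.

```shell
#!/bin/sh
# Hypothetical helper: in every directory below $1, delete the
# index.html?* files and, only where some existed, index.html too.
clean_listing_dirs() {
    find "$1" -type d | while IFS= read -r dir; do
        # rm exits non-zero when no index.html?* file exists, so the
        # plain index.html is only touched in directories with listings.
        if rm -- "$dir"/index.html\?* 2>/dev/null; then
            rm -f -- "$dir/index.html"
        fi
    done
}

# Demo: one directory with a generated listing, one with a real index.html.
demo=$(mktemp -d)
mkdir -p "$demo/listing dir" "$demo/real"
touch "$demo/listing dir/index.html" \
      "$demo/listing dir/index.html?C=M;O=A" \
      "$demo/listing dir/data.bin" \
      "$demo/real/index.html"

clean_listing_dirs "$demo"

# Lists what survived: data.bin and the index.html under real/.
find "$demo" -type f | sort
rm -rf "$demo"
```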
Anton J. Gamel
2007-02-12 06:41:19 UTC
Hi Evert
Post by Evert Meulie
Post by Tobias Tiederle
Post by Evert Meulie
that I can exclude the index.html?* wildcard...?
afaik there's no way (with official releases) to do this.
(If you mirror this site often, why not use a script and delete them
afterwards?)
That is not a bad idea either!
Does anyone here happen to have a script that does a recursive delete of all
of these index.html?* and index.html (but ONLY if there is a index.html?* file
in the same directory)? Writing a script like this exceeds my scripting
capabilities... :-/
I have the same "problem", but I decided not to bother with the
index.html files themselves and only use:

find /mirrór/tree -name "*C=[DMNS];O=[AD]" -exec rm -f "{}" \;

Greetings

Anton

--
nice forum btw. ;-)
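
A cautious way to use a pattern like the one in this post is to list the matches first and delete in a second step. This is a sketch under assumptions: the helper names and the demo tree are hypothetical, and the "-exec ... {} +" form, which hands many file names to one rm call, is specified by POSIX but may be missing from very old find implementations.

```shell
#!/bin/sh
# Same name pattern as in the post above.
pattern='*C=[DMNS];O=[AD]'

# Hypothetical helpers: preview first, delete second.
list_sort_links()   { find "$1" -type f -name "$pattern" -print; }
delete_sort_links() { find "$1" -type f -name "$pattern" -exec rm -f -- {} +; }

# Demo on a throwaway directory.
demo=$(mktemp -d)
touch "$demo/index.html" "$demo/index.html?C=N;O=D" "$demo/notes.txt"

list_sort_links "$demo"     # shows what would be removed
delete_sort_links "$demo"   # removes exactly those files
ls "$demo"                  # index.html and notes.txt remain
rm -rf "$demo"
```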
