Discussion:
Can wget remove local files that are no longer on the remote server?
Heiko Selber
2004-09-07 21:03:01 UTC
Hello,

I use wget to mirror the contents of a remote directory (containing
patches for SuSE Linux, if you want to know the details).

It works quite well, but I can't find an option that makes wget remove
files locally that are no longer on the server.

Example: If the file foo-1.2.3-45.rpm is replaced by foo-1.2.3-46.rpm,
wget happily downloads the new file, but the old one remains locally.

For now, I have to remove the old files semi-automagically to avoid
cramming the disk.

How do I tell wget to remove them? (IMHO, this should be part of
mirroring.)

TIA,

Heiko

PS: I use wget-1.9.1

PPS: The web site http://wget.sunsite.dk says that the mailing list is
open to non-subscribers. Apparently, this is not true. I tried to send
an email to the list before I subscribed, but it didn't get through.
Justin Gombos
2004-09-09 00:27:22 UTC
Post by Heiko Selber
Hello,
Example: If the file foo-1.2.3-45.rpm is replaced by foo-1.2.3-46.rpm,
wget happily downloads the new file, but the old one remains locally.
For now, I have to remove the old files semi-automagically to avoid
cramming the disk.
Ditto. Wget is lacking trash cleanup. In addition to the case you
mention, there are also cases where a link to a file might be removed,
but the file remains. Without researching whether a file is linked to
from another part of the HTML hierarchy, admins are tempted to leave
trash around. A tool that discovers unlinked files and offers to
delete them would be quite useful.
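
A minimal sketch of such a tool, assuming the mirror is a local directory tree of static HTML pages. The names `LinkCollector` and `find_unlinked` are made up for this illustration and are not part of wget. It only follows relative `href` and `src` attributes, so files referenced from CSS or scripts, or through absolute URLs, will be reported as unlinked, as will an entry page that nothing links back to. It only reports candidates and deletes nothing.

```python
import os
from html.parser import HTMLParser
from urllib.parse import urlsplit, unquote

HTML_SUFFIXES = (".html", ".htm")


class LinkCollector(HTMLParser):
    """Collect the raw values of all href and src attributes on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def find_unlinked(root):
    """Return the files under root that no local HTML page refers to."""
    root = os.path.abspath(root)
    all_files, linked = set(), set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            all_files.add(path)
            if not name.lower().endswith(HTML_SUFFIXES):
                continue
            collector = LinkCollector()
            with open(path, encoding="utf-8", errors="replace") as page:
                collector.feed(page.read())
            for link in collector.links:
                parts = urlsplit(link)
                # Skip absolute URLs, mailto: links, bare fragments and
                # server-rooted paths; only relative links are resolved.
                if (parts.scheme or parts.netloc or not parts.path
                        or parts.path.startswith("/")):
                    continue
                target = os.path.join(dirpath, unquote(parts.path))
                linked.add(os.path.normpath(target))
    return sorted(all_files - linked)
```

Printing the result of `find_unlinked("some/mirror/dir")` and reviewing it by hand before deleting anything seems the safer way to use it.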
Heiko Selber
2004-09-22 20:49:31 UTC
Post by Heiko Selber
I use wget to mirror the contents of a remote directory (containing
patches for SuSE Linux, if you want to know the details).
It works quite well, but I can't find an option that makes wget remove
files locally that are no longer on the server.
Example: If the file foo-1.2.3-45.rpm is replaced by foo-1.2.3-46.rpm,
wget happily downloads the new file, but the old one remains locally.
For now, I have to remove the old files semi-automagically to avoid
cramming the disk.
I wrote a workaround outside of wget: a python script that recurses
through a directory and removes all local files that are not in
'.listing' after wget is done.

This means, of course, that it works only for FTP URLs, and only when wget
keeps the '.listing' files, i.e. with the option -nr ("don't remove
listing") or with -m, which implies it.

If anybody is interested:

The Python script is attached at the bottom; it is used as in this
example:

#### BEGIN mirror-suse-patches.bat ####
REM Mirror the SuSE 9.1 patches to a windoze PC
REM needs python 2.3 from www.python.org
wget -o wget.log -nH -m -X /suse/i386/update/9.1/rpm/src ftp://ftp.suse.com/pub/suse/i386/update/9.1/
python cleanup-wget.py suse >> wget.log
#### END mirror-suse-patches.bat ####

NOTE: USE AT OWN RISK!

The script is pretty crude and most likely doesn't work for you. It
assumes that the file names in the listing start at column 56, and it
cannot deal with symlinks. I tested it only for the batch file above.

You have been warned.
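
One way to loosen the column-56 assumption is to split each listing line into whitespace-separated fields instead of counting characters. The sketch below assumes a Unix `ls -l` style listing with eight fields before the name (permissions, link count, owner, group, size, month, day, time or year); servers that omit the group column or send a different format would need other handling. The function name `name_from_listing_line` is invented for this example.

```python
def name_from_listing_line(line):
    """Return the file name from one Unix-style listing line, or None."""
    # Split off the first eight fields; the rest is the name, which may
    # itself contain spaces.
    fields = line.rstrip("\r\n").split(None, 8)
    if len(fields) < 9:
        return None  # e.g. a "total 42" header or a blank line
    name = fields[8]
    # Symlinks are listed as "name -> target"; keep only the link name.
    if fields[0].startswith("l") and " -> " in name:
        name = name.split(" -> ", 1)[0]
    return name
```

Lines for which it returns None should simply be skipped when building the list of remote names.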

Regards,

Heiko

#### BEGIN cleanup-wget.py ####
"""
Clean up a directory downloaded with wget.

The problem is that wget doesn't remove local files
if they no longer exist remotely. This leads to a pileup of old files
and a waste of disk space.

This script scans the '.listing' file in each directory and removes all
files (except '.listing') that are not listed there.

This works only for ftp downloads (otherwise there would be no '.listing').
"""

def fileNameFromLine(line):
    """
    Extract a file name from a line of listing.

    It is assumed that the file name starts at the 56th character.
    This is probably a bit crude, because the listing may be formatted
    in a different way. For the moment, I don't have examples of other
    .listings, so let it be.
    """

    # truncate the first 55 characters
    fileName=line[55:]

    # remove a trailing '\n' (which is a whitespace character)
    fileName=fileName.rstrip()

    return fileName

def parseListing(filename):
    """
    Open a file and extract a filename from each line,
    returning a list of strings.
    """

    # open() instead of file(), which exists only in Python 2
    listingFile=open(filename,'r')

    returnList=[]

    # iterating over the file yields one line at a time
    for line in listingFile:
        returnList.append(fileNameFromLine(line))

    listingFile.close()

    return returnList

import os
import sys

def recurse(dir):
    # print(...) with a single argument behaves the same in Python 2 and 3
    print('entering %s' % dir)
    # get all files in local directory
    localFileList=os.listdir(dir)
    # go through local file list
    for filename in localFileList:
        subdir=os.path.join(dir,filename)
        # treat all subdirectories
        if os.path.isdir(subdir):
            recurse(subdir)

    # if there is a .listing
    if '.listing' in localFileList:
        # extract a list of remote files from it
        remoteFileList=parseListing(os.path.join(dir,'.listing'))
        # remove all local files that have no remote counterpart
        for listFile in localFileList:
            localPath=os.path.join(dir,listFile)
            if listFile not in remoteFileList and listFile!='.listing':
                # os.remove() fails on directories, so only delete plain files
                if os.path.isfile(localPath):
                    print('removing %s' % listFile)
                    os.remove(localPath)

if __name__=='__main__':
    if 2 == len(sys.argv):
        dir=sys.argv[1]
        print('')
        print('removing obsolete local files...')
        recurse(dir)
    else:
        print('')
        print('usage: %s <directory name>' % sys.argv[0])
#### END cleanup-wget.py ####
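
Because the script deletes files as it goes, a dry run that only lists what would be removed can be a useful first step. The following is a minimal sketch of that idea, not a change to the script above: the names `stale_files`, `cleanup`, `remote_names` and `dry_run` are assumptions made for illustration, and `remote_names` is expected to be the set of names already extracted from a directory's '.listing'.

```python
import os


def stale_files(directory, remote_names, keep=(".listing",)):
    """Return sorted paths of plain local files that are not in remote_names."""
    stale = []
    for name in sorted(os.listdir(directory)):
        if name in keep or name in remote_names:
            continue
        path = os.path.join(directory, name)
        # Leave directories and symlinks alone; only plain files qualify.
        if os.path.isfile(path) and not os.path.islink(path):
            stale.append(path)
    return stale


def cleanup(directory, remote_names, dry_run=True):
    """List stale files; delete them only when dry_run is False."""
    candidates = stale_files(directory, remote_names)
    if not dry_run:
        for path in candidates:
            os.remove(path)
    return candidates
```

Calling `cleanup(path, names)` first and reading the returned list, then repeating the call with `dry_run=False`, keeps the destructive step explicit.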
Dillonco
2004-11-03 05:38:05 UTC
Post by Heiko Selber
Hello,
I use wget to mirror the contents of a remote directory (containing
patches for SuSE Linux, if you want to know the details).
It works quite well, but I can't find an option that makes wget remove
files locally that are no longer on the server.
Yes, this is very annoying.

I found this online:
http://mrmt.net/linux/wget.html
However, I have not tried it. It is for an older version, and I have no
experience with patches. However, it seems amazingly easy. I do wonder
why it's not implemented...

Even though you kind of solved your problem, hopefully this will help someone
(or a developer will see it and add it).
Mauro Tortonesi
2004-11-06 18:37:50 UTC
Post by Dillonco
Post by Heiko Selber
Hello,
I use wget to mirror the contents of a remote directory (containing
patches for SuSE Linux, if you want to know the details).
It works quite well, but I can't find an option that makes wget remove
files locally that are no longer on the server.
Yes, this is very annoying.
http://mrmt.net/linux/wget.html
However, I have not tried it. It is for an older version, and I have no
experience with patches. However, it seems amazingly easy. I do wonder
why it's not implemented...
Even though you kind of solved your problem, hopefully this will help someone
(or a developer will see it and add it).
you're right. we should definitely merge this code into wget. does any of you
have the email address of the original developer? or maybe you are interested
in providing your own implementation and posting the code on the wget-patches
list?
--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi

University of Ferrara - Dept. of Eng. http://www.ing.unife.it
Institute of Human & Machine Cognition http://www.ihmc.us
Deep Space 6 - IPv6 for Linux http://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it
Dillonco
2004-11-06 19:45:42 UTC
Post by Mauro Tortonesi
Post by Dillonco
Post by Heiko Selber
Hello,
I use wget to mirror the contents of a remote directory (containing
patches for SuSE Linux, if you want to know the details).
It works quite well, but I can't find an option that makes wget remove
files locally that are no longer on the server.
Yes, this is very annoying.
http://mrmt.net/linux/wget.html
However, I have not tried it. It is for an older version, and I have no
experience with patches. However, it seems amazingly easy. I do
wonder why it's not implemented...
Even though you kind of solved your problem, hopefully this will help
someone (or a developer will see it and add it).
you're right. we should definitely merge this code into wget. does any of
you have the email address of the original developer? or maybe you are
interested in providing your own implementation and posting the code on
the wget-patches list?
The only email address I found is "morimoto AATT xantia.citroen.org", but I
don't know if that person is the developer (or can speak much English).

As for my own implementation, I can't imagine a way to do this that would be
much simpler. Anything I would code would be almost exactly the same
supposing wget hasn't changed too much (something I doubt). However, if
you'd like me to, I certainly could.
