[Bug-wget] Wget follows "button" links

Tim Rühsen

2018-06-05 14:37:57 UTC

Hi,

"Both --no-clobber and --convert-links were specified, only
--convert-links will be used."

Right, I missed that. The combination of both flags was buggy by design
(also in 1.12) and suffered from several flaws (not to say bugs).

The regex would be more like '.*/xpage=watch.*'. The exact syntax depends
on the regex type selected with --regex-type=TYPE (posix|pcre).
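
A candidate pattern can be tried against sample URLs before it is handed
to wget. The following is a minimal sketch, assuming the default posix
regex type behaves like the extended regular expressions of grep -E for
simple patterns such as this one. The sample URLs, the pattern and the
function name are hypothetical, modelled on the URLs described in this
thread. Note that the watch links contain "?xpage=watch", so a pattern
that requires "/xpage=watch" would not match them.

```shell
#!/bin/sh
# Hypothetical sample URLs modelled on the ones described in this thread.
urls='http://www.wiki.com/WIKI-PAGE-NAME
http://www.wiki.com/WIKI-PAGE-NAME?xpage=watch&do=adddocument
http://www.wiki.com/delete/WIKI-PAGE-NAME
http://www.wiki.com/remove/WIKI-PAGE-NAME'

# Print the sample URLs that the given extended regex matches,
# i.e. the ones a reject pattern would skip.
rejected() {
    printf '%s\n' "$urls" | grep -E -- "$1"
}

# Candidate pattern: the action directories or the watch query parameter.
pattern='/(delete|remove)/|[?&]xpage=watch'

rejected "$pattern"
```

If the output lists exactly the URLs that should be skipped, the same
pattern could be passed to a wget newer than 1.12 with --reject-regex,
alongside the existing options.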

What else can you do... try wget2. It allows the combination of
--no-clobber and --convert-links. And if you find bugs, they can be fixed
(unlike in wget 1.x, where we would have to redesign a whole lot of things).

See https://gitlab.com/gnuwget/wget2

If you'd rather not build from git, you can download a fairly recent
tarball from https://alpha.gnu.org/gnu/wget/wget2-1.99.1.tar.gz.

Signature at https://alpha.gnu.org/gnu/wget/wget2-1.99.1.tar.gz.sig

Regards, Tim

Hey Tim,
Please see http://savannah.gnu.org/bugs/?31781, where this was implemented. Since version 1.12.1, wget prints
"Both --no-clobber and --convert-links were specified, only --convert-links will be used."
as a response.
Anyway, I might make do without -nc if I can use the regex argument. Could you give an example of how that argument would work in my case? Can I just use www.mywiki.com/delete/* as an argument, for example? Or .*/xpage=watch.*?
Thanks!
------- Original Message -------

Hi,
in this case you could try it with -X / --exclude-directories.
E.g. wget -X /delete,/remove
That wouldn't help with "xpage=watch..." though.
And I can't tell you if and how well -X works with wget 1.12.
Why (or since when) doesn't --no-clobber plus --convert-links work any
more?
Please feel free to open a bug report at
https://savannah.gnu.org/bugs/?func=additem&group=wget with a detailed
description.
Because it works for me :-)
Regards, Tim

Hey Tim,
Thanks for the info. The wiki software we use (xwiki) appends something to wiki page URLs to express a certain behavior. For example, to "watch" a page, the button, once pressed, redirects you to "www.wiki.com/WIKI-PAGE-NAME?xpage=watch&do=adddocument"
where the only thing that changes is the "WIKI-PAGE-NAME" part.
Also, for actions such as "deleting" or "reverting" a wiki page, the URL changes by adding /remove/ or /delete/ "sub-folders" to the URL. These are usually in the middle, before the actual page name. For example: www.wiki.com/delete/WIKI-PAGE-NAME. So in this case the "offending" part is in the middle of the actual wiki page URL.
What I need is to keep wget from visiting any www.wiki.com/delete/ or www.wiki.com/remove/ pages. I'd also need to exclude links that end with "xpage=watch&do=adddocument", which cause me to watch that page.
I am using v1.12 because the most recent versions have disabled --no-clobber and --convert-links from working together. I need --no-clobber because if the download stops, I need to be able to resume without re-downloading all the files. And I need --convert-links because this needs to work as a local copy.
From my understanding the options you mention have been added after v1.12. Is there any way to achieve this?
BTW, -N (timestamps) doesn't work, as the server on which the wiki is hosted doesn't seem to support this, hence wget keeps redownloading the same files.
Thanks a lot!
------- Original Message -------

Hey there,
wget --user-agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36" --user=myuser --ask-password --no-check-certificate --recursive --page-requisites --adjust-extension --span-hosts --restrict-file-names=windows --domains wiki.com --no-parent wiki.com --no-clobber --convert-links --wait=0 --quota=inf -P /home/W
I use the command above to download a wiki. The problem is that it follows "button" links, e.g. the links that allow a user to put a page on a watchlist for further modifications. This has led to me watching hundreds of pages. Not only that, but apparently it also follows the links that revert changes made by others on a page.
Is there a way to avoid this behavior?

Hi,
that depends on how these "button links" are realized.
A button may be part of an HTML FORM tag/structure where the URL is the
value of the 'action' attribute. Wget doesn't download such URLs because
of the problem you describe.
A dynamic web page can realize "button links" by using simple links.
Wget doesn't know about hidden semantics and so downloads these URLs -
and maybe they trigger some changes in a database.
If this is your issue, you have to look into the HTML files and exclude
those URLs from being downloaded. Or you create a whitelist. Look at
options -A/-R and --accept-regex and --reject-regex.
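
One way to look into the HTML files is to list the link targets of pages
that were already downloaded and pick out those that look like actions.
This is a minimal sketch, assuming links are written as href="..." with
double quotes and that action URLs either carry a query string or sit
under /delete/ or /remove/ as described in this thread. The function
names are hypothetical.

```shell
#!/bin/sh
# Print every href value found in the given HTML files, de-duplicated.
# Only double-quoted href attributes are recognised (an assumption).
list_hrefs() {
    cat -- "$@" | grep -o 'href="[^"]*"' | sed 's/^href="//; s/"$//' | sort -u
}

# Keep only links that carry a query string or pass through an
# action directory; these are candidates for exclusion.
suspicious_hrefs() {
    list_hrefs "$@" | grep -E '[?]|/(delete|remove)/'
}
```

For example, running suspicious_hrefs *.html in the download directory
prints candidate URLs to cover with --reject-regex or -X.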

wget --version
GNU Wget 1.12 built on linux-gnu.

Ok, you should update wget if possible. Latest version is 1.19.5.
Regards, Tim