Tsukasa OI
2018-05-03 10:00:54 UTC
URL:
<http://savannah.gnu.org/bugs/?53818>
Summary: Proposal: Check HTML suffix (for TEXTHTML flag) also
on unchanged files
Project: GNU Wget
Submitted by: a4lg
Submitted on: Thu 03 May 2018 07:00:52 PM JST
Category: Program Logic
Severity: 3 - Normal
Priority: 5 - Normal
Status: None
Privacy: Public
Assigned to: None
Originator Name:
Originator Email:
Open/Closed: Open
Discussion Lock: Any
Release: None
Operating System: GNU/Linux
Reproducibility: Every Time
Fixed Release: None
Planned Release: None
Regression: No
Work Required: None
Patch Included: Yes
_______________________________________________________
Details:
Version: 1.19.4
If both the `-r' (recursive) and `-N' (timestamping) options are given and the
server returns 304 (Not Modified), the already-downloaded HTML file is not
treated as an HTML file, so the links in it are not followed.
If we want to back up a website periodically (with all pages linked from
index.html directly or indirectly) to track changes while avoiding
unnecessary downloads, we naturally use the `-N' option. However, if some "leaf"
pages have changed but index.html has not, we could miss important
changes.
I hate this behavior (the `-nc' option mostly works because it guesses HTML
files by their file name suffix, but `-N' does not), so I decided to propose a
small change.
The attached patch reuses `get_file_flags` (which guesses whether a file is
HTML from its file name suffix *when the -nc (no clobber) option is given*)
when the server returns 304 (Not Modified).
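To illustrate the idea, here is a minimal, self-contained sketch. It is not
the attached patch and not Wget's actual code: the names `sketch_has_html_suffix`,
`sketch_flags_after_response`, `SKETCH_TEXTHTML` and `SKETCH_RETROKF` are
hypothetical stand-ins, and the suffix list ("html", "htm") is an assumption.
It only shows how a 304 response could mark an existing local file as HTML
based on its suffix, so that a recursive crawl would still parse it for links.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flag bits, standing in for Wget's document-type flags. */
enum { SKETCH_TEXTHTML = 0x1, SKETCH_RETROKF = 0x2 };

/* Case-insensitive string equality using only standard C. */
static int sketch_equal_nocase(const char *a, const char *b)
{
    while (*a && *b) {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == '\0' && *b == '\0';
}

/* Guess whether a local file is HTML from its suffix (assumed list). */
static int sketch_has_html_suffix(const char *filename)
{
    static const char *const suffixes[] = { "html", "htm" };
    const char *dot = strrchr(filename, '.');
    size_t i;

    if (dot == NULL)
        return 0;
    for (i = 0; i < sizeof suffixes / sizeof suffixes[0]; i++)
        if (sketch_equal_nocase(dot + 1, suffixes[i]))
            return 1;
    return 0;
}

/* On 304 (Not Modified), treat the existing file as retrieved OK and,
   if its suffix looks like HTML, flag it so its links get followed. */
static int sketch_flags_after_response(int http_status,
                                       const char *local_file, int flags)
{
    if (http_status == 304) {
        flags |= SKETCH_RETROKF;
        if (sketch_has_html_suffix(local_file))
            flags |= SKETCH_TEXTHTML;
    }
    return flags;
}
```

With this sketch, a 304 for "index.html" yields both flags, while a 304 for
"logo.png" yields only the "retrieved OK" flag. Like any suffix-based guess,
it misclassifies HTML stored under a non-HTML name.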
Note that:
* This patch slightly changes Wget's behavior.
* It introduces a caveat similar to bug #50935. If a solution to bug #50935 is
  found, it can be (and should be) applied here as well.
* I (as the author) consider this patch too small to be copyrighted.
I tested the patch, but I'm not sure whether it is suitable for an
upstream merge. I consider it an _improvement_, but you may consider that I
_broke_ the behavior.
Please let me know if you have any feedback.
_______________________________________________________
File Attachments:
-------------------------------------------------------
Date: Thu 03 May 2018 07:00:52 PM JST Name:
0001-Check-HTML-suffix-also-on-unchanged-files.patch Size: 2KiB By: a4lg
<http://savannah.gnu.org/bugs/download.php?file_id=44069>
_______________________________________________________
Reply to this item at:
<http://savannah.gnu.org/bugs/?53818>
_______________________________________________
Message sent via Savannah
https://savannah.gnu.org/