[Bug-wget] [bug #52705] HTML assets embedding with --page-requisites
anonymous
2017-12-20 10:16:36 UTC
URL:
<http://savannah.gnu.org/bugs/?52705>

Summary: HTML assets embedding with --page-requisites
Project: GNU Wget
Submitted by: None
Submitted on: Wed 20 Dec 2017 10:16:35 AM UTC
Category: Feature Request
Severity: 3 - Normal
Priority: 5 - Normal
Status: None
Privacy: Public
Assigned to: None
Originator Name: Artur Shayhutvinov
Originator Email: ***@yandex.ru
Open/Closed: Open
Discussion Lock: Any
Release: None
Operating System: None
Reproducibility: None
Fixed Release: None
Planned Release: None
Regression: None
Work Required: None
Patch Included: None

_______________________________________________________

Details:

It would be great to have an option that makes wget save all page assets
(images, styles, scripts) embedded in one HTML file.

The old proprietary MHTML format was very useful for saving and sharing copies
of articles from the internet, but most browsers no longer support it. I have
been printing pages to PDF documents instead, but that is not a perfect
substitute: some sites look broken in print mode and others may lose
important parts of their content. Since the HTML standard supports inline
images, the problem can be solved with fairly simple postprocessing.
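
As an illustration of the postprocessing described above, here is a minimal
Python sketch that rewrites local <img> references in a page already saved by
wget --page-requisites into base64 data: URIs. It is not part of Wget; the
function names (to_data_uri, embed_images) are hypothetical, it handles images
only (not styles or scripts), and it uses a regular expression where a robust
tool would use a real HTML parser.

```python
import base64
import mimetypes
import re
from pathlib import Path

# Simplification: a regex that finds double-quoted src="..." inside <img> tags.
# It does not handle single quotes, srcset, or query strings.
IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc=")([^"]+)(")', re.IGNORECASE)


def to_data_uri(path: Path) -> str:
    """Encode a local file as a base64 data: URI."""
    mime, _ = mimetypes.guess_type(path.name)
    payload = base64.b64encode(path.read_bytes()).decode("ascii")
    return f"data:{mime or 'application/octet-stream'};base64,{payload}"


def embed_images(html: str, base_dir: Path) -> str:
    """Replace local image references in html with inline data: URIs."""
    def replace(match: re.Match) -> str:
        src = match.group(2)
        local = base_dir / src
        # Leave remote, already-inlined, or missing images untouched.
        if src.startswith(("data:", "http:", "https:", "//")) or not local.is_file():
            return match.group(0)
        return match.group(1) + to_data_uri(local) + match.group(3)

    return IMG_SRC.sub(replace, html)
```

Usage would be along the lines of reading the saved index.html, calling
embed_images(text, Path("saved_site_dir")), and writing the result to a new
file. The same approach could be extended to stylesheets and scripts by
inlining their contents into <style> and <script> elements.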

Sorry for bad English. Thanks.

_______________________________________________________

Reply to this item at:

<http://savannah.gnu.org/bugs/?52705>

_______________________________________________
Message sent via/by Savannah
http://savannah.gnu.org/
Dale Worley
2017-12-21 12:15:39 UTC
Follow-up Comment #1, bug #52705 (project wget):

I believe that wget can save a page and all its assets into a directory
structure, which can be archived in a single file in many ways.

Are there good, compatible ways to save all page assets embedded into one HTML
file?


Darshit Shah
2017-12-21 12:59:46 UTC
Follow-up Comment #2, bug #52705 (project wget):

While MHTML was a convenient way to create snapshots of pages, sadly it was
never properly standardized and most popular browsers no longer support it.

WARC has been almost standardized and is considered the de-facto way of
archiving a web page / web site.

Wget supports saving into the WARC format. So you may want to look into using
that.
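
For reference, a minimal sketch of driving that from Python, assuming a wget
binary built with WARC support is on the PATH. The --page-requisites and
--warc-file options are existing Wget options; the helper names
(build_wget_warc_command, snapshot) are hypothetical and only for
illustration.

```python
import subprocess


def build_wget_warc_command(url: str, warc_name: str) -> list[str]:
    """Build a wget invocation that fetches a page plus its requisites
    and records the traffic in a WARC file named after warc_name."""
    return ["wget", "--page-requisites", f"--warc-file={warc_name}", url]


def snapshot(url: str, warc_name: str) -> None:
    """Run wget; raises CalledProcessError if wget exits with an error."""
    subprocess.run(build_wget_warc_command(url, warc_name), check=True)
```

Calling snapshot("https://example.org/", "snapshot") should leave a WARC
archive next to the downloaded files. Wget chooses the archive's file
extension itself, so warc_name is given without one.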

Else, implementing MHTML should not be too hard. Just some postprocessing code
in all the places where WARC data is stored. However, none of the developers
currently have time to work on a new feature. So, if you could write a patch,
we might review and accept it.

Implementing this as a plugin for Wget would, however, be easier and cleaner.

Darshit Shah
2018-11-13 00:27:08 UTC
Update of bug #52705 (project wget):

Status: None => Wont Fix
Open/Closed: Open => Closed

