
PCRE-based URL rewriting feature added to GNU Wget
- Budget: £50 (approx. $67)
- Proposals: 3
- Remote
- #2936091
- PRE-FUNDED
- Awarded
Description
Experience Level: Entry
Estimated project duration: less than 1 week
I'm looking for a new feature to be added to GNU Wget which will alter URLs before they are added to the download queue, based on user-provided PCRE (Perl Compatible Regular Expression) patterns.
The feature should support multiple PCREs, passed via command-line arguments (and optionally via another mechanism, such as an input file).
Example Syntax:
wget --no-clobber --recursive --page-requisites \
--url-replace='s/&foo=\d//g' \
--url-replace='s/AAA/BBB/gi' \
--url-replace='s/(a[0-9])\.bkp\.html$/\1.BACKUP.html/' \
"http://site.tld"
[ Note that "--url-replace" does not currently exist; it is shown here only to illustrate the desired feature ]
Given the above command, Wget should apply the "--url-replace" patterns early enough that the check for an already-existing local file (i.e. --no-clobber) is performed against the post-replacement URL.
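To make the requested behaviour concrete, here is a minimal, self-contained C sketch (not part of wget) of how a single s/pattern/replacement/flags rule could be applied to a URL with the PCRE2 library. The function name url_replace_apply, the fixed output buffer, and the demo main() are illustrative assumptions only; a real patch would hook into wget's URL queueing code and would translate sed-style \1 back-references into PCRE2's $1 replacement syntax when parsing the user's expression.

/* build: cc url-replace-demo.c $(pcre2-config --cflags --libs8) */
#define PCRE2_CODE_UNIT_WIDTH 8
#include <pcre2.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Apply one rewrite rule to a URL.  'pattern' and 'replacement' are the two
   middle fields of the user's s/.../.../flags expression; 'flags' is the
   trailing flag string ("", "g", "i", "gi", ...).  Returns a newly
   allocated rewritten URL, or NULL on error. */
static char *
url_replace_apply (const char *url, const char *pattern,
                   const char *replacement, const char *flags)
{
  uint32_t compile_opts = 0, subst_opts = 0;

  if (strchr (flags, 'i'))
    compile_opts |= PCRE2_CASELESS;        /* /i modifier */
  if (strchr (flags, 'g'))
    subst_opts |= PCRE2_SUBSTITUTE_GLOBAL; /* /g modifier */

  int errcode;
  PCRE2_SIZE erroffset;
  pcre2_code *re = pcre2_compile ((PCRE2_SPTR) pattern, PCRE2_ZERO_TERMINATED,
                                  compile_opts, &errcode, &erroffset, NULL);
  if (!re)
    return NULL;

  PCRE2_SIZE outlen = 2048;                /* plenty for typical URLs */
  PCRE2_UCHAR *out = malloc (outlen);
  if (!out)
    {
      pcre2_code_free (re);
      return NULL;
    }

  int rc = pcre2_substitute (re, (PCRE2_SPTR) url, PCRE2_ZERO_TERMINATED, 0,
                             subst_opts, NULL, NULL,
                             (PCRE2_SPTR) replacement, PCRE2_ZERO_TERMINATED,
                             out, &outlen);
  pcre2_code_free (re);

  if (rc < 0)                              /* compile/match/overflow error */
    {
      free (out);
      return NULL;
    }
  return (char *) out;                     /* NUL-terminated result */
}

int
main (void)
{
  /* PCRE2's replacement syntax uses $1 where sed uses \1, so the option
     parser would translate back-references when splitting the expression. */
  char *rewritten = url_replace_apply ("https://site.tld/a3.bkp.html",
                                       "(a[0-9])\\.bkp\\.html$",
                                       "$1.BACKUP.html", "");
  if (rewritten)
    {
      printf ("%s\n", rewritten);          /* https://site.tld/a3.BACKUP.html */
      free (rewritten);
    }
  return 0;
}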
A Worked Example:
When recursively downloading / mirroring from the start page "https://site.tld/", using the "Example Syntax" command above:
- Wget finds the link "https://site.tld/aaa.html?a=1&foo=2&bar=3".
- This link is converted into "https://site.tld/BBB.html?a=1&bar=3" after applying the '--url-replace' patterns above (NB: there is a /gi modifier on the second pattern).
- If the local file "./site.tld/BBB.html?a=1&bar=3" already exists, the content will not be downloaded. Otherwise, wget will fetch it as normal.
- If Wget later finds a link to "https://site.tld/a3.bkp.html", this URL is converted to "https://site.tld/a3.BACKUP.html" before being downloaded and saved at "./site.tld/a3.BACKUP.html" (again, assuming that "./site.tld/a3.BACKUP.html" does not already exist).
NOTE: If you can suggest another approach that achieves the same objective and fits within the project budget, I'm willing to consider it. For example, passing each URL through an external script for rewriting before it is added to the queue might be a reasonable approach.
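As a rough illustration of that alternative only, the hypothetical sketch below pipes each URL through an external command (here a sed one-liner) and reads back the rewritten URL before it would be queued. The option name --url-filter-cmd mentioned in the comment and the helper rewrite_via_script are inventions for illustration; real code would also need proper shell escaping of the URL.

/* build: cc url-filter-demo.c */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Run "printf '%s' '<url>' | <cmd>" and return the (rewritten) URL the
   command prints.  A real implementation would escape the URL properly
   instead of trusting it inside single quotes. */
static char *
rewrite_via_script (const char *cmd, const char *url)
{
  char pipeline[4096];
  snprintf (pipeline, sizeof pipeline, "printf '%%s' '%s' | %s", url, cmd);

  FILE *fp = popen (pipeline, "r");
  if (!fp)
    return NULL;

  char buf[4096];
  if (!fgets (buf, sizeof buf, fp))
    {
      pclose (fp);
      return NULL;
    }
  pclose (fp);

  buf[strcspn (buf, "\n")] = '\0';         /* strip trailing newline */
  return strdup (buf);
}

int
main (void)
{
  /* e.g. a hypothetical: wget --url-filter-cmd="sed -E 's/&foo=[0-9]//g'" ... */
  char *out = rewrite_via_script ("sed -E 's/&foo=[0-9]//g'",
                                  "https://site.tld/aaa.html?a=1&foo=2&bar=3");
  if (out)
    {
      printf ("%s\n", out);                /* https://site.tld/aaa.html?a=1&bar=3 */
      free (out);
    }
  return 0;
}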
Delivery requirements:
- You will add the feature to the existing code available from https://ftp.gnu.org/gnu/wget/wget-latest.tar.gz (currently: 1.20.3)
- You will provide a zip file of the complete source code with the new feature added, and a patch file which can be applied to the original (clean) wget-1.20.3 source code.
- The updated code will compile cleanly on a Fedora 32/CentOS 8 system with normal wget dependencies already installed via DNF/yum (pcre/pcre2/gnutls/nettle/zlib/libidn2)
- The usual commands "make distclean; ./configure --enable-pcre; make" will work as expected, leaving a ready-to-execute "wget" binary at ./src/wget.
About the client:
Alex F. (United Kingdom)
- Projects completed: 1 (100%)
- Freelancers worked with: 1
- Projects awarded: 1 (100%)
- Last project: 8 Aug 2020