Cool hack with wget and xmllint
I'm rather proud of a cool hack I created through the combined use of the venerable utilities wget and xmllint.
But since software is never done, the system was lacking. More specifically, when I wrote my publishing system, RSS feeds did not include content, just metadata. Since then an extended element has been added to the RSS namespace, specifically one called "content". [2] This element allows a publisher to include HTML in their syndication but with two caveats: 1) only the true content of an HTML file is included in the syndication, meaning nothing from the HTML head element, and 2) no relative URLs are allowed because if they were, then all the URLs would be broken once the content is read somewhere other than my site. ("Duh!") Consequently, if I wanted my content to be truly syndicated, then I would need to enhance my RSS feed generator.
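To make the target concrete, here is a minimal sketch of what a syndicated item with full content might look like. It assumes the feed declares the RSS content module namespace (http://purl.org/rss/1.0/modules/content/) and uses its content:encoded element; the rss_item function and the sample values are hypothetical and are not taken from my generator.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print an RSS <item> that carries full HTML content.
# Assumes the enclosing <rss> element declares
#   xmlns:content="http://purl.org/rss/1.0/modules/content/"
# and that the title and link contain no "&" or "<" needing escaping.
rss_item() {
  local title="$1" link="$2" html="$3"
  # Split any "]]>" inside the HTML so it cannot close the CDATA section early.
  html=$(printf '%s' "$html" | sed 's/]]>/]]]]><![CDATA[>/g')
  printf '<item>\n'
  printf '  <title>%s</title>\n' "$title"
  printf '  <link>%s</link>\n' "$link"
  printf '  <content:encoded><![CDATA[%s]]></content:encoded>\n' "$html"
  printf '</item>\n'
}

# Note the absolute URL in the href; a relative one would break in a feed reader.
rss_item "Cool hack with wget and xmllint" \
  "http://infomotions.com/musings/wget-and-xmllint/" \
  '<p>See <a href="http://infomotions.com/musings/my-ide/">my IDE</a>.</p>'
```

The HTML passed in is only the body fragment, with nothing from the head element, which is exactly what the two caveats above require.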
This is where wget and xmllint make the scene. Given a URL, wget will... get the content at the other end of the URL, and as an added bonus, through the combined use of the -k and -O switches, wget will also transform all relative URLs of the cached HTML file into absolute URLs. [3] Very nice. Thus, Issue #2, above, can be resolved. To resolve Issue #1, I know that my returned HTML is well-formed, and consequently I can extract the desired content through the use of an XPath statement. Given this XPath statement, xmllint can return the desired content. [4] For a good time, I can also use xmllint to reformat the output into a nicely formatted hierarchical structure. Finally, because xmllint reads from standard input and writes to standard output, and wget's result can be fed to it from the cached file, the two can be glued together with a few tiny (Bash) commands:
    # configure
    URL="http://infomotions.com/musings/my-ide/"
    TMP="/tmp/file.html"
    XPATH='/html/body/div/div/div'

    # do the work
    CONTENT=$( wget -qkO "$TMP" "$URL"; cat "$TMP" | xmllint --xpath "$XPATH" - | xmllint --format - | tail -n +2 )
Very elegant. The final step was to translate the Bash commands into Perl code and thus incorporate the hack into my RSS generator. "Voila!"
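For anybody who wants to try the xmllint half of the pipeline without fetching anything, here is a small sketch that runs the same three stages against a local sample. The extract_content function and the sample markup are made up for illustration; only the XPath statement and the xmllint and tail invocations come from the commands above. It assumes an xmllint that supports the --xpath switch and reading from standard input.

```shell
#!/usr/bin/env bash
# Sketch: extract one element from well-formed HTML read on standard input.
# extract_content is a hypothetical name, not part of my RSS generator.
extract_content() {
  local xpath="$1"
  # First xmllint selects the node, the second pretty-prints it, and
  # tail drops the XML declaration that --format emits as the first line.
  xmllint --xpath "$xpath" - | xmllint --format - | tail -n +2
}

# Made-up, well-formed sample; the head element should not survive extraction.
SAMPLE='<html><head><title>Ignored</title></head><body><div><div><div><p>Hello, <a href="http://example.org/">world</a>.</p></div></div></div></body></html>'

if command -v xmllint >/dev/null 2>&1; then
  printf '%s\n' "$SAMPLE" | extract_content '/html/body/div/div/div'
else
  echo "xmllint not found; it usually ships with the libxml2 utilities" >&2
fi
```

Putting the stages in a function keeps the XPath statement as the only thing that changes from page to page, and that is also the part most likely to break if the page template changes.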
Again, software is never done, and if it were, then it would be called "hardware"; software requires maintenance, and after a while the maintenance can become more expensive than the development. It is very satisfying when maintenance is so inexpensive compared to development. Jettisoning WordPress was the right thing for me to do, especially considering the costs -- tiny.
Creator: Eric Lease Morgan <eric_morgan@infomotions.com>
Source: This file was originally posted as a part of Infomotions Musings
Date created: 2020-12-27
Date updated: 2020-12-27
Subject(s): publishing;
URL: http://infomotions.com/musings/wget-and-xmllint/