Junior Software Engineer - KE8TIZ
Tyler - 2021-07-26 12:53:51
So if you actually do read this site often, you may have noticed that there is now an RSS feed. It's on the main posts page, up at the top right. RSS is a very interesting technology. It was designed with the intent, it seems, to connect the whole internet in one nice syndication protocol that was easy to understand and use. And it really delivered on that! It just seems to have not caught on as much as was once expected. However, I still use newsboat for most of my YouTube and other feeds, and it works great. Despite being 'dead', most things support it (or you can find a tool to make it work).
With that said, it seems like it would be great to automate stuff with it, unix style. Pipes, bash scripting, the whole deal. However, I didn't really find anything that fit my needs. I just wanted a light, simple-to-use program that could extract things from an RSS feed and spit them out, to be further processed by something like awk. Alas, my searching turned up nothing. So I pulled up a tmux session, put on some music at full volume, and one weekend later we now have rss-cli!
rss-cli uses the very fast C++ library rapidxml to parse the RSS feed. Performance tends to be around 3ms total execution time for very large RSS feeds (~30 items) on my i7-9750H. I was getting about 10ms on a Raspberry Pi 4 for the same feed.
rss-cli parses an RSS file, which is identified by a URI. A URI is used because the program uses libcurl to fetch RSS feeds off the internet; file:///some/rss/feed.rss is also valid for local files. Once the file is grabbed, it is parsed by rapidxml, then kept in memory. When a specific attribute is requested, it is looked up at that point. This lazy-loading approach keeps execution times low: often you will not need the entire feed, only a few key bits of information for your next program to parse.
All of the meat of rss-cli is in the rss_utils namespace. I placed this
here, along with an rss_utils::rss object for interacting with the rss feed, so
that moving rss.cpp and rss.hpp to your own project can be as easy as possible.
rss_utils also contains an rss_utils::item, which represents the
Both rss_utils::rss and rss_utils::item contain clone functions, the big 3, and accessor functions for all of the possible associated elements. For example, if you want to access an rss feed's title, you would call:
std::string rss_utils::rss::getTitle() const
All responses are given as std::string, to allow for the widest compatibility possible. Each time one of these functions is called, it searches the document for that attribute and returns an empty string (std::string("")) if nothing is found. Neither of the classes ever throws exceptions. rss_utils::rss also provides an isOk() function for checking whether the RSS feed was valid. If isOk() returns false, all accessor functions will return empty strings. When attempting to get items while isOk() is false, an empty std::vector<rss_utils::item> will be returned.
rss-cli provides the --help flag to display all of the options it will accept. There are a lot of options, but this is because each option corresponds to a field in the RSS 2.0 Spec. Here is a full version of the help menu (as of 7-26-21):
Usage: rss-cli [-u FEED_URI] [CHANNEL FLAGS] [-i ITEM_INDEX] [ITEM FLAGS]
Options:
  Required Options:
    [-u, --uri] URI          URI of the rss stream
  Channel information:
    [-t, --title]            Get title of channel
    [-l, --link]             Get link to channel
    [-d, --description]      Get description of channel
    [-L, --language]         Get language code of channel
    [-m, --webmaster]        Get webMaster's email
    [-c, --copyright]        Get copyright
    [-p, --pubdate]          Get publishing date
    [-e, --managingeditor]   Get managing editor
    [-g, --generator]        Get generator of this feed
    [-o, --docs]             Get link to RSS documentation
    [-w, --ttl]              Get ttl, time that channel can be cached before being updated
    [-b, --builddate]        Get last time the channel's content changed
    [-Q, --imageurl]         Get channel image URL
    [-I, --imagetitle]       Get image title, same as ALT in html
    [-E, --imagelink]        Get link to site, image will act as a link
    [-W, --imagewidth]       Get width of image
    [-H, --imageheight]      Get height of image
    [-D, --clouddomain]      Get domain of feed update service
    [-P, --cloudport]        Get port of feed update service
    [-A, --cloudpath]        Get path to access for feed update service
    [-R, --cloudregister]    Get register procedure for feed update service
    [-O, --cloudprotocol]    Get protocol feed update service uses
    [-i, --item] INDEX       Provide index of item to display
                             If no index is provided, assume the first item in the feed.
                             All following flags will be parsed as item options, till another item is provided
  Item options:
    [-t, --title]            Get title of item
    [-l, --link]             Get link
    [-d, --description]      Get description
    [-a, --author]           Get author
    [-C, --category]         Get category list
    [-f, --comments]         Get link to comments
    [-G, --guid]             Get GUID
    [-p, --pubdate]          Get publishing date
    [-s, --source]           Get source of item
    [-U, --enclosureurl]     Get enclosure URL
    [-T, --enclosuretype]    Get enclosure MIME type
    [-K, --enclosurelength]  Get enclosure length, in bytes
  General options:
    [-h, --help]             Show this message
For more information, refer to the RSS 2.0 documentation
https://validator.w3.org/feed/docs/rss2.html
Breaking this down, we first need the -u flag to say where to get the RSS feed. Once we have that, we can pass flags to grab everything we need. The Channel information flags have to be passed before the item options. Once the -i flag has been passed, all following options must be item options, and will be applied to that item. If -h is passed anywhere, the program will display the help message and quit.
The slowest part of the program will be fetching the file using libcurl. If you plan to do several operations on the same feed, I recommend downloading the file first and using file:// to tell rss-cli where the file is.
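The download-once advice can be sketched as a small shell helper. This is only a sketch and not part of rss-cli: fetch_feed_once is a hypothetical name, and it assumes curl and mktemp are installed.

```shell
# Hypothetical helper (not part of rss-cli): download a feed once and
# print a file:// URI that can be handed to rss-cli -u.
# Assumes curl and mktemp are available on the system.
fetch_feed_once() {
    url=$1
    cache=$(mktemp) || return 1
    # -f: fail on HTTP errors, -sS: quiet but still report errors,
    # -L: follow redirects, -o: write the body to the cache file
    if curl -fsSL "$url" -o "$cache"; then
        printf 'file://%s\n' "$cache"
    else
        rm -f "$cache"
        return 1
    fi
}
```

You could then run feed=$(fetch_feed_once 'http://feeds.bbci.co.uk/news/world/us_and_canada/rss.xml') once and pass -u "$feed" to as many rss-cli calls as you like. The temporary file is yours to delete when you are done.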
Output is always printed in the order the options are listed in --help, not the order you pass them. This means that even if you run rss-cli with:
rss-cli -u file:///my/local/rss.rss --description --link --title
The output will still be:
RSS Feed Title
Feed link
Feed description
This makes output predictable and easy for other programs to understand. If an empty line is encountered, then it can be assumed the requested tag is not in the feed. This same concept applies to each item.
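Because the order is fixed, a consumer can read the output positionally. Here is a minimal sketch of that idea, assuming each field is printed on exactly one line (a description containing newlines would break it). label_channel_fields is a hypothetical name, not something rss-cli ships.

```shell
# Hypothetical helper: label the lines printed by
#   rss-cli -u URI --title --link --description
# The output order is fixed (title, link, description), so the lines
# are read positionally. An empty line means the tag was not in the feed.
label_channel_fields() {
    for name in title link description; do
        # At end of input read fails and leaves value empty,
        # which is reported the same way as a missing tag.
        IFS= read -r value || true
        if [ -z "$value" ]; then
            printf '%s: (not in feed)\n' "$name"
        else
            printf '%s: %s\n' "$name" "$value"
        fi
    done
}
```

It reads from standard input, so it would sit at the end of a pipe: rss-cli -u file:///my/local/rss.rss --title --link --description | label_channel_fields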
Grab headlines from BBC and show the top three in your bash rc
echo $(rss-cli -u http://feeds.bbci.co.uk/news/world/us_and_canada/rss.xml \
    -i0 -td -i1 -td -i2 -td)
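If you want the top N items rather than exactly three, the repeated -iN -td groups could be generated instead of typed out. A small sketch, with item_flags being a hypothetical helper name:

```shell
# Hypothetical helper: print the flags "-i0 -td -i1 -td ..." for the
# first N items, one flag per line, so that an unquoted $(item_flags N)
# splits into separate arguments.
item_flags() {
    n=$1
    i=0
    while [ "$i" -lt "$n" ]; do
        printf '%s\n' "-i$i" "-td"
        i=$((i + 1))
    done
}
```

It would be used unquoted on purpose, so the shell splits the flags into words: rss-cli -u http://feeds.bbci.co.uk/news/world/us_and_canada/rss.xml $(item_flags 5)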
Grab today's weather and put it in a file, for logging
rss-cli -u http://www.rssweather.com/zipcode/10001/rss.php -i -d >> "weather/$(date).txt"
This example uses opensource_audio from archive.org, and could be put in a cron job
wget "$(rss-cli -u 'https://archive.org/services/collection-rss.php?collection=opensource_audio' -i0 --enclosureurl)" -P ~/archive_audio
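For an unattended cron job it may be worth guarding against an item that has no enclosure, since rss-cli prints an empty line for a missing tag. A minimal sketch, assuming wget is installed; fetch_enclosure is a hypothetical name:

```shell
# Hypothetical helper: download an enclosure URL into a directory,
# skipping the download when the URL is empty (rss-cli prints an empty
# line when the requested tag is not in the feed).
fetch_enclosure() {
    enclosure_url=$1
    dest_dir=$2
    if [ -z "$enclosure_url" ]; then
        echo "no enclosure in item, skipping" >&2
        return 1
    fi
    # -q: quiet, -P: directory to save the file into
    mkdir -p "$dest_dir" && wget -q -P "$dest_dir" "$enclosure_url"
}
```

The cron entry would then call something like: fetch_enclosure "$(rss-cli -u 'https://archive.org/services/collection-rss.php?collection=opensource_audio' -i0 --enclosureurl)" ~/archive_audio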