Raymii.org
totext.py - Convert URL or RSS feed to text with readability
Published: 18-04-2019 | Author: Remy van Elst
❗ This post is over five years old. It may no longer be up to date. Opinions may have changed.
Love plaintext? This script downloads a URL, parses it with readability and returns the plaintext (as Markdown). It also supports RSS feeds, converting every article in the feed, and saves every converted article.
My use case is twofold: first, to convert RSS feeds to a Gopher site; second, to get the full text of articles in my RSS reader.
The script contains a few workarounds for so-called cookie walls. It also pauses between RSS feed articles to avoid sending excessive requests.
The readability part is handled in Python; no external services are used.
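To give an idea of what that pipeline looks like, here is a minimal sketch using the same libraries that the installation section lists. It is not the actual source of totext.py: the function names html_to_markdown and url_to_markdown are made up for illustration, the 30 second timeout is an arbitrary choice, and the cookie wall workarounds are left out. It is written for Python 3.

```python
def html_to_markdown(html):
    """Extract the main article from raw HTML and return it as Markdown."""
    # Third-party imports are kept local so this sketch can be loaded even
    # where the packages from the pip command are not installed yet.
    import html2text
    from readability import Document

    doc = Document(html)
    article_html = doc.summary()  # cleaned-up HTML of the main content only
    body = html2text.html2text(article_html)  # HTML -> Markdown-style text
    return "# {}\n\n{}".format(doc.title(), body)


def url_to_markdown(url, timeout=30):
    """Download a page and convert it (hypothetical helper, not from totext.py)."""
    import requests

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly on HTTP errors
    return html_to_markdown(response.text)
```

Calling url_to_markdown("https://raymii.org/") would then return the page as text, with everything running locally.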
Here's an example of a news article. On the left, the text-only parsed version, on the right, the webpage:
Installation
First install the required libraries.
On Ubuntu:
apt-get install python python-pip #python2
pip install html2text requests readability-lxml feedparser
On other distros, use the pip command above.
Clone the repository:
git clone https://github.com/RaymiiOrg/to-text.py
Usage
usage: totext.py [-h] -u URL [-s SLEEP] [-r] [-n]
Convert HTML page to text using readability and html2text.
arguments:
  -h, --help            show this help message and exit
  -u URL, --url URL     URL to convert (Required)
  -s SLEEP, --sleep SLEEP
                        Sleep X seconds between URLs (only in rss)
  -r, --rss             URL is RSS feed. Parse every item in feed
  -n, --noprint         Dont print converted contents
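For reference, a command line like this can be declared with argparse from the standard library. The sketch below only mirrors the options shown in the help text; it is not copied from totext.py. The integer type and the default of 3 seconds for --sleep are assumptions, since the help text does not state them.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the options in the usage text."""
    parser = argparse.ArgumentParser(
        prog="totext.py",
        description="Convert HTML page to text using readability and html2text.",
    )
    parser.add_argument("-u", "--url", required=True,
                        help="URL to convert (Required)")
    # Assumption: whole seconds, default 3. The real default is not documented here.
    parser.add_argument("-s", "--sleep", type=int, default=3,
                        help="Sleep X seconds between URLs (only in rss)")
    parser.add_argument("-r", "--rss", action="store_true",
                        help="URL is RSS feed. Parse every item in feed")
    parser.add_argument("-n", "--noprint", action="store_true",
                        help="Dont print converted contents")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```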
If you want to run the script via a cronjob, use the -n option to suppress the output.
If parsing fails, the article will contain the text: parsing failed.
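The RSS mode, the pause between items and the "parsing failed" marker fit together roughly like the loop below. This is a sketch under assumptions, not the script's own code: convert_all is a hypothetical name, convert stands for any function that turns a URL into text, and the sleep function is passed in as a parameter only so the loop can be tried out without actually waiting.

```python
import time


def convert_all(urls, convert, sleep_seconds=3, sleep=time.sleep):
    """Convert each URL in turn, pausing between requests.

    A failed conversion is recorded as the text "parsing failed"
    instead of aborting the whole run.
    """
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(sleep_seconds)  # wait before the next request
        try:
            results[url] = convert(url)
        except Exception:
            results[url] = "parsing failed"
    return results
```

With feedparser installed, the list of article URLs for a feed can be collected with [entry.link for entry in feedparser.parse(feed_url).entries] and passed in as urls.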
Examples
python totext.py --rss --url https://raymii.org/s/feed.xml
python totext.py --url https://www.rd.nl/vandaag/binnenland/grootste-stijging-verkeersdoden-in-jaren-1.1562067
Saved text
Every converted file is also saved to the folder saved/$hostname. The filenames are sorted by date.
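A layout like that can be produced with a few lines of standard library code. In the sketch below, the saved/$hostname folder comes from the text above, but the exact filename pattern (a timestamp prefix followed by the URL path) is an assumption chosen so that an alphabetical listing is also ordered by date; save_article is a hypothetical helper name.

```python
import os
from datetime import datetime
from urllib.parse import urlparse


def save_article(url, text, base_dir="saved", now=None):
    """Write text to <base_dir>/<hostname>/<timestamp>_<slug>.txt and return the path."""
    now = now or datetime.now()
    parsed = urlparse(url)
    hostname = parsed.hostname or "unknown-host"
    folder = os.path.join(base_dir, hostname)
    os.makedirs(folder, exist_ok=True)

    # Assumed naming scheme: the timestamp goes first so names sort by date.
    slug = parsed.path.strip("/").replace("/", "_") or "index"
    filename = "{}_{}.txt".format(now.strftime("%Y%m%dT%H%M%S"), slug)

    path = os.path.join(folder, filename)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)
    return path
```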
License
GNU GPLv2.