Web Scraping RSS


How it started

I requested this project to learn how to web-scrape, to understand how Really Simple Syndication (RSS) functions as a protocol, and to improve my skills at building functional websites. To accomplish these goals, my plan was to:

  • Create a WordPress web page
  • Host the output from RSS feeds for news on the page
  • Eliminate redundant news articles from the feed (data cleaning)
  • Display feeds either as weblinks with timestamps or as "glimpses" showing a short summary and supporting data

I was ambitious and hoped to go even further by adding functionality for:

  • Scanning feeds using Locality Sensitive Hashing to eliminate near-duplicate articles republished by secondary news sources (a rough sketch of this idea follows the list)
  • Aggregating feeds twice daily into "Top News" summaries using Natural Language Processing
  • Publishing summaries to subscribers
  • Adding options for selective news types such as "political news," "economic news," etc.
  • Adding options for simple search terms to gather specific news items
  • Auto-archiving news articles
  • Retrieving feeds from additional news sources such as Reddit, Twitter, etc.
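As a rough illustration of the Locality Sensitive Hashing idea above (this was only a stretch goal and is not code from the project), a de-duplication pass could build MinHash signatures for each article, for example with the datasketch library; the threshold and word-level tokenization here are illustrative assumptions.

    # Rough sketch of LSH-based de-duplication with the datasketch library
    # (illustrative only; threshold and tokenization are assumptions, not project code).
    from datasketch import MinHash, MinHashLSH

    def minhash(text, num_perm=128):
        """Build a MinHash signature from the article's words."""
        m = MinHash(num_perm=num_perm)
        for token in text.lower().split():
            m.update(token.encode("utf8"))
        return m

    # Articles whose signatures are roughly 70%+ similar are treated as the same story.
    lsh = MinHashLSH(threshold=0.7, num_perm=128)

    articles = {
        "source-a/1": "Markets rally as central bank holds interest rates steady",
        "source-b/7": "Central bank holds interest rates steady and markets rally",
    }

    for key, text in articles.items():
        sig = minhash(text)
        if lsh.query(sig):       # a near-identical article was already kept
            print(f"skipping duplicate: {key}")
            continue
        lsh.insert(key, sig)     # otherwise keep it and remember its signature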

My planned deliverables were the four core items from the plan above: the WordPress web page, the hosted RSS news feeds, elimination of redundant articles (data cleaning), and the two display formats.

How it went

Before I started the project, I had already looked at the NewsNow.com website and how it displays RSS feed items as Article Title (as a link to the article), Publication Name, and Publication TimeDate. I had also started to gather /RSS and /FEED addresses for over 60 news publishers in anticipation of using several.

Initially, I researched how to set up a WordPress website so I could get a template page online to pull information using JavaScript and display the scraped information using HTML and CSS. I quickly discovered that there is a large difference between WordPress.org and WordPress.com. While I could (and did) set up a template WordPress.com page in minutes, I found that all plugins and actions require a paid account and, since no available plugin did exactly what I wanted (scrape from a large number of sites), I would have to build my own plugin in PHP and submit it to WordPress.com for review (which could take weeks) before I could use it. Similarly, WordPress.org would require extensive work in PHP to build my scraping and display functions, though WordPress.org does not charge a fee to use actions or plugins.

I next moved to GitHub, where I discovered that GitHub offers free static hosting through GitHub Pages for a personal site tied to a dedicated repository. I found a basic webpage template on html5up.net, created a new repository, and configured its settings to deploy my website. The template deployed successfully, and I tried a few available services that send pre-formatted news articles; most of these I was able to add to the pages easily, but with major drawbacks. Every service I could find limited the number of articles and sources it would provide, and the services did not work well together. At this point, unless I decided to pay for several services from multiple outlets, my only other option was to build my own tool.

I moved on to a few older tutorials I'd found to try to scrape data from the publishers' RSS pages directly in the browser. Nothing I tried worked, and I was able to determine the problem was Cross-Origin Resource Sharing (CORS). Essentially, browsers enforce a same-origin policy: JavaScript running on my page cannot read responses fetched from another domain unless that domain explicitly allows it through CORS headers, so client-side scraping of the publishers' feeds was blocked (short of some very extensive configurations and proxy services).

This pushed me to a full programming language with libraries built to support web scraping, namely Python. I found a Python library called lxml, which let me take the raw content fetched with the requests library from whatever URL I passed and parse it into an XML object. Using this method, I was able to scrape any RSS page and somewhat successfully parse the data I needed into a dataframe (with some difficulties, because publishers do not follow a consistent standard for their RSS pages).
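The sketch below shows roughly what that requests + lxml approach looks like, assuming a placeholder feed URL and the standard RSS 2.0 <item> fields (title, link, pubDate); real publisher feeds deviate from this structure, which is where most of my parsing difficulties came from.

    # Minimal sketch of the requests + lxml scraping step (feed URL is a placeholder).
    import requests
    import pandas as pd
    from lxml import etree

    FEED_URL = "https://example.com/rss"  # hypothetical publisher feed address

    response = requests.get(FEED_URL, timeout=10)
    root = etree.fromstring(response.content)  # parse the raw bytes into an XML tree

    rows = []
    for item in root.iter("item"):  # standard RSS 2.0 wraps each article in an <item>
        rows.append({
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "published": item.findtext("pubDate"),
        })

    df = pd.DataFrame(rows)
    print(df.head())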

When I went to configure my Python script to send its dataframe to my webpage, I discovered that GitHub Pages is static-only and cannot be configured as a dynamic web application. I researched several other web-hosting platforms, including Heroku, Docker, Render, and PythonAnywhere, uploading my script and attempting to publish my pages on each, until I discovered that they all required Python 3.5 or later while my program had been written in Python 2.7. I upgraded to Python 3.8, attempted to run my scripts locally, and found my lxml-based script entirely broken.

I looked for another library that works with Python 3.5 or later and found BeautifulSoup, which worked much better and more reliably for me than lxml, and then tried to publish again. This time, I found out that I needed an application framework rather than a separate, traditional HTML webpage. After some research, I learned that Python has two main web frameworks: Django and Flask. After looking into both, I found Flask to be more thoroughly documented and the simpler framework to use. I set up Flask, ran my scripts inside a virtual environment, and built a minimal web app to display my results. I was able to display all 30 of the news publishers I tried, but pruned the web app down to 10 publishers, then pushed my code to my GitHub repo, linked the repo to my PythonAnywhere account, and published the web app.
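Here is a minimal sketch of what that Flask + BeautifulSoup setup looks like, with placeholder feed URLs; the choice of BeautifulSoup's XML parser (which itself relies on the lxml package) is an assumption, and my actual app adds the dataframe and table views described in the next section.

    # Minimal sketch of a Flask app that scrapes RSS feeds with BeautifulSoup.
    # Feed URLs are placeholders; the real app pulls from a configurable publisher list.
    import requests
    from bs4 import BeautifulSoup
    from flask import Flask

    app = Flask(__name__)

    FEEDS = ["https://example.com/rss", "https://example.org/feed"]  # hypothetical

    def scrape_feed(url):
        """Fetch one RSS page and pull out each article's title, link, and timestamp."""
        response = requests.get(url, timeout=10)
        soup = BeautifulSoup(response.content, "xml")  # XML parser; needs lxml installed
        articles = []
        for item in soup.find_all("item"):
            articles.append({
                "title": item.title.get_text(strip=True) if item.title else "",
                "link": item.link.get_text(strip=True) if item.link else "",
                "published": item.pubDate.get_text(strip=True) if item.pubDate else "",
            })
        return articles

    @app.route("/")
    def index():
        rows = []
        for url in FEEDS:
            for article in scrape_feed(url):
                rows.append(f'<li><a href="{article["link"]}">{article["title"]}</a> '
                            f'({article["published"]})</li>')
        return "<ul>" + "".join(rows) + "</ul>"

    if __name__ == "__main__":
        app.run(debug=True)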


Current status

I currently have a running web-scraping RSS feed web app, located here, which is only able to display results from two news publishers. I believe the other publishers fail to connect when my script initiates requests because of cross-origin and anti-scraping restrictions similar to the CORS issue I ran into earlier. My web app has a link to view the dataframe scraped from the publishers' RSS pages, and another link to view a formatted, Flask-generated HTML table with hyperlinks to the scraped articles, showing their titles and publishers' names sorted in descending chronological order. The app also has links to my GitHub repo and this project report.
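The snippet below illustrates one way to get from the scraped dataframe to that kind of sorted, hyperlinked table (using pandas' to_html here for brevity; the column names are illustrative assumptions rather than my exact schema).

    # Illustrative sketch: sort scraped articles newest-first and render an HTML table
    # with clickable titles (column names are assumptions, not the app's exact schema).
    import pandas as pd

    def render_table(df: pd.DataFrame) -> str:
        df = df.copy()
        df["published"] = pd.to_datetime(df["published"], errors="coerce", utc=True)
        df = df.sort_values("published", ascending=False)
        df["title"] = [
            f'<a href="{link}">{title}</a>'
            for link, title in zip(df["link"], df["title"])
        ]
        # escape=False keeps the <a> tags as working hyperlinks in the rendered table
        return df[["title", "publisher", "published"]].to_html(escape=False, index=False)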


Conclusions

Measured against my core goals and deliverables at the beginning of the project, I was unable to host my web app on WordPress, but substituted PythonAnywhere as the hosting service. I was not able to consume the feeds through standard RSS tooling as I had planned, but worked around it by scraping the raw feed pages directly. I wasn't able to demonstrate data cleaning by eliminating redundant articles, but the minimal nature of my app, with only two publishers and only the most recent articles pulled from each, largely removes the possibility of redundant articles. I was able to sort the articles chronologically by publication timestamp and to provide reference hyperlinks directly to the articles. Overall, I'd call the project a minimal success in terms of the initial goals.

What I learned by doing the project I consider a great success for my professional development as a junior software developer. I learned how to do a lot of research, more than in any of my standard classes. I reinforced skills in finding alternative ways to accomplish goals when intermediate goals proved unachievable, either technically or due to time constraints. I learned a lot more about Python, its libraries and versions, and its web frameworks. I also enjoyed the project quite a bit and plan to go back, learn more about Flask applications and containers, and build a fully functional application to add to my skillset and professional portfolio.


My live app:

My code:

Referenced materials: