Back to train(ing)s!

I have found a bit more inspiration to work with MongoDB and I have also returned to the train data. But instead of continuing with the train schedules, I looked at the historical data, as I had one question on my mind: is the S9 train always late?

Well, it seems that it has been late, and when you tweak the queries long enough and plot the data just right, you can see a “message” from the train operator 😀

[Figure: scheduled vs. actual running times for the S9 (delays over 15 minutes), plotted in Excel]

The above graph shows the differences between the scheduled and actual running times (when the train was more than 15 minutes late) for the S9 arriving at my home city. The data is from June 2016 to mid-March 2017 and, unless I’m mistaken, it looks like the data is giving me the finger 🙂 The graph was done in Excel based on the JSON data residing in MongoDB.

I also took my first stab at matplotlib and plotted the difference for the whole data set. At first glance it does support my hypothesis that the S9 was late fairly often. I’ll try to work with matplotlib more and produce more graphs, and also work with pandas to analyse the data a bit further. But it’s good to have some validation for my gut feeling.

[Figure: matplotlib plot of the delay differences for the whole data set]
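For anyone curious, something along these lines is roughly what the plotting boils down to. This is a minimal sketch only: the connection details, collection name, query and field names are placeholders for illustration, not my actual schema.

# A minimal sketch: database/collection names, the query and the field names
# below are assumptions for illustration.
from datetime import datetime

import matplotlib.pyplot as plt
from pymongo import MongoClient

collection = MongoClient()["trains"]["history"]        # hypothetical database/collection

delays = []
for doc in collection.find({"trainNumber": 9}):         # hypothetical filter for the S9
    scheduled = datetime.strptime(doc["scheduledTime"], "%Y-%m-%dT%H:%M:%S.%fZ")
    actual = datetime.strptime(doc["actualTime"], "%Y-%m-%dT%H:%M:%S.%fZ")
    delays.append((actual - scheduled).total_seconds() / 60.0)  # difference in minutes

plt.plot(delays)
plt.ylabel("Difference (minutes)")
plt.title("S9: actual vs. scheduled")
plt.show()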

Trainings

As the title hinted, I have continued on my learning path and I’m taking two MongoDB courses at MongoDB University: M101P (MongoDB for Developers, Python) and M102 (MongoDB for DBAs). I already completed M201 (MongoDB Performance) and I must say that the content and the facilities (the videos and the trainers) are spot on.

The main goal is to sit the certifications for M101 and M102 in the summer, and these little side projects help a lot. So, good times at the moment!


It’s (not?) a wonderful world of SAS

In The Fast Show there was a running skit, ‘This week…’. Well, this week I’ve been mostly programming with SAS Base. And I’m not too impressed, the reason being that it’s being used as an ETL tool. Really. The data warehouse is a legacy system which has been growing over the years, so it would be a rather tedious project to rebuild it. But still…

The good thing about it is that I get to learn SAS Base, and I’m happy to learn it. Let’s see how this goes; I’ll try to summarise my feelings in the next couple of weeks, as I’m not really sure how to feel about this. As you may have noticed 🙂

edit. A couple more weeks later I still find the use of SAS (Base) as an ETL tool ridiculous, but I have also learned a lot and it’s not such a black box as it was two weeks ago. So not my cup of tea, but I’ll take this project as a learning experience. It also shows how much Python I have learned, as I tend to compare everything against Python and wish we were programming in Python rather than in SAS. Not that it would really make sense to use Python as an ETL tool either.

Data migration

As I haven’t had too much time to focus on the cool, non-work-related stuff, I figured I could share my view of the task at hand: data migration.

We have spent a lot of time waiting on the data, as it’s not in our or our customer’s hands when it arrives. We eventually received some test data and I drafted a few points for the team to work on.

  1. When the data comes from outside the organization, the first thing to ask for is the metadata. If the data is to be inserted into a database, the first question should be: how is the data structured? That means the fields, their lengths and types, and the amount of data. Once you know those things you can start preparing the database and even create your own mock-up data (there’s a small sketch after this list).
  2. The second task is the data mapping. That always requires business knowledge, but usually the destination data structure is known, so you can do a first pass of the mapping without the business if they are not available. Get this right before you continue with the actual development; it’s always costly to fix things later.
  3. Be aware of the lookup or reference data. That data should be maintained centrally, in an MDM system or similar, as it helps when the migration becomes a continuous process.
  4. Separate the initial/staging data from the work-in-progress data and the result data. Preferably keep the different stages of data in different databases. This is also a security issue, as you don’t want to expose the initial or interim data to end users or other systems.
  5. Test, test and test. Unit test small enough data flow tasks, and also test any components which transform or filter the data.
  6. Performance test the data flows with as complete data sets as possible. What works fast with 5% of the data might collapse when you run the full data set. Again, getting this right as soon as possible is better, faster and cheaper than trying to fix performance issues in production.
  7. Create data quality metrics and reporting. Even though you might not own the data, it’s your responsibility to address data quality issues by letting the data owners or source data extractors know about them. Clear and transparent DQ metrics will help the whole organization, even though some people might frown and think of it as “name and shame”. In a data-driven company the metadata is always appreciated and should guide the decisions, not be the news itself.
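To make points 1 and 7 a bit more concrete, here is a minimal sketch of checking incoming records against an agreed metadata spec and deriving a simple DQ metric from the result. The field names and limits are made up for illustration.

# A sketch only: the expected fields, types and lengths below are hypothetical.
EXPECTED_FIELDS = {
    "customer_id": (int, None),       # (expected type, max length)
    "customer_name": (str, 100),
    "country_code": (str, 2),
}

def validate_record(record):
    """Return a list of issues found in one incoming record."""
    issues = []
    for field, (expected_type, max_length) in EXPECTED_FIELDS.items():
        if field not in record:
            issues.append("missing field: %s" % field)
            continue
        value = record[field]
        if not isinstance(value, expected_type):
            issues.append("wrong type for %s: %r" % (field, type(value)))
        elif max_length is not None and len(value) > max_length:
            issues.append("value too long for %s" % field)
    return issues

# Simple DQ metric: share of records that pass the checks.
records = [{"customer_id": 1, "customer_name": "Acme", "country_code": "FI"}]
failed = [r for r in records if validate_record(r)]
print("DQ pass rate: %.1f%%" % (100.0 * (len(records) - len(failed)) / len(records)))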

Not sure if you agree, but I find these to be common things which I would plan into any data migration project.

Playing with Mongo

So, since I got JunaTracker working on pythonanywhere.com, I started thinking about performance. Which is not that great at the moment, although this will never really be a public application.

Regardless, I started playing with MongoDB and after some head scratching I managed to get the data import working. The root cause of the head scratching was that I wanted to create a composite index based on the version of the timetable, the train number and the departure date. This is much faster than trying to figure out the updated fields and actually update them. So all fun and sunshine, right?
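Roughly, the index creation looks something like this with pymongo. Treat it as a sketch: the database, collection and field names below are placeholders rather than the real ones.

# A sketch only; database/collection names and the key fields are assumptions.
import pymongo

collection = pymongo.MongoClient()["juna"]["timetables"]

# Unique compound index on version, train number and departure date, so re-sent
# documents are rejected as duplicates instead of being updated field by field.
collection.create_index(
    [("version", pymongo.ASCENDING),
     ("trainNumber", pymongo.ASCENDING),
     ("departureDate", pymongo.ASCENDING)],
    unique=True,
)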

However, the rata API seems to resend some of the old data even when querying with a version number (the idea is that it should only send data AFTER that version), and obviously that caused issues with duplicates. So I did what any Python newbie would do and tried to tackle it with

try:
    db.insert(data)
except pymongo.errors.DuplicateKeyError:
    pass

Obviously that only works up until the first error: the whole insert stops at the first duplicate, and the rest of the data never gets written.

But with some thinking and a bit of help from SO, I figured I could wrap the try block in a for loop, so the exception only skips the current iteration and the loop continues with the next item.


for item in data:
    try:
        collection.insert(item)
    except pymongo.errors.DuplicateKeyError:
        pass  # skip documents that violate the unique index

So live and learn. I learned more about exceptions in Python and was also able to control the data flow (insert and query). So perhaps JunaTracker will be migrated to MongoDB in the future.
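As a side note, a newer pymongo also offers insert_many with ordered=False, which attempts the whole batch and only reports the duplicate-key failures at the end. A sketch of that alternative (not what JunaTracker currently does):

# An alternative sketch, not the current JunaTracker code: insert the whole batch
# in one call and swallow the duplicate-key failures afterwards.
from pymongo.errors import BulkWriteError

try:
    collection.insert_many(data, ordered=False)   # collection and data as above
except BulkWriteError:
    pass  # duplicates were skipped, everything else was still inserted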

JunaTracker back on track

I finally made some sort of breakthrough and was able to get the web server working after some difficulties. The answer was: don’t do it yourself! I purchased the “Hacker” developer account from PythonAnywhere.com and was able to get the site working. It’s still under heavy development, but at least it is accessible. And it will be open to the public at some point.

I’ll study the settings there and try to duplicate the results on my Raspi setup. But the main thing is that I can finally continue with the actual development. So patience is a virtue on some occasions. Happy days!

Data Analysis with Python

I’m getting back on track with coding daily, although I still haven’t really revisited JunaTracker. But better to code something else than nothing at all 🙂

Lately my interest has been data analysis with Python, sparked by the fact that I might actually be assigned to an analytics project. So I have taken baby steps in learning data wrangling and basic analytics programming. A huge help and inspiration has been Udacity, as discussed in the Learning – how? post, and in more detail the Intro to Data Analysis course (free, by the way).

A lot of the work would be easier to do in a database, but I’ll grind through and learn the basics with Python. That’s the only way to truly learn to code, and it is fun in a geeky way 😉

So hopefully in the next couple of weeks I will have completed the course and have loads of new skills and ways of thinking in the bag.

I’m still here

I’ve been a bit too quiet, and the main reasons are work and the overall tiredness resulting from having no daylight. I just don’t seem to have the energy to start executing all the ideas I have. For example, I purchased an Explorer HAT Pro for the Raspberry Pi but haven’t really done anything with it, despite all the ideas and example projects I have.

But these first months of the blog have been fairly successful, as I have finally taken the bull by the horns and actually done a lot of the stuff I always wanted to do: found and delivered a Python project (the train timetable app), bought a bunch of RPis and set up a Hadoop cluster, and come up with ideas for future projects. I still have a long path ahead of me, and I’m actually embracing the idea of transforming that path into a meaningful career and a way of living. Which sounds kind of cheesy, but at least I have been able to break the habit of “just working for the money”, and that keeps me going even on deep dark days like these.

I’d like to share some thoughts on the projects I’m going to deliver next year (private projects, but who knows if something spins off into professional work):

  1. learn how to use the sensors and find real-life use cases for them
  2. learn (finally) how to set up nginx or another web server for the Django project
  3. continue to build my Python skill set and start utilising it at work (I already have, on a small scale)
  4. start experimenting with electronics, first with HATs for the RPis and then perhaps with Arduino
  5. learn and deepen my understanding of clustered computing in databases, using the RPi cluster I have