Open Source Initiative

Start your Data Analytics without Big Investment

Many small business owners never start their first data analytics project because of the price tag of Enterprise software. An annual subscription can cost several hundred US dollars just for data visualization for one user, and the total investment could be US$100,000 to US$400,000 for software alone.

 

An organization can start with Open Source and/or freeware tools instead.  The data analytics life cycle should be similar whether you use free or Enterprise tools.  There are several baskets of tools to consider:

  1. Data Integration (ETL tool) – it is possible to start with Apache NiFi or StreamSets (Open Source), or with Talend Community Version (the free version rather than the Enterprise version). I don’t suggest Pentaho because its design is basically similar to that of DataStage and Informatica: it lacks metadata management functions, and it is the most expensive option if you later need Enterprise support.  If you choose an open source ETL tool, you should have a strong technical team to support it.  For a vendor-based Community version like Talend, there is no direct technical support from the vendor.  However, its features should already be part of the Enterprise software, so it is more likely to be a stable software suite.  On the other hand, the functions and features available in a community version are limited.
  2. Business Intelligence Tool – there used to be a full-featured open source Business Intelligence tool, SpagoBI, but it has since become Knowage, with many limitations in its free version. It is possible to use BIRT (more development work needed), Seal Report, or ELK (suitable for log analysis, with a lot of development work expected).  Jaspersoft community edition is a good candidate for starting to build reports.  However, the community edition has no dashboard or ad-hoc analysis.  A more recent tool to consider is Apache Superset.
  3. Database – if you are planning to have a data warehouse or Operational Data Store, a relational database system is a must. There are many open source alternatives to paid RDBMS products, such as MySQL, MariaDB and PostgreSQL.  I would suggest starting with the free version of MariaDB, with the option of subscribing to a maintenance service if needed.  Its functions and features are very powerful and, unlike MySQL, it is not controlled by Oracle.
  4. Hadoop / NoSQL database – for unstructured / semi-structured data, it is better to consider Hadoop and/or a NoSQL database as the data repository. For Hadoop, it is good to take the plain Apache version rather than Cloudera, MapR or Hortonworks.  The reason is not only the cost of support from these vendors; it is also still unclear how Cloudera and Hortonworks will be integrated.  For NoSQL databases, there are many choices, and our choice for data analytics is Apache Cassandra rather than MongoDB.
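
To make the shared life cycle concrete, here is a minimal sketch of the same steps the tool baskets above cover: extract, transform, load into a relational database, and report. It is only an illustration, not a recommendation of these particular tools. It assumes Python's built-in csv and sqlite3 modules as stand-ins for an ETL tool and a database such as MariaDB, and the sample data, table name and function names are all hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical sample data standing in for an export from a source system.
RAW_CSV = """order_id,region,amount
1,North,120.50
2,South,80.00
3,North,35.25
4,East,
"""


def extract(text):
    """Extract: read rows from the CSV source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows with a missing amount and convert types."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # skip incomplete records
        cleaned.append((int(row["order_id"]), row["region"], float(row["amount"])))
    return cleaned


def load(conn, rows):
    """Load: write the cleaned rows into a relational table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders ("
        "order_id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


def report(conn):
    """Report: total sales per region, the kind of query a BI tool would run."""
    cur = conn.execute(
        "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
    )
    return cur.fetchall()


# sqlite3 in-memory database is an assumption used to keep the sketch self-contained.
conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_CSV)))
sales_by_region = report(conn)
print(sales_by_region)
```

In a real project each function would be replaced by the corresponding tool (for example an ETL job, a MariaDB schema and a reporting tool), but the order of the steps stays the same.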


 

BigData Landscape

 

The above is just our opinion, based on experiments carried out in our lab with less than 2 terabytes of test data.  To sum up, software cost is only one part of project implementation; the key factor is the team implementing the data analytics platform.  You may consider hiring consultants, but ongoing maintenance and future development should also be taken into account.  Data analytics should always be a long-term journey aligned with business growth and development.  More and more companies run their business with the support of insights and forecasts.  Take action now.

 

 

Samuel Sum

Data Science Evangelist (CDS, SDi)

Vice President (AS)
