Advanced Data Analytics with Free Tools

In the field of data science, there are lots of great tools that don't cost a penny.  Data analytics never requires high-priced software, and the analytical results should be the same whether you use SAS Enterprise Miner or your own R code.

As a data science consulting company, our teams always use free tools for analytics and development when serving our clients.  It is also much easier to call existing libraries or packages than to code your own.

Let’s start with programming languages.  There are two common languages in data science:

  1. Python
  2. R

Then we will discuss the libraries and packages available for Python and R respectively.  Finally, we will share the free software packages we recommend for running Python and/or R.

 


 

R

R is a statistical programming language that is easy to learn and was developed from an earlier language, “S”.  It is suitable for anyone with no programming background, since its coding logic rides on statistical knowledge.  One of the key advantages of R is that it was designed primarily for statistical computing: most of the key features data scientists or statisticians need are built in.  There are also bunches of extensions (packages) that add extra features.  We will introduce some common and valuable R packages in the coming section.

Python

Python is now one of the most common programming languages in the world and is more than just a data science tool.  It is a general-purpose language – in other words, it can handle anything you have in mind.  With its popularity, more and more libraries have become available for data analytics, which has made it even more popular in recent years.  In the CDS team, all new members now start with Python rather than R, because APIs are easy to create and it fits well with IoT communications.

Another comparative advantage is its general-purpose nature: it is easy to find someone in the market who can code Python.  Finding a capable R programmer is a bit different – they are more likely to come with a strong statistics background.

R Packages

There is an ecosystem of R packages that add functionality on top of the core R language.  Some of the top ones are listed below, with short usage sketches after the list:

 

  1. dplyr is basically used for data manipulation in R and is built around these five verbs (see the first sketch after this list):
  • select certain data columns
  • filter specific rows of data
  • arrange (sort) your data by rows
  • mutate your data frame with new data fields
  • summarise chunks of data
  2. ggplot2 is one of the best libraries for data visualization in R and implements a standardized grammar of graphics (also covered in the first sketch below).
  3. esquisse is another data visualization package, and many professionals describe it as “Tableau in R”. It not only lets you plot bar graphs, curves, scatter plots, and histograms, but also allows you to export the graph or retrieve the code that generates it.
  4. Shiny is a very famous package in R. It aims to help you build interactive web apps on top of data analytics / data manipulation results (a minimal app is sketched after this list).  You can host standalone apps on a webpage or embed them in R Markdown documents or dashboards, and you can extend your Shiny apps with CSS themes, htmlwidgets, and JavaScript actions.
  5. R Markdown facilitates the creation of reports using R. R Markdown documents are text files in which code chunks are interleaved with Markdown text (see the small example after this list).
  6. The mlr package provides a standard set of syntax and functions for using machine learning algorithms in R. Although R has built-in machine learning capabilities, they are cumbersome to use.  mlr provides a simpler interface so you can focus on training the model (sketched after this list).
  7. Bioconductor – if you are in the health industry, you will find this very useful for genomic data. We share this project because we have a strong interest in healthcare analytics.  To install Bioconductor packages, you need to install BiocManager first (see the snippet after this list).  Typical areas and packages include:
  • Graphics: geneplotter, hexbin.
  • Annotation: annotate, AnnBuilder (data packages).
  • Pre-processing Affymetrix oligonucleotide chip data: affy, affycomp, affydata, makecdfenv, vsn.
  • Pre-processing two-color spotted DNA microarray data: limma, marrayClasses, marrayInput, marrayNorm, marrayPlots, marrayTools, vsn.
  • Differential gene expression: edd, genefilter, limma, multtest, ROC.
  • Graphs and networks: graph, RBGL, Rgraphviz.
  • Analysis of SAGE data: SAGElyzer.
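
To make the first two packages concrete, here is a minimal sketch using the built-in mtcars data set (the cut-off and column choices are ours, purely for illustration):

    library(dplyr)
    library(ggplot2)

    # dplyr's five verbs on the built-in mtcars data set
    by_cyl <- mtcars %>%
      select(mpg, cyl, hp) %>%           # select certain columns
      filter(hp > 100) %>%               # filter specific rows
      arrange(desc(mpg)) %>%             # arrange (sort) the rows
      mutate(hp_per_cyl = hp / cyl) %>%  # mutate: add a new field
      group_by(cyl) %>%
      summarise(avg_mpg = mean(mpg))     # summarise chunks of data

    # ggplot2: a standardized bar chart built from the result
    ggplot(by_cyl, aes(x = factor(cyl), y = avg_mpg)) +
      geom_col() +
      labs(x = "Cylinders", y = "Average MPG")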
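
A Shiny app can be as small as a UI definition plus a server function.  This sketch (the widget names are our own) wires a slider to a histogram of the built-in faithful data:

    library(shiny)

    ui <- fluidPage(
      sliderInput("bins", "Number of bins:", min = 5, max = 50, value = 20),
      plotOutput("hist")
    )

    server <- function(input, output) {
      output$hist <- renderPlot({
        # re-drawn automatically whenever the slider moves
        hist(faithful$eruptions, breaks = input$bins,
             main = "Old Faithful eruption times")
      })
    }

    shinyApp(ui = ui, server = server)  # launches the interactive web app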
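
An R Markdown document is just text: the narrative is Markdown, and anything inside an ```{r} chunk is executed and woven into the report.  A toy example (title and content are ours):

    ---
    title: "MPG summary"
    output: html_document
    ---

    A quick look at fuel economy in the built-in data:

    ```{r}
    summary(mtcars$mpg)
    ```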
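
With mlr, the task / learner / train pattern looks roughly like this (a sketch using the iris data and a decision tree learner; any other learner string slots in the same way):

    library(mlr)

    # define the task, pick a learner, train, then evaluate
    task    <- makeClassifTask(data = iris, target = "Species")
    learner <- makeLearner("classif.rpart")
    model   <- train(learner, task)
    preds   <- predict(model, task = task)
    performance(preds, measures = acc)   # accuracy (here, on training data)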
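
Installing Bioconductor packages goes through BiocManager, for example for the limma package listed above:

    install.packages("BiocManager")  # one-off: get the manager from CRAN
    BiocManager::install("limma")    # then install Bioconductor packages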

Python Libraries

  1. Scrapy is one of the best Python libraries for data mining. Scrapy helps you build crawling programs (spider bots) that can dig structured data out of the web – such as URLs or contact details.
  2. BeautifulSoup is another extremely well-known library for web crawling and data scraping. When you want to collect website data, BeautifulSoup can help you scrape it and format it into the structure you need (a short sketch follows this list).
  3. NumPy (Numerical Python) is one of our team’s favourites and a must-have tool for data science activities, basic and advanced alike. The library offers numerous functions for working with n-dimensional arrays and matrices in Python.  It handles arrays that store values of the same data type, which keeps operations simple and performance high (see the NumPy/pandas sketch after this list).
  4. SciPy incorporates modules for linear algebra, integration, optimization, and statistics. Its foundation is NumPy, and it uses NumPy arrays throughout.  SciPy works great for a wide range of scientific programming needs (engineering, mathematics, and science), so it is also popular in many science laboratories.
  5. pandas is a library made to help developers work with relational and labeled data intuitively. It rests on two data structures: a one-dimensional Series (similar to a list) and a two-dimensional DataFrame (similar to a table).  pandas can transform data structures into DataFrame objects, take care of missing data, add and delete columns, impute missing values, and plot histograms or box plots.  It is an absolute necessity for data wrangling, manipulation, and visualization.
  6. Keras is an incredible library for neural networks and deep learning. It is exceptionally clear to use and gives engineers a decent level of extensibility.  The library builds on backends such as Theano or TensorFlow; using Keras is just a matter of calling the library, so you don’t need much knowledge of the underlying TensorFlow framework (see the sketch after this list).
  7. scikit-learn is an industry standard for data science projects in Python. Scikits are a collection of library bundles in the SciPy stack created for specific functionality – for instance, image processing.  scikit-learn wraps SciPy operations to expose the most widely used machine learning algorithms through a simple interface (demonstrated after this list).
  8. PyTorch is a framework ideal for data scientists or researchers who need to perform deep learning tasks without facing the complexity of TensorFlow or other frameworks. The library performs tensor computations with GPU acceleration, and also supports dynamic computational graphs and automatic gradient calculation (see the sketch after this list).  PyTorch descends from Torch, an open-source deep learning library implemented in C with a wrapper in Lua; rebuilt on top of Python, it became PyTorch.
  9. TensorFlow is a famous Python framework that originated from the technology giant Google. It is one of the best neural network tools for tasks like object identification, speech recognition, and many others, and it helps when working with models that need to deal with multiple sources of data.  The library incorporates various layer helpers, including tflearn, tf-slim, and skflow.
  10. Matplotlib is a standard data science library that helps create data visualizations such as two-dimensional charts and diagrams (histograms, scatter plots, non-Cartesian coordinate plots). Matplotlib is one of the most helpful plotting libraries in data science work, and it also provides an object-oriented API for embedding plots into applications.
  11. Seaborn is built on top of Matplotlib and acts as a useful Python tool for visualizing statistical models – heatmaps and other kinds of charts that summarize data and show trends. With this library you benefit from a broad range of visualizations, including time series, joint plots, and violin plots (a Matplotlib/Seaborn sketch follows this list).
  12. Plotly works very well in interactive web applications. The library is under rapid development, with new graphics and features supporting multiple linked views, animation, and crosstalk integration.
  13. Pydot generates both directed and undirected graphs. It essentially serves as a Python interface to Graphviz (a graph-drawing tool written in C).  Since Graphviz is difficult to use directly without the Pydot interface, this library helps developers a great deal, for example when drawing neural network structures.
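
As promised above, a minimal BeautifulSoup sketch.  The URL is a placeholder, and which tags matter is an assumption about the target page:

    import requests
    from bs4 import BeautifulSoup

    # fetch a page and parse the HTML (example.com is a placeholder)
    html = requests.get("https://example.com").text
    soup = BeautifulSoup(html, "html.parser")

    # pull every link target out of the parsed document
    for a in soup.find_all("a"):
        print(a.get("href"))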
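
NumPy arrays and pandas DataFrames in a few lines – the numbers are made up, and the snippet includes the missing-value handling mentioned above:

    import numpy as np
    import pandas as pd

    # NumPy: same-type n-dimensional arrays with fast vectorized operations
    a = np.array([[1.0, 2.0], [3.0, 4.0]])
    print(a.mean(axis=0))  # column means

    # pandas: a labeled DataFrame with a missing value to impute
    df = pd.DataFrame({"city": ["HK", "SG", "TPE"],
                       "sales": [120.0, np.nan, 95.0]})
    df["sales"] = df["sales"].fillna(df["sales"].mean())  # fill the gap
    df["above_avg"] = df["sales"] > df["sales"].mean()    # add a column
    print(df)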
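
A Keras model really is only a few calls.  This sketch trains a tiny classifier on random dummy data (the layer sizes are arbitrary choices of ours):

    import numpy as np
    from tensorflow import keras

    # a tiny fully-connected network; the backend details stay hidden
    model = keras.Sequential([
        keras.layers.Input(shape=(4,)),
        keras.layers.Dense(16, activation="relu"),
        keras.layers.Dense(3, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

    # dummy data, just to show the training call
    X = np.random.rand(100, 4)
    y = np.random.randint(0, 3, size=100)
    model.fit(X, y, epochs=5, verbose=0)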
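
scikit-learn’s simple interface in practice: load a bundled data set, split it, fit, and score (the choice of classifier is ours):

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42)

    clf = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)
    print(accuracy_score(y_test, clf.predict(X_test)))  # held-out accuracy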
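
PyTorch’s tensors and automatic gradients in a few lines (the function being differentiated is just an example):

    import torch

    # a tensor with gradient tracking enabled
    x = torch.tensor([2.0, 3.0], requires_grad=True)

    # the graph is built dynamically, just by writing ordinary Python
    y = (x ** 2).sum()

    # automatic gradient calculation: dy/dx = 2x
    y.backward()
    print(x.grad)  # tensor([4., 6.])

    # move computation to the GPU with one call, if one is available
    if torch.cuda.is_available():
        x = x.detach().to("cuda")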
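
Finally, Matplotlib and Seaborn side by side – a scatter plot and a correlation heatmap from made-up data:

    import matplotlib.pyplot as plt
    import numpy as np
    import seaborn as sns

    rng = np.random.default_rng(0)
    x = rng.normal(size=200)
    y = 0.5 * x + rng.normal(scale=0.5, size=200)

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ax1.scatter(x, y, s=10)  # Matplotlib: plain scatter plot
    ax1.set(xlabel="x", ylabel="y", title="Scatter")

    sns.heatmap(np.corrcoef([x, y]), annot=True, ax=ax2)  # Seaborn heatmap
    plt.tight_layout()
    plt.show()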

Software

To run Python and R, there are three great software packages we highly recommend.

  1. RStudio – there are several different products from RStudio (the company behind RStudio Desktop). RStudio Desktop is the most popular environment for working with R: it includes a code editor, an R console, notebooks, and tools for plotting, debugging, and more.  RStudio is also a leading force in modern R development (from the free desktop edition to enterprise servers), employing the developers of the tidyverse, shiny, and other important R packages.
  2. Jupyter Notebook – the most popular environment for working with Python for data science. Similar to R Markdown, Jupyter notebooks allow you to combine code, text, and plots in a single document, which makes data work easy. Notebooks can be exported to a number of formats, including HTML and PDF.  Our team uses Jupyter notebooks widely across different projects, and we believe most data analysts and scientists use this tool in real-world applications.
  3. Anaconda is a distribution of Python designed specifically to get the scientific Python tools installed, with most of the common libraries included. Before Anaconda, the only option was to install Python by itself and then install packages like NumPy, pandas, and Matplotlib one by one.  To be honest, it is not easy for anyone to install tons of libraries individually with pip or other tools.

Anaconda includes all of the main packages needed for data science in one easy install, which saves time and lets you get started quickly. It also has Jupyter Notebook built in and makes starting a new data science project easy from its launcher window.  The Spyder IDE is included in the Anaconda package as well, as another way to run and test Python, and it is even possible to use R on top of Anaconda.
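
For a sense of the difference, here is roughly what getting started looks like with conda (a sketch; the environment name “analytics” is our own choice):

    conda create -n analytics numpy pandas matplotlib jupyter
    conda activate analytics
    jupyter notebook   # opens the notebook environment in your browser

One conda create pulls in a consistent set of packages, instead of a separate pip install per library.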

All in all, these free packages and software should be enough for data analysts in most applications.  However, you may still ask whether there is a database / big data repository we recommend, and which ETL tool best fits a given situation.  We will share that later.

 

Written by Samuel Sum, Vice President of AS / CDS / SDI, with his team (2020-06-01)
