5 More Tools Data Scientists Need to Know

Published 9/22/2015

Even the most knowledgeable data scientists can level up their skills. When it comes to analyzing the data you compile, there are a ton of great tools for data scientists that can help you get better insights. We talked to our Data Science instructors and put together a list of five more data science tools that you should learn how to use today.

dedup

dedup is a Python library that uses machine learning to perform de-duplication and entity resolution quickly on structured data.

Data scientists often find themselves wishing they could just run SELECT DISTINCT * FROM my_messy_dataset; and move on. Unfortunately, the real world is a bit more complicated than this. Whether you are aggregating multiple data sources or simply dealing with poor data collection hygiene, you need to de-duplicate your records before meaningful analysis can begin.

As you can imagine, there are a seemingly endless number of ways to merge data together and rules to define what equivalence means for your data. Are two restaurants with the same address the same business? Are two rows with the same first and last name referring to the same person?

Lucky for you, dedup is there to save the day! Based on innovative Computer Science research, dedup uses machine learning (and more specifically Active Learning) to learn what constitutes “sameness” for two possibly ambiguous records by incorporating human feedback. And even better, it has a GUI so that anyone can use it!
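
To make the problem concrete, here is a minimal sketch that uses only Python's standard library. It is not dedup's API. The records, the similarity helper, and the 0.8 threshold are all hypothetical, chosen only to show how exact matching misses a duplicate that a hand-written fuzzy rule catches. A learned approach replaces the hand-tuned rule and threshold with ones inferred from human feedback.

```python
from difflib import SequenceMatcher

# Hypothetical records: the first two describe the same restaurant.
records = [
    {"name": "Joe's Pizza", "address": "123 Main St."},
    {"name": "Joes Pizza", "address": "123 Main Street"},
    {"name": "Blue Door Cafe", "address": "9 Elm Ave."},
]


def similarity(a, b):
    """Return a 0..1 similarity ratio between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def likely_same(r1, r2, threshold=0.8):
    """Hand-written rule: average field similarity at or above a threshold."""
    score = (similarity(r1["name"], r2["name"])
             + similarity(r1["address"], r2["address"])) / 2
    return score >= threshold


# Exact matching (what SELECT DISTINCT does) treats all three rows as distinct.
exact_unique = {(r["name"], r["address"]) for r in records}

# Fuzzy matching flags pairs of row indices that are probably duplicates.
duplicate_pairs = [
    (i, j)
    for i in range(len(records))
    for j in range(i + 1, len(records))
    if likely_same(records[i], records[j])
]

print(len(exact_unique), duplicate_pairs)
```

The threshold here is arbitrary, which is exactly the weakness of hand-written rules: a value that works for restaurant names may fail for people's names.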

Theano

Theano is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently.

Theano features:

  • Tight integration with NumPy – Use numpy.ndarray in Theano-compiled functions.
  • Transparent use of a GPU – Perform data-intensive calculations up to 140x faster than with CPU (float32 only).
  • Efficient symbolic differentiation – Theano does your derivatives for functions with one or many inputs.
  • Speed and stability optimizations – Get the right answer for log(1+x) even when x is really tiny.
  • Dynamic C code generation – Evaluate expressions faster.
  • Extensive unit-testing and self-verification – Detect and diagnose many types of mistake.
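
The log(1+x) point in the list above is easy to see outside Theano. The following sketch uses only Python's math module, not Theano itself, to show why a naive evaluation loses the answer for tiny x and why a numerically stable formulation keeps it. The value of x is an arbitrary choice for illustration.

```python
import math

# x is far smaller than double-precision machine epsilon (about 2.2e-16).
x = 1e-18

# Naive evaluation: 1 + x rounds to exactly 1.0, so the logarithm is 0.0.
naive = math.log(1 + x)

# Stable evaluation: log1p computes log(1 + x) without ever forming 1 + x.
stable = math.log1p(x)

print(naive, stable)
```

A library that rewrites expressions symbolically can make this substitution on your behalf, so you can write the mathematically natural form without hand-picking the stable variant.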

StarCluster

StarCluster is designed to automate and simplify the process of building, configuring, and managing clusters of virtual machines on Amazon’s EC2 cloud. It allows anyone to easily create a cluster computing environment in the cloud suited for distributed and parallel computing applications and systems. This lets you process large amounts of data interactively, adding machines to the cluster as your data grows.

graph-tool

Amidst the proliferation of network and graph analysis libraries for Python, graph-tool shows a ton of promise. Tools like NetworkX and Gephi still have their place in this growing ecosystem, but for anyone who has tried to do non-trivial analysis of a large graph (whether a social network, road network, or biological network), these two standbys tend to fall down.

NetworkX has long been the most popular Python tool for network analysis due to its rich API and low barrier to use, but once you begin to crunch larger graphs its pure Python implementation really starts to show. Gephi is a wonderful graphical tool for interactively visualizing and exploring a new graph, but its cumbersome scripting interface makes it difficult to control programmatically.

Graph-tool tries to incorporate lessons learned from its predecessors and give data scientists the best of all worlds. Implemented in C++ (with capabilities to execute in parallel) and armed with Python bindings for an easy-to-use API, it achieves blazing fast speed without compromising on usability. And to help you make sense of a network, it can not only draw and visualize graphs but also interact with and animate them.

Plotly

Plotly is an interactive graphing library for R, Python, MATLAB, JavaScript, and Excel. Plotly is also a platform for analyzing and sharing data and graphs.

How is Plotly different? As with Google Docs and GitHub, you can collaborate on and control your data, and make files public, private, secret, or shared. The options below and more are available whether you’re using Plotly’s free public cloud, Plotly Offline, or an on-premise deployment.

Here are three ways you can use Plotly in your workflow:

Integrate with other tools for data scientists. Plotly’s R, Python, and MATLAB APIs let you make interactive, updating dashboards and graphs. Plotly integrates with IPython Notebooks, NetworkX, Shiny, ggplot2, matplotlib, pandas, reporting tools, and databases. For example, the plot below was made with ggplot2 and embedded in this blog. Hover your mouse to see data, click and drag to zoom.

[Interactive plot made with ggplot2; legend: Fair, Good, Very Good, Premium, Ideal]

Create interactive maps. Plotly’s graphing library is built on top of D3.js. For geographic data, Plotly supports choropleth, scatter, bubble, subplot, and line maps. You can make maps like the one below with R or Python, then embed them in blogs, apps, and dashboards.

[Interactive map: 2014 Global GDP. Source: CIA World Factbook, https://www.cia.gov/library/publications/the-world-factbook/fields/2195.html]

Build comprehensive visualizations. You can use Plotly for any visualization need: maps, 2D, 3D, and streaming graphs. Click and move your mouse to rotate this plot, hover to see data, or toggle to zoom.

[Interactive plot: Parametric Plot]

Want to learn more?

Galvanize offers an 8-week part-time workshop, as well as a 12-week full-time program in Data Science that teaches you how to make an impact as a contributing member of a data analytics team.
