5 More Tools Data Scientists Need to Know

Even the most knowledgeable data scientists can level up their skills. When it comes to analyzing the data you compile, there are a ton of great tools for data scientists that can help you get better insights. We talked to our Data Science instructors and put together a list of five more data science tools that you should learn how to use today.

dedupe

dedupe is a Python library that uses machine learning to perform de-duplication and entity resolution quickly on structured data.

Data scientists often wish they could just run SELECT DISTINCT * FROM my_messy_dataset; and move on. Unfortunately, the real world is a bit more complicated than that. Whether you are aggregating multiple data sources or simply dealing with bad data collection hygiene, you need to de-duplicate your records before you can begin meaningful analysis.

As you can imagine, there are a seemingly endless number of ways to merge data together and rules to define what equivalence means for your data. Are two restaurants with the same address the same business? Do two rows with the same first and last name refer to the same person?

Lucky for you, dedupe is there to save the day! Based on innovative computer science research, dedupe uses machine learning (more specifically, active learning) to learn what constitutes “sameness” for two possibly ambiguous records by incorporating human feedback. Even better, it has a GUI so that anyone can use it!
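
To see why an exact SELECT DISTINCT falls short, here is a minimal sketch using only Python's standard library. It is not how the library works internally: the sample records, the normalize and likely_same helpers, and the 0.8 threshold are all made up for illustration, and a hand-picked threshold stands in for the rule that active learning would otherwise learn from human feedback.

```python
from difflib import SequenceMatcher

# Made-up sample records: the first two describe the same restaurant.
records = [
    {"name": "Joe's Pizza", "address": "123 Main St."},
    {"name": "Joes Pizza", "address": "123 Main Street"},
    {"name": "Thai Garden", "address": "456 Oak Ave"},
]


def normalize(text):
    """Lowercase, drop periods and apostrophes, collapse whitespace."""
    cleaned = text.lower().replace(".", "").replace("'", "")
    return " ".join(cleaned.split())


def similarity(a, b):
    """Character-level similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def likely_same(r1, r2, threshold=0.8):
    """Hypothetical rule: average field similarity above a fixed threshold."""
    score = (
        similarity(r1["name"], r2["name"])
        + similarity(r1["address"], r2["address"])
    ) / 2
    return score >= threshold


# Exact matching (what SELECT DISTINCT does) sees three different rows.
exact_distinct = {(r["name"], r["address"]) for r in records}
print(len(exact_distinct))

# Fuzzy matching flags the first two rows as probable duplicates.
print(likely_same(records[0], records[1]))
print(likely_same(records[0], records[2]))
```

Choosing the fields to compare and the cutoff by hand is exactly the part that gets tedious on real data, which is where learning the rule from labeled examples pays off.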

Contributed by Jonathan Dinu, VP of Academic Excellence at Galvanize.

Theano

Theano is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently.

Theano features:

  • Tight integration with NumPy – Use numpy.ndarray in Theano-compiled functions.
  • Transparent use of a GPU – Perform data-intensive calculations up to 140x faster than with a CPU (float32 only).
  • Efficient symbolic differentiation – Theano does your derivatives for functions with one or many inputs.
  • Speed and stability optimizations – Get the right answer for log(1+x) even when x is really tiny.
  • Dynamic C code generation – Evaluate expressions faster.
  • Extensive unit-testing and self-verification – Detect and diagnose many types of mistakes.
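
The log(1+x) point is easy to see outside of Theano. The sketch below uses only Python's math module, not Theano itself, to show the numerical problem that this kind of stability optimization is meant to avoid; the value of x is an arbitrary choice for illustration.

```python
import math

x = 1e-20  # tiny value: 1 + x rounds to exactly 1.0 in double precision

# Naive form: the addition discards x before the logarithm is taken.
naive = math.log(1 + x)

# Stable form: log1p computes log(1 + x) without ever forming 1 + x.
stable = math.log1p(x)

print(naive)
print(stable)
```

The naive version silently returns zero, while the stable version keeps the information carried by x. A library that rewrites expressions for you can apply this substitution without you having to remember it.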

Contributed by Mike Tamir.

StarCluster

StarCluster has been designed to automate and simplify the process of building, configuring, and managing clusters of virtual machines on Amazon’s EC2 cloud. StarCluster allows anyone to easily create a cluster computing environment in the cloud suited for distributed and parallel computing applications and systems. This lets you process data interactively at a scale set by the size of the cluster you launch rather than by a single machine.

Contributed by Alessandro Gagliardi, Lead Data Science Instructor at Galvanize.

graph-tool

Amid the proliferation of network and graph analysis libraries for Python, graph-tool shows a ton of promise. Tools like NetworkX and Gephi still have their place in this growing ecosystem, but for anyone who has tried to do non-trivial analysis of a large graph (whether a social network, road network, or biological network), these two standbys tend to fall down.

NetworkX has long been the most popular Python tool for network analysis due to its rich API and low barrier to use, but once you begin to crunch larger graphs, its pure Python implementation really starts to show. And Gephi is a wonderful graphical tool for interactively visualizing and exploring a new graph, but its cumbersome scripting interface makes it difficult to control programmatically.

graph-tool tries to incorporate lessons learned from its predecessors and give data scientists the best of all worlds. Implemented in C++ (with capabilities to execute in parallel) and armed with Python bindings for an easy-to-use API, it achieves blazing fast speed without compromising on usability. And to make sense of a network, it has functionality not only to draw and visualize graphs but also to interact with and animate them.
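
To make the pure Python trade-off concrete, here is a minimal breadth-first search over a small made-up graph stored as a dictionary of adjacency lists. It is an illustrative sketch, not code from NetworkX or graph-tool. Every queue operation and dictionary lookup runs in the Python interpreter, and that per-step overhead is what adds up on graphs with millions of edges.

```python
from collections import deque

# Made-up undirected graph as adjacency lists.
graph = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a", "d"],
    "d": ["b", "c", "e"],
    "e": ["d"],
}


def bfs_distances(adjacency, source):
    """Return the hop count from source to every reachable node."""
    distances = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            # First visit to a node is always along a shortest path.
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    return distances


distances = bfs_distances(graph, "a")
print(distances)
```

A library with a compiled core runs the same loop in native code and only crosses into Python at the API boundary, which is why the interface can stay friendly while the heavy lifting stays fast.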

Contributed by Jonathan Dinu, VP of Academic Excellence at Galvanize.

Plotly

Plotly is an interactive graphing library for R, Python, MATLAB, JavaScript, and Excel. Plotly is also a platform for analyzing and sharing data and graphs.

How is Plotly different? Like Google Docs and GitHub, you can collaborate and control your data; make files public, private, secret, or shared. The options below and more are available if you’re using Plotly’s free public cloud, Plotly Offline, or an on-premise deployment.

Here are three ways you can use Plotly in your workflow:

Integrate with other tools for data scientists. Plotly’s R, Python, and MATLAB APIs let you make interactive, updating dashboards and graphs. Plotly integrates with IPython Notebooks, NetworkX, Shiny, ggplot2, matplotlib, pandas, reporting tools, and databases. For example, the plot below was made with ggplot2 and embedded in this blog. Hover your mouse to see data, click and drag to zoom.

[Interactive plot made with ggplot2; legend: Fair, Good, Very Good, Premium, Ideal]

Create interactive maps. Plotly’s graphing library is built on top of D3.js. For geographic data, Plotly supports choropleth, scatter, bubble, subplot, and line maps. You can make maps like the one below with R or Python, then embed them in blogs, apps, and dashboards.

[Interactive map: 2014 Global GDP. Source: CIA World Factbook, https://www.cia.gov/library/publications/the-world-factbook/fields/2195.html]

Build comprehensive visualizations. You can use Plotly for any visualization need: maps, 2D, 3D, and streaming graphs. Click and move your mouse to rotate this plot, hover to see data, or toggle to zoom.

[Interactive plot: Parametric Plot]
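
Under the hood, a Plotly figure is a declarative description with a list of traces under "data" and display settings under "layout". The sketch below builds such a description for a 3D parametric curve using only Python's standard library. The parametric_helix helper and its parameters are hypothetical names chosen for this example; handing the resulting JSON to one of Plotly's client libraries for rendering is assumed and not shown.

```python
import json
import math


def parametric_helix(n_points=200, turns=4):
    """Sample x = cos(t), y = sin(t), z = t / (2*pi) along a helix."""
    ts = [2 * math.pi * turns * i / (n_points - 1) for i in range(n_points)]
    return {
        "x": [math.cos(t) for t in ts],
        "y": [math.sin(t) for t in ts],
        "z": [t / (2 * math.pi) for t in ts],
    }


coords = parametric_helix()

# One 3D line trace plus a title, in Plotly's data/layout structure.
figure = {
    "data": [{"type": "scatter3d", "mode": "lines", **coords}],
    "layout": {"title": {"text": "Parametric Plot"}},
}

# Serialized form that could be stored, shared, or passed to a renderer.
figure_json = json.dumps(figure)
print(len(figure_json))
```

Because the figure is plain data, the same description can be produced from R, Python, or MATLAB and rendered identically in the browser.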

Contributed by Matt Sundquist, COO & Co-founder of Plotly.

Still hungry for more? Check out 7 Python Tools All Data Scientists Should Know How to Use.
