When we ask the Internet a question, who answers? It’s usually a business—we probably used a commercial search engine, like Google, and got a response from another commercial enterprise. When we ask where to eat dinner, Yelp (a business) answers. When we ask if a new movie is playing, Fandango (another business) answers.
The Internet that most people see every day is a filtered, curated experience maintained by businesses, and it represents only a fraction of the data that is actually out there. What we don’t see is referred to as the deep web: all the information not indexed by traditional search engines. This doesn’t necessarily mean shady or nefarious websites. It could be articles behind paywalls, sites without external links, or sites that are trying to obfuscate their identity.
But within the deep web is what’s known as the dark web. Like a Tortuga of the net, the dark web is home to all manner of illegal and unsavory activity, from trafficking of illegal drugs (remember Silk Road?), weapons, and human labor to patent trolling and child exploitation.
Because of the nature of the deep web—it’s difficult enough to even get connected to it, let alone navigate—law enforcement agencies have difficulty tracking and, ideally, stopping illegal activity. Luckily, data science can help.
Memex is an open-source project, developed in collaboration between DARPA, NASA’s Jet Propulsion Laboratory, and Continuum Analytics, that seeks to advance online search capabilities. Specifically, Memex aims to create a new domain-specific indexing and search paradigm that law enforcement can use to search and index the deep web, the dark web, and nontraditional (e.g., multimedia) content.
“We realized that for government use cases, the normal ‘one-size-fits-all’ web search was not going to work,” said Katrina Riehl, co-principal investigator for the Continuum Analytics Memex team, during a talk at PyData Seattle 2015. “They need to be able to look at deeper levels of the web, to search and index parts of the web that most people aren’t looking at.”
Memex works by combining traditional web-scraping technologies with high-performance computing. It uses machine learning to focus searches and understand what relevant content looks like.
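To make the idea of a focused search concrete, here is a minimal Python sketch of a directed crawl. It is not Memex’s code: the toy in-memory web, the KEYWORDS set, and the relevance function (a simple keyword-overlap score standing in for a trained model) are all assumptions made for illustration. The only point is the ordering: links found on pages that look relevant are explored first.

```python
import heapq

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
TOY_WEB = {
    "seed": ("marketplace index of seller pages", ["a", "b"]),
    "a": ("weather report and sports news", ["c"]),
    "b": ("listing: contact seller, payment by escrow", ["d"]),
    "c": ("more sports news", []),
    "d": ("seller listing with phone and payment details", []),
}

# Assumed vocabulary describing what "relevant content" looks like.
KEYWORDS = {"listing", "seller", "payment"}


def fetch(url):
    """Stand-in for an HTTP fetch: look the page up in the toy web."""
    return TOY_WEB[url]


def relevance(text):
    """Stand-in for a trained model: share of KEYWORDS present in the text."""
    words = set(text.lower().replace(":", " ").replace(",", " ").split())
    return len(KEYWORDS & words) / len(KEYWORDS)


def focused_crawl(seed, fetch, max_pages=10):
    """Visit pages best-first, ranking each link by its parent page's score."""
    frontier = [(-1.0, seed)]  # min-heap, so scores are stored negated
    seen = {seed}
    visited = []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        text, links = fetch(url)
        score = relevance(text)
        visited.append((url, score))
        for link in links:
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score, link))
    return visited


for url, score in focused_crawl("seed", fetch):
    print(f"{url}: relevance {score:.2f}")
```

In this toy run, page d is crawled before page c because d was linked from a page that scored higher. A real directed crawler would replace the keyword score with a model trained on labeled examples.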
“We’re talking about indexing huge amounts of data,” Riehl said. “We’re pointing crawlers across and pulling data from all the nether regions of the internet, putting it into indexes and making it searchable. It’s a huge challenge, so we have to integrate several different disciplines.”
Memex uses databases and database tooling such as MongoDB, MySQL, SQLite, PostgreSQL, and SQLAlchemy, which interact with distributed systems like Hadoop, Spark, Hive, and Impala. The data mining and machine learning components use R and pandas, while the scientific computing components use Numba, Blaze, NumPy, HDF5, and PyTables. Mahout and RHadoop are also used for machine learning, to train the models that drive directed crawling.
Interestingly, the analytics pipeline is actually rather simple, and very similar to what you would find in a traditional commercial search engine. Web crawlers and scrapers are sent out to collect content and extract entities: buying and selling information, phone numbers, addresses, payment information, images, and anything else that might be identifiable. Memex doesn’t bother looking at network information like IP addresses or server details. Those things are ephemeral on the dark web and prove not to be very good identifiers for finding the people involved.
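As a rough illustration of entity extraction, the sketch below pulls phone numbers and email addresses out of scraped page text with regular expressions. The patterns, the extract_entities function, and the choice of entity types are assumptions for this example, not a description of how Memex does it. The phone pattern only covers common US-style formats.

```python
import re

# Assumed patterns: US-style phone numbers and simple email addresses.
PHONE_RE = re.compile(r"\(?\b\d{3}\)?[-.\s]\d{3}[-.\s]\d{4}\b")
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_entities(text):
    """Return de-duplicated phone numbers (digits only) and emails, in order."""
    phones = [re.sub(r"\D", "", match) for match in PHONE_RE.findall(text)]
    emails = EMAIL_RE.findall(text)
    return {
        "phones": list(dict.fromkeys(phones)),  # keeps first-seen order
        "emails": list(dict.fromkeys(emails)),
    }


page = "Call (206) 555-0100 or 206.555.0199, email seller@example.com."
print(extract_entities(page))
```

Normalizing phone numbers to bare digits means the same number written two different ways on two different pages will still match, which is what makes an extracted entity useful as an identifier.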
Another aspect of the Memex project is helping to analyze the gathered data. After all, a pile of data from crawling the dark web is of no use if law enforcement agencies can’t make sense of it. Memex incorporates visualization tools such as Bokeh, as well as topic modeling, an unsupervised procedure that clusters the information in documents into topics that may be relevant to what you’re searching for. In other words, it’s a technique for discovering interesting patterns within a large collection of text.
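The talk does not say which topic-modeling algorithm Memex uses. As one common example, here is a small pure-Python sketch of latent Dirichlet allocation (LDA) fitted with collapsed Gibbs sampling. The function names, the hyperparameters alpha and beta, and the tiny corpus are all assumptions for illustration. In practice you would reach for an established library rather than this loop.

```python
import random
from collections import Counter


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling on a list of tokenized documents.

    Returns (doc_topic, topic_word): per-document topic counts and
    per-topic word counts after the final sampling sweep.
    """
    rng = random.Random(seed)
    vocab_size = len({word for doc in docs for word in doc})
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [Counter() for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Start from a random topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        doc_assignments = []
        for word in doc:
            z = rng.randrange(num_topics)
            doc_assignments.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word] += 1
            topic_total[z] += 1
        assignments.append(doc_assignments)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][word] -= 1
                topic_total[z] -= 1
                # Resample its topic given all the other assignments.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][word] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][word] += 1
                topic_total[z] += 1
    return doc_topic, topic_word


def top_words(topic_word, k, n=3):
    """Most frequent words currently assigned to topic k."""
    return [word for word, count in topic_word[k].most_common(n) if count > 0]


corpus = [
    "seller listing payment escrow seller".split(),
    "payment escrow listing price".split(),
    "weather rain forecast storm".split(),
    "storm forecast rain weather rain".split(),
]
doc_topic, topic_word = lda_gibbs(corpus, num_topics=2)
for k in range(2):
    print(f"topic {k}:", top_words(topic_word, k))
```

Each document ends up with a count of how many of its words were assigned to each topic, and each topic is summarized by its most frequent words. Those summaries are the "interesting patterns" an analyst would then inspect or visualize.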
The Memex Open Catalog is a collection of DARPA’s open-source programs related to the Memex project. Continuum Analytics, for its part, has produced the Memex Explorer, a Django-based web app built upon the Anaconda platform, which provides a front end for exploring domains, running directed crawls, and visualizing information. It uses both the Apache Nutch and NYU ACHE crawlers, extracts metadata via Apache Tika, and then passes content and metadata into Elasticsearch. After indexing, the information is visualized using Kibana or Bokeh.
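The indexing step at the end of that pipeline can be pictured with a toy inverted index. This is only a stand-in for what a search engine like Elasticsearch does at far greater scale and sophistication. The TinyIndex class, the tokenize helper, and the metadata fields are hypothetical names invented for this sketch.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyIndex:
    """Toy inverted index: maps each token to the set of documents containing it."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.docs = {}

    def add(self, doc_id, content, metadata=None):
        """Store a crawled document plus any extracted metadata, then index it."""
        self.docs[doc_id] = {"content": content, "metadata": metadata or {}}
        for token in tokenize(content):
            self.postings[token].add(doc_id)

    def search(self, query):
        """Return the IDs of documents that contain every token in the query."""
        tokens = tokenize(query)
        if not tokens:
            return []
        matches = set.intersection(*(self.postings.get(t, set()) for t in tokens))
        return sorted(matches)


index = TinyIndex()
index.add("page-1", "Seller listing with phone number", {"content_type": "text/html"})
index.add("page-2", "Weather report for the weekend", {"content_type": "text/html"})
print(index.search("seller phone"))
```

Keeping the extracted metadata next to the content is what lets a front end filter or visualize results by type, source, or date once they are indexed.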