My notes on our approach for the project

Introduction

This literature survey outlines the technologies and methods we will use in our project (Crime Analysis in Chicago) to achieve our defined goal: providing resourceful insights about crime trends in the city of Chicago that can lead to a reduction of crime in Chicago or in other cities.

Project Scope

The scope of the project is limited to applying queries to the dataset in order to retrieve the proper amount of data, which helps us answer the following questions:

Since we are working with a huge amount of data and our project requires complicated, chained queries, we need a reliable platform and technologies to analyze and extract the desired information. Therefore we did research to find out, first, which technology best fits our needs and, second, which methods we can use to analyze our data.

The rest of this survey is as follows: Section 1 discusses the available technologies for big data management and our decision to use the Spark framework, and Section 2 explains the methods and tools we are going to use in our project.

1. Technology to use

When we talk about big data analytics frameworks, two names always come first: MapReduce[1] and Spark[2]. These two popular, open-source frameworks hide all the complexity of parallel computing, fault tolerance, load balancing, and the execution model behind a simple programming API. The two frameworks have their own characteristics and benefits, and both run on top of the Hadoop Distributed File System (HDFS)[3]. However, different evaluations show that Spark actually runs faster than MapReduce in many applications, such as regression algorithms and PageRank. The reason behind this difference is in-memory computation and reduced CPU and disk overhead thanks to RDDs[4]. We can compare these two frameworks along different aspects in order to find a suitable system for our project, for instance:

All in all, based on the needs of this project and on the fact that Spark's in-memory computation is much faster and more efficient, we are going to use Spark as our main framework.

2. Methods

To find out which methods will help us efficiently analyze our data, we did research on the most common methods used in query-based data analysis projects. As it turns out, some of the most common methods used in such big projects are data preprocessing, indexing, query processing, and visualization. So we studied the available techniques and tools that provide a platform for these methods. A survey of our study is provided below.

Data Preprocessing

There are many tools for data preprocessing, such as the Stanford Visualization Group's Data Wrangler[5], Python pandas[6], and OpenRefine[7]. These tools are fantastic and can save hours, and there is overlap in their functionality. However, since getting OpenRefine running and starting to use it is incredibly easy and convenient, we used this tool to gain an initial insight into our data. Using OpenRefine, we found that, given our goal for the project, the Chicago Crime dataset requires one of the most important data pre-processing procedures, namely cleaning. Our data needs to be cleaned by:

It is important to note that we used OpenRefine only to get a fast insight into our data; in the project itself we will use Spark to apply the cleaning procedure to our data, as sketched below.
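As a rough illustration, a minimal PySpark cleaning pass might look like the following sketch. The file name and the column names (ID, Date, Primary Type) are assumptions based on the public Chicago Crimes dataset, not our final schema.

```python
# Minimal PySpark cleaning sketch; file and column names are assumptions,
# not the exact schema we will end up using.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("chicago-crime-cleaning").getOrCreate()

raw = spark.read.csv("chicago_crimes.csv", header=True, inferSchema=True)

cleaned = (
    raw
    .dropDuplicates(["ID"])                    # remove duplicate records
    .dropna(subset=["Primary Type", "Date"])   # drop rows missing key fields
    .withColumn("Date", F.to_timestamp("Date", "MM/dd/yyyy hh:mm:ss a"))
    .withColumn("Primary Type", F.upper(F.trim(F.col("Primary Type"))))
)

cleaned.write.mode("overwrite").parquet("crimes_cleaned.parquet")
```

Writing the cleaned result back out as Parquet is our own choice here; it gives later queries a compact, typed copy of the data to work from.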

Indexing

Natively, Apache Spark does not support indexing, since Spark is not a data management system but a fast batch data processing engine. Because it does not own the data it is using, it cannot reliably monitor changes and, as a consequence, cannot maintain indices. However, using the two tweaks below we can accelerate Spark SQL query execution time:

At this stage of the project, due to our limited experience with Spark, we are not yet confident that we will use indexing. However, if we find it useful for our project, we will take advantage of this technique.
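If we do end up tuning query speed, a minimal sketch of two commonly used tricks, partitioned Parquet storage and in-memory caching, might look like the code below. These are our own illustrative choices based on general Spark practice, not necessarily the specific tweaks referenced above, and the column names are assumptions.

```python
# Sketch of two common ways to speed up repeated Spark SQL queries:
# (1) store the data as partitioned Parquet so queries can prune partitions,
# (2) cache a frequently queried table in Spark SQL's columnar in-memory store.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("chicago-crime-query-speed").getOrCreate()

crimes = spark.read.parquet("crimes_cleaned.parquet")  # hypothetical cleaned copy

# (1) Partition by year so queries filtered on Year only scan matching files.
crimes.write.mode("overwrite").partitionBy("Year").parquet("crimes_by_year.parquet")

# (2) Register and cache the working table.
spark.read.parquet("crimes_by_year.parquet").createOrReplaceTempView("crimes")
spark.sql("CACHE TABLE crimes")

spark.sql(
    "SELECT `Primary Type`, COUNT(*) AS n FROM crimes "
    "WHERE Year = 2016 GROUP BY `Primary Type`"
).show()
```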

Spark SQL

Spark SQL intends to integrate relational processing with Spark itself. It builds on the experience of previous efforts such as Shark and introduces two major additions to bridge the gap between relational and procedural processing: the DataFrame API, which provides tight integration between relational and procedural processing by allowing both kinds of operations on multiple data sources, and Catalyst, a highly extensible optimizer that makes it easy to add new data sources and algorithms[10].

Spark SQL uses a nested data model based on Hive[11] and supports all major SQL data types along with complex (e.g., array, map) and user-defined data types. It ships with a schema inference algorithm for JSON and other semistructured data. This algorithm is also used for inferring the schemas of RDDs (Resilient Distributed Datasets) of Python objects. The algorithm attempts to infer a static tree structure of STRUCT types (which may contain basic types, arrays, etc.) in one pass over the data. It starts by finding the most specific Spark SQL type for each record and then merges them using an associative most-specific-supertype function that generalizes the types of each field.
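For illustration, reading semistructured JSON and letting Spark SQL infer the schema looks roughly like this. The file name and the fields shown in the commented output are assumptions, not our actual data.

```python
# Sketch of Spark SQL schema inference on semistructured JSON records.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-inference").getOrCreate()

# Each line of the file is one JSON record; Spark infers a STRUCT schema
# in a single pass over the data.
events = spark.read.json("crimes_sample.json")
events.printSchema()
# Example of what the inferred tree might look like (illustrative):
# root
#  |-- id: long (nullable = true)
#  |-- primary_type: string (nullable = true)
#  |-- location: struct (nullable = true)
#  |    |-- lat: double (nullable = true)
#  |    |-- lon: double (nullable = true)
```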

A DataFrame is a distributed collection of rows with the same schema. It is equivalent to a table in a Relational Database Management System (RDBMS). DataFrames are similar to Spark's native RDDs in that they are evaluated lazily, but unlike RDDs they have a schema. A DataFrame represents a logical plan, and a physical plan is built only when an output function such as save is called. Deferring the execution in this way leaves more room for optimization. Moreover, DataFrames are analyzed to check whether the column names and data types are valid.
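A small sketch of this lazy behavior, assuming a hypothetical cleaned Parquet copy of the dataset and an assumed Primary Type column:

```python
# Transformations only build a logical plan; a physical plan is produced
# and executed when an action runs.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-plans").getOrCreate()
crimes = spark.read.parquet("crimes_cleaned.parquet")

thefts = crimes.filter(crimes["Primary Type"] == "THEFT")  # no job runs yet
thefts.explain(True)   # prints the logical and physical plans, still no execution
thefts.count()         # action: the physical plan is built and executed now
```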

DataFrames support querying with both SQL and a domain-specific language (DSL) that includes all common relational operators such as select, where, join, and groupBy. All these operators build up an abstract syntax tree (AST) of the expression (think of an expression as a column in a table), which is then optimized by Catalyst. Spark SQL can cache data in memory using columnar storage, which is more efficient than Spark's native cache that simply stores data as JVM objects. The DataFrame API supports user-defined functions (UDFs), which can use the full Spark API internally and can be registered easily.
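A minimal sketch of the DSL and a UDF follows; the column names (Arrest, Primary Type, Date) are assumptions about the crime dataset, and the "season" function is a hypothetical helper for illustration.

```python
# DSL operators plus a user-defined function in the DataFrame API.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("dsl-and-udf").getOrCreate()
crimes = spark.read.parquet("crimes_cleaned.parquet")

# Relational operators build an expression tree that Catalyst optimizes.
per_type = (crimes
            .where(F.col("Arrest") == True)
            .groupBy("Primary Type")
            .count()
            .orderBy(F.desc("count")))
per_type.show(10)

# A hypothetical UDF, registered for use from both the DSL and SQL.
to_season = lambda m: "SUMMER" if m in (6, 7, 8) else "OTHER"
season = F.udf(to_season, StringType())
spark.udf.register("season", to_season, StringType())

crimes.withColumn("Season", season(F.month("Date"))).groupBy("Season").count().show()

# Cache the DataFrame in Spark SQL's columnar in-memory format.
crimes.cache()
```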

To query native datasets, Spark SQL creates a logical data scan operator (pointing to the RDD), which is compiled into a physical operator that accesses fields of the native objects in place, extracting only the fields needed for a query. This is better than traditional object-relational mapping (ORM), which translates an entire object into a different format. The ability to query native datasets lets users run optimized relational operations within existing Spark programs. Moreover, it makes it simple to combine RDDs with external structured data.
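A sketch of querying a native RDD of Python objects and combining it with external structured data; the Row fields and the join column ("district") are illustrative assumptions, not the real schema.

```python
# Query a native RDD through Spark SQL and join it with structured data.
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.appName("native-rdd").getOrCreate()

# A small in-memory RDD of Python objects (illustrative values).
rdd = spark.sparkContext.parallelize([
    Row(beat=1121, district="011"),
    Row(beat=1533, district="015"),
])

# Spark SQL infers the RDD's schema and adds a logical scan operator for it.
beats = spark.createDataFrame(rdd)

# Combine the native data with external structured data (assumed to share
# a "district" column with the cleaned crime data).
crimes = spark.read.parquet("crimes_cleaned.parquet")
crimes.join(beats, on="district", how="inner").groupBy("district").count().show()
```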

With all these features, analyzing structured data, executing SQL queries, support for user-defined functions, and the integrability that lets us mix SQL queries with procedural code to run complex queries (which is what we will do in our project), using Spark SQL sounds promising for our project.
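A sketch of the kind of mixed query we expect to write, combining a declarative SQL aggregation with procedural DataFrame operations; the table and column names (Year, Date) are assumptions.

```python
# Mix declarative SQL with procedural DataFrame code on the same data.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("mixed-queries").getOrCreate()

crimes = spark.read.parquet("crimes_cleaned.parquet")
crimes.createOrReplaceTempView("crimes")

# Declarative part: a SQL aggregation over the registered view.
monthly = spark.sql("""
    SELECT Year, MONTH(Date) AS Month, COUNT(*) AS n
    FROM crimes
    GROUP BY Year, MONTH(Date)
""")

# Procedural part: continue with DataFrame operators on the SQL result.
monthly.where(F.col("n") > 1000).orderBy("Year", "Month").show()
```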

Discoveries: Data Visualization

There are many different types of big data visualization tools, and it is essential for a tool to be able to process multiple types of incoming data, apply various filters to adjust results, interact with data sets during analysis, connect to other software to receive incoming data, and provide collaboration options for other users [12]. Data alone has no value, so data visualization tools are used to bring out trends and patterns and make the data collection easy to understand. Currently, the top four data visualization tools that surpass users' expectations are Jupyter, Google Charts, D3.js, and Tableau.

Of the four tools, Jupyter is an open-source project that executes user code in over 40 different programming languages, such as Python and Java, and displays visual representations of the data. Jupyter can also interact with various big data tools like Spark, and it can be used for data cleaning and transformation [13]. Google Charts provides numerous visualization chart types that are compatible with different browsers and platforms, and it allows broad customization of the visualizations users desire through JavaScript embedded into a website [14]. D3.js is a JavaScript library that creates visualizations by binding arbitrary data to the Document Object Model, and it provides efficient manipulation of documents based on data [15]. Lastly, Tableau can be integrated with Hadoop, MySQL, and Amazon AWS to create detailed graphs that surface the value in data and existing investments, and it makes complex big data easy to explore while accelerating performance [16].

After looking over the four visualization tools, and given our limited experience with Spark, we currently plan to use Jupyter as the tool to help visualize our project, since it leverages big data tools like Apache Spark and supports Java as a programming language.
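As a minimal sketch of this workflow (written in Python here for brevity, since PySpark in a Jupyter notebook is a common setup; the column names are assumptions), a notebook cell could collect a small aggregate from Spark and plot it:

```python
# Collect a small Spark aggregate and plot it inside a Jupyter notebook.
import matplotlib.pyplot as plt
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("crime-viz").getOrCreate()
crimes = spark.read.parquet("crimes_cleaned.parquet")

# Aggregate in Spark, then bring only the small result to the driver.
top_types = (crimes.groupBy("Primary Type").count()
             .orderBy("count", ascending=False).limit(10)
             .toPandas())

top_types.plot.barh(x="Primary Type", y="count", legend=False)
plt.xlabel("Number of reported crimes")
plt.tight_layout()
plt.show()
```

The important design point is that only the aggregated result (a handful of rows) is collected to the notebook; the heavy work stays in Spark.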

Resources

[1] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. CACM, 51(1):107–113, 2008.

[2] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In NSDI, 2012.

[4] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation (NSDI '12), USENIX Association, Berkeley, CA, USA, 2012.

[8] Y. Cui et al. Indexing for large scale data querying based on Spark SQL. In The Fourteenth IEEE International Conference on e-Business Engineering, 2007.

[10] M. Armbrust et al. Spark SQL: Relational data processing in Spark. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (SIGMOD '15), 2015.
