In this post, we will be performing analysis on the Uber dataset in Hadoop using MapReduce in Java.
The Uber dataset consists of four columns; they are dispatching_base_number, date, active_vehicles and trips. You can download the dataset from here.
Problem Statement 1:
In this problem statement, we will find the days on which each basement has more trips.
Source Code
Mapper Class:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 |
public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable>{ java.text.SimpleDateFormat format = new java.text.SimpleDateFormat("MM/dd/yyyy"); String[] days ={"Sun","Mon","Tue","Wed","Thu","Fri","Sat"}; private Text basement = new Text(); Date date = null; private int trips; public void map(Object key, Text value, Context context ) throws IOException, InterruptedException { String line = value.toString(); String[] splits = line.split(","); basement.set(splits[0]); try { date = format.parse(splits[1]); } catch (ParseException e) { // TODO Auto-generated catch block e.printStackTrace(); } trips = new Integer(splits[3]); String keys = basement.toString()+ " "+days[date.getDay()]; context.write(new Text(keys), new IntWritable(trips)); } } |
From the Mapper, we will take the combination of the basement and the day of the week as key and the number of trips as value.
First, we will parse the date, which is in string format into date format using SimpleDateFormat class in Java. Now, to take out the day of the date, we will use the getDay() which will return an integer value with the day of the week’s number. So, we have created an array which consists of all the days from Sunday to Monday and have passed the value returned by getDay() into the array in order to get the day of the week.
Now, after this operation, we have returned the combination of Basement_number+Day of the week as key and the number of trips as value.
Reducer Class:
In the reducer, we will calculate the sum of trips for each basement and for each particular day, by using the below lines of code.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 |
public static class IntSumReducer extends Reducer<Text,IntWritable,Text,IntWritable> { private IntWritable result = new IntWritable(); public void reduce(Text key, Iterable<IntWritable> values, Context context ) throws IOException, InterruptedException { int sum = 0; for (IntWritable val : values) { sum += val.get(); } result.set(sum); context.write(key, result); } } |
Whole Source Code:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 |
import java.io.IOException; import java.text.ParseException; import java.util.Date; import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.fs.Path; import org.apache.hadoop.io.IntWritable; import org.apache.hadoop.io.Text; import org.apache.hadoop.mapreduce.Job; import org.apache.hadoop.mapreduce.Mapper; import org.apache.hadoop.mapreduce.Reducer; import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; public class Uber1 { public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable>{ java.text.SimpleDateFormat format = new java.text.SimpleDateFormat("MM/dd/yyyy"); String[] days ={"Sun","Mon","Tue","Wed","Thu","Fri","Sat"}; private Text basement = new Text(); Date date = null; private int trips; public void map(Object key, Text value, Context context ) throws IOException, InterruptedException { String line = value.toString(); String[] splits = line.split(","); basement.set(splits[0]); try { date = format.parse(splits[1]); } catch (ParseException e) { // TODO Auto-generated catch block e.printStackTrace(); } trips = new Integer(splits[3]); String keys = basement.toString()+ " "+days[date.getDay()]; context.write(new Text(keys), new IntWritable(trips)); } } public static class IntSumReducer extends Reducer<Text,IntWritable,Text,IntWritable> { private IntWritable result = new IntWritable(); public void reduce(Text key, Iterable<IntWritable> values, Context context ) throws IOException, InterruptedException { int sum = 0; for (IntWritable val : values) { sum += val.get(); } result.set(sum); context.write(key, result); } } public static void main(String[] args) throws Exception { Configuration conf = new Configuration(); Job job = Job.getInstance(conf, "Uber1"); job.setJarByClass(Uber1.class); job.setMapperClass(TokenizerMapper.class); job.setCombinerClass(IntSumReducer.class); job.setReducerClass(IntSumReducer.class); job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); FileInputFormat.addInputPath(job, new Path(args[0])); FileOutputFormat.setOutputPath(job, new Path(args[1])); System.exit(job.waitForCompletion(true) ? 0 : 1); } } |
Running the Program:
First, we need to build a jar file for the above program and we need to run it as a normal Hadoop program by passing the input dataset and the output file path as shown below.
hadoop jar uber1.jar /uber /user/output1
In the output file directory, a part of the file is created and contains the below output:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 |
B02512 Sat 15026 B02512 Sun 10487 B02512 Thu 15809 B02512 Tue 12041 B02512 Wed 12691 B02598 Fri 93126 B02598 Mon 60882 B02598 Sat 94588 B02598 Sun 66477 B02598 Thu 90333 B02598 Tue 63429 B02598 Wed 71956 B02617 Fri 125067 B02617 Mon 80591 B02617 Sat 127902 B02617 Sun 91722 B02617 Thu 118254 B02617 Tue 86602 B02617 Wed 94887 B02682 Fri 114662 B02682 Mon 74939 B02682 Sat 120283 B02682 Sun 82825 B02682 Thu 106643 B02682 Tue 76905 B02682 Wed 86252 B02764 Fri 326968 B02764 Mon 214116 B02764 Sat 356789 B02764 Sun 249896 B02764 Thu 304200 B02764 Tue 221343 B02764 Wed 241137 B02765 Fri 34934 B02765 Mon 21974 B02765 Sat 36737 B02765 Sun 22536 B02765 Thu 30408 B02765 Tue 22741 B02765 Wed 24340 |
In this problem statement, we will find the days on which each basement has more number of active vehicles.
Source Code
Mapper Class:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 |
public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable>{ java.text.SimpleDateFormat format = new java.text.SimpleDateFormat("MM/dd/yyyy"); String[] days ={"Sun","Mon","Tue","Wed","Thu","Fri","Sat"}; private Text basement = new Text(); Date date = null; private int active_vehicles; public void map(Object key, Text value, Context context ) throws IOException, InterruptedException { String line = value.toString(); String[] splits = line.split(","); basement.set(splits[0]); try { date = format.parse(splits[1]); } catch (ParseException e) { // TODO Auto-generated catch block e.printStackTrace(); } active_vehicles = new Integer(splits[3]); String keys = basement.toString()+ " "+days[date.getDay()]; context.write(new Text(keys), new IntWritable(active_vehicles)); } } |
From the Mapper, we will take the combination of the basement and the day of the week as key and the number of active vehicles as value.
First, we will parse the date which is in string format to date format using SimpleDateFormat class in Java. Now, to take out the day of the date, we will use the getDay(), which will return an integer value with the day of the week’s number. So, we have created an array which consists of all the days from Sunday to Monday and have passed the value returned by getDay(), into the array in order to get the day of the week.
Now, after this operation, we have returned the combination of Basement_number+Day of the week as key and the number of active vehicles as value.
Reducer Class:
Now, in the reducer, we will calculate the sum of active vehicles for each basement and for each particular day, using the below lines of code.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 |
public static class IntSumReducer extends Reducer<Text,IntWritable,Text,IntWritable> { private IntWritable result = new IntWritable(); public void reduce(Text key, Iterable<IntWritable> values, Context context ) throws IOException, InterruptedException { int sum = 0; for (IntWritable val : values) { sum += val.get(); } result.set(sum); context.write(key, result); } } |
Whole Source Code:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 |
import java.io.IOException; import java.text.ParseException; import java.util.Date; import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.fs.Path; import org.apache.hadoop.io.IntWritable; import org.apache.hadoop.io.Text; import org.apache.hadoop.mapreduce.Job; import org.apache.hadoop.mapreduce.Mapper; import org.apache.hadoop.mapreduce.Reducer; import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; public class Uber2 { public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable>{ java.text.SimpleDateFormat format = new java.text.SimpleDateFormat("MM/dd/yyyy"); String[] days ={"Sun","Mon","Tue","Wed","Thu","Fri","Sat"}; private Text basement = new Text(); Date date = null; private int active_vehicles; public void map(Object key, Text value, Context context ) throws IOException, InterruptedException { String line = value.toString(); String[] splits = line.split(","); basement.set(splits[0]); try { date = format.parse(splits[1]); } catch (ParseException e) { // TODO Auto-generated catch block e.printStackTrace(); } active_vehicles = new Integer(splits[2]); String keys = basement.toString()+ " "+days[date.getDay()]; context.write(new Text(keys), new IntWritable(active_vehicles)); } } public static class IntSumReducer extends Reducer<Text,IntWritable,Text,IntWritable> { private IntWritable result = new IntWritable(); public void reduce(Text key, Iterable<IntWritable> values, Context context ) throws IOException, InterruptedException { int sum = 0; for (IntWritable val : values) { sum += val.get(); } result.set(sum); context.write(key, result); } } public static void main(String[] args) throws Exception { Configuration conf = new Configuration(); Job job = Job.getInstance(conf, "Uber1"); job.setJarByClass(Uber2.class); job.setMapperClass(TokenizerMapper.class); job.setCombinerClass(IntSumReducer.class); job.setReducerClass(IntSumReducer.class); job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); FileInputFormat.addInputPath(job, new Path(args[0])); FileOutputFormat.setOutputPath(job, new Path(args[1])); System.exit(job.waitForCompletion(true) ? 0 : 1); } } |
Running the Program:
First, we need to build a jar file for the above program and run it as a normal Hadoop program by passing the input dataset and the output file path as shown below.
hadoop jar uber2.jar /uber /user/output2
In the output file directory, a part file is created and contains the below output:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 |
B02512 Fri 2221 B02512 Mon 1676 B02512 Sat 1931 B02512 Sun 1447 B02512 Thu 2187 B02512 Tue 1763 B02512 Wed 1900 B02598 Fri 9830 B02598 Mon 7252 B02598 Sat 8893 B02598 Sun 7072 B02598 Thu 9782 B02598 Tue 7433 B02598 Wed 8391 B02617 Fri 13261 B02617 Mon 9758 B02617 Sat 12270 B02617 Sun 9894 B02617 Thu 13140 B02617 Tue 10172 B02617 Wed 11263 B02682 Fri 11938 B02682 Mon 8819 B02682 Sat 11141 B02682 Sun 8688 B02682 Thu 11678 B02682 Tue 9075 B02682 Wed 10092 B02764 Fri 36308 B02764 Mon 26927 B02764 Sat 33940 B02764 Sun 26929 B02764 Thu 35387 B02764 Tue 27569 B02764 Wed 30230 B02765 Fri 3893 B02765 Mon 2810 B02765 Sat 3612 B02765 Sun 2566 B02765 Thu 3646 B02765 Tue 2896 B02765 Wed 3152 |
We hope this post has been helpful in understanding the Uber Data Analysis use case using MapReduce. In the case of any queries, feel free to comment below and we will get back to you at the earliest. Keep visiting our site www.acadgild.com for more updates on BigData and other technologies.
Hi
I am getting error when i run above code as error
String keys = basement.toString()+ ” “+days[date.getDay()];
Unparseable date: “date”
@SuppressWarnings(“deprecation”)
Pl support . correct syntax
According to the error message there are two possible scenarios for this one might be the date format which you are using is not matching with the date format in the dataset. If the date format is according to the date in the dataset then some other column is getting parsed by the date. Please check and revert.
i too got same error, removed header line from sample input text provided to get it work.
I think we are not finding correct solution, Problem stated is ”
In this problem statement, we will find the days on which each basement has more trips.” whereas solution provided is just giving sum of each day rather finding ON WHICH DAY maximum trip/active vehichle .
So it justs like another word count example.
Pls correct me