Problem Set 2: Data Wrangling

Note: The due date has been extended to March 17th because the assignment was posted a day late.

Download the data below for this exercise!

Data: evinwa.csv

Background

Automobile manufacturers are shifting toward producing more Electric Vehicles (EVs). EU lawmakers even passed a law that bans the sale of new petrol and diesel cars in the European Union from 2035 to speed up the switch to EVs. EVs are becoming more popular, and more EV options are available nowadays.

In this problem set, you will explore a dataset on EV registrations in Washington state from 2011 to 2023. The variables in this dataset are described below.

Name	Description
`vin`	Vehicle identification number (VIN)
`county`	county name
`city`	city name
`postalcode`	Postal code
`modelyear`	Year in which the car is intended to be launched
`make`	Car manufacturer
`model`	Car model
`evtype`	EV car type (EV or hybrid)
`range`	Maximum driving range on a full charge (miles)

Question 1

Load the data using the read_csv function (loaded together with the tidyverse package) and save it as ev_wa; because read_csv returns a tibble, ev_wa will automatically be a tibble. How many electric vehicles with a model year (modelyear) of 2011 or later are registered in WA?

Also, use dplyr functions to create a table with the total number of registered BEVs and PHEVs by model year, and pass this table to knitr::kable() so that the knitted output shows a nicely formatted table with readable column labels.

Question 2

Create a bar plot showing which county has the most registered EVs across all years.

Question 3

Create a histogram of ranges for BEVs with a model year after 2017 (modelyear>2017) only. Once you have created the histogram, describe its shape in the text and tell the reader whether most BEVs have ranges greater than 200 miles. Also, judging from the plot, which range appears to have the highest count? Is the distribution skewed to the left, skewed to the right, or roughly symmetric?

Question 4

Now, suppose you are in the market for a BEV and are wondering which BEV model has the longest range. Create a bar plot using the ggplot2 package that shows the maximum value of the range for each car model. The x-axis and y-axis should show the maximum range and the car model, respectively. The final figure should order car models by range, with the longest-range model at the top. Furthermore, let’s color the bars by automaker (you can pick your own colors)!

Question 5

We are going to compute the difference in the number of car registrations by model year between US and non-US brands. US brands in the dataset are Tesla, Chevrolet, Chrysler, Cadillac, and Ford (based on my quick Google search). Once you have created the new vector containing this difference, let’s create a bar plot to see the trend by model year. What are the trends? Are US brands being registered more and more?

Tip: Once you count the total number of registrations for US and non-US brands, you can use the spread() function from the tidyr package; we have not covered spread() in class. If you want to follow this route, you can write, for example, spread(usauto, n) (here n is the total number of registrations). This makes calculating the difference between US and non-US brand registrations much easier.
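Since spread() has not been covered in class, here is a minimal sketch of what it does, using a small made-up table rather than the real evinwa.csv data. The names `counts`, `wide`, and `diff`, the labels "US" and "NonUS", and all the numbers are assumptions for illustration only; `usauto` and `n` follow the naming used in the tip.

```r
library(tidyr)

# Toy counts in "long" form: one row per model year and brand group.
# These numbers are invented for illustration, not taken from evinwa.csv.
counts <- data.frame(
  modelyear = c(2020, 2020, 2021, 2021),
  usauto    = c("US", "NonUS", "US", "NonUS"),
  n         = c(5, 3, 8, 6)
)

# spread() turns each distinct value of `usauto` into its own column,
# filled with the matching value of `n`: one row per model year.
wide <- spread(counts, usauto, n)

# With both groups side by side, the difference is a simple subtraction.
wide$diff <- wide$US - wide$NonUS

print(wide)
```

In current versions of tidyr, spread() is superseded by pivot_wider(); the equivalent call for this toy table would be pivot_wider(counts, names_from = usauto, values_from = n). Either function works for this problem.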