Use the given data to detect credit card fraud. There are three tasks associated with this analysis:
- The IMRAD report
- The PL quiz
- The self-reflection

Data files:
- cc-sub.csv
- creditcard.csv.gz: use readr::read_csv("https://stat432.org/data/creditcard.csv.gz") to read the full data directly.

Please refer to the source documentation for information on data collection and a data dictionary. The response was altered:
- 0 is now labeled genuine
- 1 is now labeled fraud
The following code is available to show how the quiz data was created, but please use the .csv linked above. If you choose to use the full data, you will need to run the line below that alters the response from 0 and 1 to genuine and fraud, unless you prefer 0 and 1.
# load packages
library("tidyverse")
library("caret")
library("gbm")
library("ROSE")
# extract file obtained from Kaggle
# https://www.kaggle.com/mlg-ulb/creditcardfraud
untar("creditcardfraud.zip")
# create remote readable compressed file
system("gzip creditcard.csv")
# from gz file
cc = read_csv("creditcard.csv.gz")
# verify data
nrow(cc) == 284807
# make response a factor with names instead of numbers
cc$Class = factor(ifelse(cc$Class == 0, "genuine", "fraud"))
# subset for efficiency and PL
set.seed(42)
sub_idx = sample(nrow(cc), size = 50000)
cc_sub = cc[sub_idx, ]
# write subset to disk
write_csv(cc_sub, "cc-sub.csv")
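After loading either file, a quick sanity check of the response labels and the class balance can be useful. The sketch below is only illustrative: the `toy` data frame and its values are made up so that the example runs without a download, and the real data would instead come from one of the files described above.

```r
# toy stand-in for the real data (hypothetical values, for illustration only)
toy = data.frame(
  Amount = c(10.00, 250.50, 3.75, 99.99, 42.00),
  Class = c(0, 1, 0, 0, 0)
)

# same recoding as in the data-creation code: 0 -> genuine, 1 -> fraud
toy$Class = factor(ifelse(toy$Class == 0, "genuine", "fraud"))

# counts and proportions of each class
class_counts = table(toy$Class)
class_props = prop.table(class_counts)

print(levels(toy$Class))
print(class_counts)
print(class_props)
```

Note that factor() orders levels alphabetically, so fraud becomes the first level. Some modeling functions treat the first level as the event of interest, so it is worth checking the level order before fitting.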
For this analysis, do the following:
Submit a .zip file to Compass that contains:
- A .Rmd file that is your IMRAD.
- A data/ folder which contains heart-disease.csv.
- A .html file that is the result of knitting your .Rmd file.

The zip file should contain no other files. (Whether or not these two files are within another folder does not matter.)
Submit your .zip file to the correct assignment on Compass2g. You are granted an unlimited number of submissions. Only your final submission will be graded.
We assume that your R, R packages, and RStudio are all up-to-date. (Or at least as recent as the versions found on RStudio Cloud.) You’ve been warned.
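One way to see what is currently installed is sketched below. It is a minimal check, not an official requirement: the package list is taken from the library() calls in the data-creation code, and the sketch only reports versions without comparing them to RStudio Cloud.

```r
# R version currently running
r_version = R.version.string
print(r_version)

# packages loaded in the data-creation code
pkgs = c("tidyverse", "caret", "gbm", "ROSE")

# TRUE for each package that is installed and loadable
installed = vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
print(installed)

# installed versions, reported only for packages that were found
versions = vapply(
  pkgs[installed],
  function(p) as.character(packageVersion(p)),
  character(1)
)
print(versions)

# to update packages interactively, one option is:
# update.packages(ask = FALSE)
```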
Your code will be graded based on its style. We don’t expect you to have a mature coding style, so we have a list of rules which must be followed.
The following will be explicitly checked for in your code:
- Infix operators (==, +, -, <-, etc.) should always be surrounded by spaces.
  - Exceptions: :, ::, $, [, [[, ], ]]
  - ^: Use x ^ 2 instead of x^2.
- Use <- or =, not both.
  - If you use the <- operator, you will need to replace the = operator in the given code.
- Do not use T or F.
- Do not use ;.
- Do not use the attach() function.

The following are suggested, but will not be directly assessed:
- Do not use ., or capital letters in variable and function names.
- Use for (i in 1:10), not for(i in 1:10).
- Use mean(x), not mean (x).
predict()
function.)
Much of this is derived from the tidyverse style guide. If you follow the tidyverse guide, be aware of our use of ^ and =.
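As an illustration, a short snippet written to follow these rules might look like the sketch below. The variable names and values are made up for the example, and this is one reading of the rules rather than an official template.

```r
# assignment uses = everywhere (choose = or <-, never both)
x = c(1.5, 2.0, 3.5)
y = c(1.0, 2.5, 3.0)

# infix operators are surrounded by spaces, including ^
squared_diff = (x - y) ^ 2

# no spaces around :, ::, $, [, or [[
results = list(values = squared_diff)
first_two = results$values[1:2]
avg = base::mean(results[["values"]])

# TRUE and FALSE are written out, never T or F
any_large = any(squared_diff > 1, na.rm = TRUE)

# space after for, no space between a function name and (
total = 0
for (i in seq_along(squared_diff)) {
  total = total + squared_diff[i]
}

print(avg)
print(total)
```

Names such as squared_diff use lowercase letters and underscores, which matches the suggestion to avoid . and capital letters.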
There will be a PL quiz associated with this analysis to check some of the “objective” numeric results of your analysis.
After submission of the analysis, an example “solution” will be released. In addition, a set of reflection questions will be released. By comparing your submitted analysis to the “solution” together with the reflection questions, you will write a short self-assessment of your analysis.
This analysis is worth a total of 10 points.
.Rmd and .html (4 points)
- .Rmd (0 - 1 - 2)
- .html (0 - 1 - 2)
Failure to submit the correct files will result in 0 points for the IMRAD.
Quiz grading will be similar to regular quizzes.
Grading of the self reflection will largely be based on completion. A template will be provided after submission of the analysis.
The late policy will apply to each individual task. See above for due dates.
Late submissions for both will be accepted up to 48 hours after the initial deadline.
If you submit multiple attempts, the final attempt will be graded. If your first submission is on time, but your final submission is late, you will incur the late submission penalty.