45580

How to use R to check data consistency (make sure no contradiction between case and value)?

Let's say I have:

Person Movie Rating Sally Titanic 4 Bill Titanic 4 Rob Titanic 4 Sue Cars 8 Alex Cars **9** Bob Cars 8

As you can see, there is a contradiction for Alex. All the same movies should have the same ranking, but there was a data error entry for Alex. How can I use R to solve this? I've been thinking about it for a while, but I can't figure it out. Do I have to just do it manually in excel or something? Is there a command on R that will return all the cases where there are data contradictions between two columns?

Perhaps I could have R do a boolean check if all the Movie cases match the first rating of its first iteration? For all that returns "no," I can go look at it manually? How would I write this function?

Thanks

Answer1:

Here's a data.table solution

Define the function

Myfunc <- function(x) { temp <- table(x) names(temp)[which.max(temp)] } library(data.table)

Create a column with the correct rating (by reference)

setDT(df)[, CorrectRating := Myfunc(Rating), Movie][] # Person Movie Rating CorrectRating # 1: Sally Titanic 4 4 # 2: Bill Titanic 4 4 # 3: Rob Titanic 4 4 # 4: Sue Cars 8 8 # 5: Alex Cars 9 8 # 6: Bob Cars 8 8

Or If you want to remove the "bad" ratings

df[Rating == CorrectRating][] # Person Movie Rating CorrectRating # 1: Sally Titanic 4 4 # 2: Bill Titanic 4 4 # 3: Rob Titanic 4 4 # 4: Sue Cars 8 8 # 5: Bob Cars 8 8

Answer2:

It looks like, within each group defined by "Movie", you're looking for any instances of Rating that are not the same as the most common value.

You can solve this using dplyr (which is good at "group by one column, then perform an operation within each group), along with the "Mode" function defined in this answer that finds the most common item in a vector:

Mode <- function(x) { ux <- unique(x) ux[which.max(tabulate(match(x, ux)))] } library(dplyr) dat %>% group_by(Movie) %>% filter(Rating != Mode(Rating))

This finds all the cases where a row does not agree with the rest of the group. If you instead want to remove them, you can do:

newdat <- dat %>% group_by(Movie) %>% filter(Rating == Mode(Rating))

If you want to fix them, do

newdat <- dat %>% group_by(Movie) %>% mutate(Rating = Mode(Rating))

You can test the above with a reproducible version of your data:

dat <- data.frame(Person = c("Sally", "Bill", "Rob", "Sue", "Alex", "Bob"), Movie = rep(c("Titanic", "Cars"), each = 3), Rating = c(4, 4, 4, 8, 9, 8))

Answer3:

If the goal is to see if all the values within a group are the same (or if there are some differences) then this can be a simple application of tapply (or aggregate, etc.) used with a function like var (or compute the range). If all the values are the same then the variance and range will be 0. If it is any other value (outside of rounding error) then there must be a value that is different. The which function can help identify the group/individual.

tapply(dat$Rating, dat$Movie, FUN=var) which(.Last.value > 0.00001) tapply(dat$Rating, dat$Movie, FUN=function(x)diff(range(x))) which(.Last.value != 0) which( abs(dat$Rating - ave(dat$Rating, dat$Movie)) > 0) which.max( abs(dat$Rating - ave(dat$Rating, dat$Movie)) ) dat[.Last.value,]

Answer4:

I would add a variable for mode so I can see if there is anything weird going on with the data, like missing data, text, many different answers instead of the rare anomaly,etc. I used "x" as your dataset

# one of many functions to find mode, could use any other modefunc <- function(x){ names(table(x))[table(x)==max(table(x))] } # add variable for mode split by Movie x$mode <- ave(x = x$Rating,x$Movie,FUN = modefunc) # do whatever you want with the records that are different x[x$Rating != x$mode, ]

If you want another function for mode, try other functions for mode

Recommend

  • NSIS decompiler
  • How to 'create temp table as select' in Slick?
  • using System.Speech.Synthesis with Windows10 universal app (XAML-C#)
  • How load cv2.KeyPoint and Descriptors correctly on OpenCV 3 with Python 3?
  • GAE: Way to get reference to an HttpSession from its ID?
  • Spring boot 2.0.0.M4 required a bean named 'entityManagerFactory' that could not be found
  • Web.config system.webserver errors
  • RxJava debounce by arbitrary value
  • What is the purpose of TaskExecutor in spring?
  • D3 get axis values on zoom event
  • C: Incompatible pointer type initializing
  • Groovy: Unexpected token “:”
  • Installed module is empty
  • Combining SpatialPolygonsDataFrame of two neighbour countries
  • Replace value with Factor in r data.table
  • Paperclip, set path outside of rails root folder
  • How to access EntityManager inside Entity class in EJB3
  • Repeat a vertical line on every page in Report Builder / SSRS
  • Email format validation in mvc3 view
  • C# - Is there a limit to the size of an httpWebRequest stream?
  • How to add date and time under each post in guestbook in google app engine
  • Does CUDA 5 support STL or THRUST inside the device code?
  • How to get next/previous record number?
  • Function pointer “assignment from incompatible pointer type” only when using vararg ellipsis
  • Why winpcap requires both .lib and .dll to run?
  • 0x202A in filename: Why?
  • VB.net deserialize, JSON Conversion from type 'Dictionary(Of String,Object)' to type '
  • Comma separated Values
  • Error creating VM instance in Google Compute Engine
  • Why can't I rebase on to an ancestor of source changesets if on a different branch?
  • Hits per day in Google Big Query
  • coudnt use logback because of log4j
  • how does django model after text[] in postgresql [duplicate]
  • File not found error Google Drive API
  • reshape alternating columns in less time and using less memory
  • Binding checkboxes to object values in AngularJs
  • Unable to use reactive element in my shiny app
  • Net Present Value in Excel for Grouped Recurring CF
  • jQuery Masonry / Isotope and fluid images: Momentary overlap on window resize
  • How to load view controller without button in storyboard?