Introduction

The challenge for this month’s webinar consisted of two parts:

  1. Using data visualisation to discover which site contains fabricated data.
  2. Using data visualisation to influence the study team to conduct further investigation.

The challenge brief highlighted some properties often found in fabricated data. For the purpose of this submission, attention is restricted to the following three properties:

  1. Visit Regularity: Visit dates are often more variable in real data.
  2. Missing Data: Fraudsters often underestimate missingness.
  3. Digit Preference: Fraudsters often favour round numbers.

“Quick and Dirty” Plots

For the first part of the submission, the focus is on the first part of the challenge - i.e., discovering which site contains fabricated data. To begin with, I am not worrying too much about the aesthetics, and simply producing some very quick plots to help identify the site of interest.

Visit Regularity

To begin with, I investigate the distribution of analysis days by visit and site. There is some double counting in the below, as more than one record exists per visit (i.e., different parameters assessed during each visit), but for this very quick first look, I am simply interested in whether any site jumps out.

# Load Data
ADVS <- read_csv("ADVS.csv") %>%
  filter((ABLFL=="Y"|ANL01FL=="Y")&is.na(DTYPE)==TRUE)

# Analysis Day by Visit and Site
ggplot(ADVS, aes(x = ADY, fill = as.factor(AVISITN))) +
  geom_histogram(
    binwidth = 1,
    position = "identity"
  ) +
  facet_wrap(~ SITEID) +
  theme_minimal()

It is clear from this plot that Site 3 appears different from other sites. It appears that for a given analysis visit, all subjects have the same analysis day listed. This is confirmed by tabulating the data:

# Tabulation of Analysis Visit and Analysis Day for Site 3
table(ADVS[ADVS$SITEID=="SITE03",]$AVISIT, ADVS[ADVS$SITEID=="SITE03",]$ADY)
##            
##              1 29 57 85 113 141 169
##   SCREENING 60  0  0  0   0   0   0
##   WEEK 12    0  0  0 60   0   0   0
##   WEEK 16    0  0  0  0  60   0   0
##   WEEK 20    0  0  0  0   0  60   0
##   WEEK 24    0  0  0  0   0   0  60
##   WEEK 4     0 60  0  0   0   0   0
##   WEEK 8     0  0 60  0   0   0   0

This tabulation also indicates that the same amount of data is collected at each visit (i.e., missingness does not vary over time for Site 3). This is investigated further in the next section.

Missing Data

To further investigate the lack of missingness for Site 3, I count how many measurements are recorded at each analysis visit for each site. Again, there is some double counting here due to more than one parameter being assessed at each visit, but I am simply seeing whether any site jumps out.

# Number of Records at Each Visit by Site
ggplot(ADVS, aes(x = as.factor(AVISITN))) +
  geom_bar() +
  facet_wrap(~SITEID) +
  theme_minimal()

Again, Site 3 shows different behaviour from the other sites. Whereas the level of missingness generally increases over time for other sites, for Site 3 the same amount of data is observed at every visit.

Digit Preference

At this point, it seems evident that Site 3 is the site containing fabricated data. As one further check, I see if any digit preference is exhibited for this site by investigating the distribution of measurements for each parameter and site. This is based on a crude pooling of data across visits, again to get a quick idea of whether Site 3 is showing different behaviour.

# Distribution of Measurements for Each Parameter Across Sites
ggplot(ADVS, aes(x = AVAL, fill=PARAM)) +
  geom_bar() +
  facet_wrap(~SITEID) +
  theme_minimal()

Again, Site 3 jumps out here, with measurements that are multiples of 5 being much more frequently observed.

Convincing The Team to Investigate Further

Convinced that Site 3 is the site containing fabricated data, attention now turns to the second part of the challenge: using visualisations to influence the study team to investigate this further. I again focus on the same three properties of the fabricated data.

Visit Regularity

To highlight to the team that for each analysis visit, all subjects have the visit listed on the same analysis day, I plot analysis day by visit and highlight Site 3. This time around, I focus on cleaning the data more to avoid the double counting across parameters:

# Removing Double Counting Across Parameters
df_uniq <- ADVS %>%
  distinct(SITEID, USUBJID, AVISIT, AVISITN, ADY)

# Isolating The Data For Site 3 to Plot in a Different Colour
df_site3  <- df_uniq %>% filter(SITEID == "SITE03")
df_other  <- df_uniq %>% filter(SITEID != "SITE03")

# Plot
ggplot() +
  # Background: other sites
  geom_jitter(data = df_other,
              aes(x = AVISITN, y = ADY),
              colour = "grey70",
              width = 0.2, height = 0,
              alpha = 0.3) +
  
  # Foreground: Site 3
  geom_jitter(data = df_site3,
              aes(x = AVISITN, y = ADY),
              colour = "red",
              width = 0.2, height = 0,
              alpha = 1) +
  
  scale_x_continuous(
    breaks = sort(unique(df_uniq$AVISITN)),
    labels = unique(df_uniq$AVISIT)
  ) +
  labs(
    title = "Analysis Day (ADY) by Visit and Site",
    subtitle = "<span style='color:red;'>Site 3</span> vs <span style='color:grey70;'>Other Sites</span>: Site 03 has identical ADY values within each visit",
    x = "Visit",
    y = "Analysis Day (ADY)"
  ) +
  theme_minimal() +
  theme(plot.subtitle = element_markdown())

Note that the focus of this plot is to highlight the analysis dates for Site 3, and that these do not exhibit the expected level of variability. The plot does not allow us to identify whether any other sites exhibit unusual behaviour in analysis days, as all other sites are grouped together. If we wanted the possibility to investigate analysis days on a site-by-site basis, interactivity could be introduced that allows us to select which site to highlight in red.

Missing Data

Next, attention is turned to convincing the team that Site 3 does not exhibit the anticipated level of missing data over time. For this, heatmaps are produced to display whether a subject does or does not have observed data for each visit. These plots are replicated for both parameters, to reflect that the unusual behaviour of Site 3 is seen for each.

# Function to Generate Plot for a Given Parameter
missplot <- function(paramcd, param) {
  
  # Observed Data
  df_obs <- ADVS %>%
    filter(PARAMCD == paramcd) %>%
    distinct(SITEID, USUBJID, AVISIT, AVISITN) %>%
    mutate(Observed = 1)
  
  # Expected Structure
  subjs <- ADVS %>%
    distinct(SITEID, USUBJID)
  
  visits <- ADVS %>%
    distinct(AVISITN, AVISIT) %>%
    arrange(AVISITN)
  
  # Full Grid
  full_grid <- subjs %>%
    crossing(visits)
  
  # Merge Observed Data
  df_full <- full_grid %>%
    left_join(
      df_obs,
      by = c("SITEID", "USUBJID", "AVISITN", "AVISIT")
    ) %>%
    mutate(Observed = ifelse(is.na(Observed), 0, 1))
  
  # Sorting Subjects Within Site for Given Parameter
  # Based on Number of Non-Missing Visits
  df_full <- df_full %>%
    group_by(SITEID, USUBJID) %>%
    mutate(total_complete = sum(Observed)) %>%
    ungroup() %>%
    arrange(SITEID, desc(total_complete), USUBJID)
  
  # Plot
  ggplot(df_full,
         aes(x = factor(AVISITN, levels = visits$AVISITN),
             y = reorder(USUBJID, -total_complete),
             fill = factor(Observed))) +
    
    geom_tile() +
    
    facet_wrap(~SITEID, scales = "free_y") +
    
    scale_fill_manual(
      values = c("0" = "grey80", "1" = "steelblue"),
      labels = c("Missing", "Observed"),
      name = ""
    ) +
    
    scale_x_discrete(
      labels = visits$AVISIT
    ) +
    
    labs(
      title = paste("Visit Completeness by Site -", param),
      subtitle = "Within each site, each row represents one subject; Site 3 shows no missing visits",
      x = "Visit",
    ) +
    
    theme_minimal() +
    theme(
      strip.text = element_text(face = "bold"),
      axis.text.y = element_blank(),
      axis.ticks.y = element_blank(),
      axis.text.x = element_text(angle = 67.5, hjust = 1),
      axis.title.y = element_blank(),
    )
}

# DIABP plot
missplot(
  paramcd = "DIABP",
  param = "Diastolic Blood Pressure (DIABP)"
)

# SYSBP plot
missplot(
  paramcd = "SYSBP",
  param = "Systolic Blood Pressure (SYSBP)"
)

Digit Preference

Finally, as a final way to try and convince the team to investigate Site 3 further, I produce cleaner versions of the histograms of the measurement scores for each parameter.

# Grouping All Other Sites
DP_plot <- ADVS %>%
  filter(PARAMCD %in% c("SYSBP", "DIABP")) %>%
  mutate(SiteGroup = ifelse(SITEID == "SITE03", "Site 3", "Other Sites"))

# Plot
ggplot(DP_plot, aes(x = AVAL, fill = SiteGroup)) +
  geom_histogram(binwidth = 1, position = "identity", alpha = 0.6) +
  facet_wrap(~ PARAM, scales = "free") +
  scale_fill_manual(values = c("Other Sites" = "grey70", "Site 3" = "red")) +
  labs(
    title = "Distribution of Measurements by Parameter",
    subtitle = "Site 3 shows clustering at multiples of 5",
    x = "Measurement",
    y = "Frequency",
    fill = ""
  ) +
  theme_minimal()

Here, the intention is to highlight the clustering at multiples of 5 for Site 3. The other sites are all grouped together, giving an idea of the distribution of typical scores based on a crude pooling of measurements. This does not allow us to isolate the distributions of scores in the individual sites; separate distributions per site could be displayed if this was of interest.