Your Survey Data Is Probably Not Ready to Analyse

Here is what to do before you run a single test — and why each step matters


You have collected your data. The export is sitting in a spreadsheet. You are ready to run your correlations and write up your findings. Not quite.

Every result you produce is only as defensible as the data underneath it. And in our experience working with postgraduate researchers, most raw survey datasets have problems — not because the researcher did anything wrong, but because survey platforms do not clean data for you.


This post walks through the preparation workflow we used for a Masters-level study on Perceived Supervisor Support, Burnout, and Intention to Leave among 350 public sector employees in Gauteng. We explain each step plainly: what it is, why it was done, and what happens to your results if you skip it. The full Python notebook and dataset are on our GitHub.


Step 1 — Clean the Data First

Our raw dataset had 353 responses. Before touching a single statistic, we worked through eight cleaning checks. Here is what we found and why each one mattered:

  • Duplicates: 3 respondents submitted twice. This is not just an inconvenience. A duplicate row adds perfectly correlated variance to every item pair simultaneously, artificially inflating Cronbach's alpha and making your scale appear more reliable than it genuinely is. Detection used two methods: platform-flagged respondent IDs, and exact response-pattern matching across all 16 items; the pattern matching caught a further 2 unmarked Likert-scale duplicates, which were also removed. Possible causes include double submission, a system glitch, copy-pasting, or bot responses.

  • Out-of-range values: 5 cells contained impossible values — a 6, a 0, a 7, an 8, and a −1 on a 1–5 scale. Set to missing. There is no defensible way to infer what the respondent intended.

  • Inconsistent labels: 4 gender entries used inconsistent formats. Uncorrected, any group comparison silently produces wrong frequency counts.

  • Missing data: We ran Little's MCAR test using pyampute before deciding how to handle missing items. The test was non-significant, consistent with missingness being completely at random, and the overall non-response rate was 0.33%. At that level, imputation adds methodological assumptions without any analytical benefit. We applied listwise deletion, which is unbiased under MCAR and requires no defence (Schafer & Graham, 2002).

  • Straightlining: 3 respondents answered every item identically, producing a within-respondent standard deviation of zero. The reason for removal is mathematical, not behavioural. A constant row has undefined correlations; you cannot compute Pearson's r or Spearman's rho when one variable has no variance. It also distorts Cronbach's alpha by suppressing item variance without reducing covariance. Whether these respondents genuinely experienced uniform conditions or were disengaged is beside the point; their data is statistically unusable.

  • Central tendency bias: 2 respondents chose the midpoint for more than 85% of items. At that level, the pattern reflects avoidance, not neutral opinion, and introduces the same variance suppression problem as straightlining.

  • Acquiescence bias — flagged, not removed: We identified 2 respondents who scored above 4.0 on both PSS and Burnout simultaneously. This pattern is statistically rare given the strong negative correlation between these constructs, but it is not impossible. A person can have a genuinely supportive supervisor and still experience burnout due to workload, role demands, or organisational pressures entirely outside that relationship. Without a marker variable or a social desirability scale, we cannot distinguish genuine experience from response bias. These respondents were flagged and retained. Common method bias is addressed at the dataset level in the validity section using Harman's single-factor test.

  • Blank respondents: no respondents completed demographics only. None were removed.


After cleaning: 320 valid responses. A retention rate of 90.7%.



Step 2 — Check Your Scales, Not Just Your Data

Many researchers run Cronbach's alpha and move on. Alpha tells you whether items move together, but it cannot tell you whether that consistency is genuine or whether it is being propped up by a couple of dominant items while the others contribute little. We ran five checks, each targeting something the others cannot see:

  • Cronbach's alpha — overall internal consistency. The minimum standard, not the complete picture.

  • Inter-item correlations — are all item pairs contributing uniformly, or is alpha inflated by a few redundant items?

  • Corrected item-total correlations (CITC) — does each individual item genuinely belong in this scale? Below r = 0.30 flags a weak contributor.

  • Alpha-if-item-deleted — would removing any specific item improve reliability? The action step when CITC raises a concern.

  • Split-half reliability (Spearman-Brown) — does consistency hold when the scale is split into two independent halves?
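
Four of these checks are simple enough to compute by hand. A minimal, dependency-free sketch using only pandas — the function names are ours, not a library API:

```python
import numpy as np
import pandas as pd

def cronbach_alpha(df: pd.DataFrame) -> float:
    """Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of the total)."""
    k = df.shape[1]
    return k / (k - 1) * (1 - df.var(ddof=1).sum() / df.sum(axis=1).var(ddof=1))

def citc(df: pd.DataFrame) -> pd.Series:
    """Corrected item-total correlation: each item vs. the sum of the other items."""
    return pd.Series({c: df[c].corr(df.drop(columns=c).sum(axis=1)) for c in df.columns})

def alpha_if_deleted(df: pd.DataFrame) -> pd.Series:
    """Alpha recomputed with each item removed in turn."""
    return pd.Series({c: cronbach_alpha(df.drop(columns=c)) for c in df.columns})

def split_half(df: pd.DataFrame) -> float:
    """Odd-even split with the Spearman-Brown correction 2r / (1 + r)."""
    r = df.iloc[:, ::2].sum(axis=1).corr(df.iloc[:, 1::2].sum(axis=1))
    return 2 * r / (1 + r)
```

In practice, pingouin (Vallat, 2018) provides `pingouin.cronbach_alpha` for the first check; the hand-rolled versions above make the arithmetic explicit.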

 

All three constructs returned excellent results: PSS α = 0.949, Burnout α = 0.939, ITL α = 0.880. All corrected item-total correlations exceeded 0.30. No item improved alpha if removed. High reliability matters for Part 2 — it means observed correlations reflect true construct relationships, not measurement error attenuating them toward zero.
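
To see why, apply Spearman's correction for attenuation: divide the observed correlation by the geometric mean of the two reliabilities. With alphas this high, the correction barely moves the estimate. The −0.50 below is an illustrative figure, not a result from this study:

```python
def disattenuate(r_observed: float, alpha_x: float, alpha_y: float) -> float:
    """Spearman's correction for attenuation: r_true = r_obs / sqrt(alpha_x * alpha_y)."""
    return r_observed / (alpha_x * alpha_y) ** 0.5

# Illustrative observed PSS-Burnout correlation of -0.50 (not a study result)
corrected = disattenuate(-0.50, 0.949, 0.939)   # ≈ -0.53
```

Had the alphas been around 0.60, the same observed correlation would understate the true relationship by roughly a fifth.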


Some output from the reliability analysis.

Step 3 — Validity Is Not the Same as Reliability

A scale can consistently measure the wrong thing. Reliability alone cannot rule that out. We ran four validity checks, each addressing a threat that the reliability analysis cannot detect:

  • Convergent validity — average inter-item correlations confirmed items within each construct converge without being redundant (Clark & Watson, 1995 target: 0.15–0.50).

  • Discriminant validity — inter-construct correlations were all below 0.85, confirming PSS, Burnout, and ITL are empirically distinct.

  • Common method bias (Harman's test) — because all items came from the same self-report survey, variance could be driven by response style rather than real differences. The first unrotated factor explained less than 50% of variance. CMB is not a major concern here, and a marker variable approach in the Part 2 CFA will provide a stronger test.

  • Known-groups validity — Burnout scores differed significantly across tenure groups in the direction theory predicts, confirming the scale responds to real-world variation in the expected way.
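
Harman's test can be approximated by an eigendecomposition of the item correlation matrix, with the leading component's variance share standing in for the first unrotated factor. This is a sketch under that simplification; a proper unrotated exploratory factor analysis would be the stronger implementation:

```python
import numpy as np
import pandas as pd

def harman_single_factor(df: pd.DataFrame) -> float:
    """Share of total variance on the leading component of the correlation matrix.

    A stand-in for the first unrotated factor in Harman's test: a share
    below 0.50 suggests common method bias is not pervasive.
    """
    eigenvalues = np.linalg.eigvalsh(df.corr().to_numpy())  # ascending order
    return eigenvalues[-1] / df.shape[1]
```

On data driven by two distinct constructs the share stays well under 0.50; on data dominated by a single common factor it climbs far above it.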


 

Why This Comes Before Everything Else

The researchers who skip these steps do not always produce obviously wrong results. They produce results that look plausible until someone asks: how do you know your alpha was not inflated by duplicates? How do you know your missing data did not introduce bias? How do you know your correlations are not a common method artefact?

Working through this systematically, and writing down every decision with its rationale, is what makes a dissertation examinable. Part 2 of this case study applies Confirmatory Factor Analysis and hypothesis testing directly to this clean, validated dataset.




References

Schafer, J. L., & Graham, J. W. (2002). Missing data: Our view of the state of the art. Psychological Methods, 7(2), 147–177.

Clark, L. A., & Watson, D. (1995). Constructing validity: Basic issues in objective scale development. Psychological Assessment, 7(3), 309–319.

Vallat, R. (2018). Pingouin: Statistics in Python. Journal of Open Source Software, 3(31), 1026.

© 2025 Nova Data Analytics. All rights reserved.