Journal Design Policy Forum
African Peace and Conflict Studies (Broader - Interdisciplinary) | 10 April 2021

Replicating the Predictive Modelling of Conflict Onset in South Sudan

A Computational Analysis of Event Data
Abraham Kuol Nyuon (Ph.D)
Conflict Prediction, Methodological Robustness, ACLED Data, Computational Social Science
Replication reveals model sensitivity to temporal data granularity
Spatial autocorrelation handling significantly impacts predictive accuracy
Study provides revised protocols for computational conflict research
Enhanced open-source dataset enables more nuanced future analysis

Abstract

This replication study critically examines the computational methodology and findings of a seminal 2020 study that utilised machine learning to predict localised conflict onset in South Sudan. Using the original study's data sources, including ACLED and V-Dem, we re-implement the random forest classification model, systematically testing its sensitivity to feature engineering, temporal cross-validation, and class imbalance handling. Our analysis reveals significant variance in model performance upon adjusting the temporal granularity of the training data and the treatment of spatial autocorrelation, challenging the reported generalisability of the original predictive framework. The study underscores the critical importance of methodological transparency and robustness in computational conflict studies, offering revised protocols for future research in this domain.

Contributions

This replication study provides a timely, empirical validation of computational models for conflict prediction within the specific context of South Sudan during the 2020-2021 period. It confirms the generalisability of certain algorithmic approaches while identifying critical localised variables that previous models overlooked. The primary scholarly contribution is an enhanced, open-source dataset annotated with these context-specific features, facilitating more nuanced future research. Practically, the work offers evidence-based insights for NGOs and policymakers, demonstrating how tailored data collection can improve the accuracy of early-warning systems in complex, protracted crises.

Introduction

South Sudan’s emergence as an independent state in 2011 was swiftly overshadowed by a descent into a protracted and devastating civil conflict, beginning in late 2013 (Mansour et al., 2021). This conflict, characterised by complex inter-ethnic violence, political fragmentation, and regional spill-over, has resulted in profound humanitarian consequences and entrenched instability. Understanding the dynamics and potential precursors to violent outbreaks in such contexts is a paramount concern for both peace and conflict studies and for policymakers engaged in conflict prevention. In recent years, the field has witnessed a significant shift towards computational social science, which leverages large-scale event data and machine learning techniques to develop predictive models of conflict. These approaches promise a more granular, timely, and potentially objective analysis of conflict drivers, moving beyond traditional qualitative assessments. Within this evolving landscape, a seminal study by Hegre et al. applied such computational forecasting to South Sudan, offering an influential model for predicting sub-national conflict onset. This paper presents a replication study of that original work, undertaken to scrutinise its methodological robustness and to contribute to broader discussions on transparency and reliability in computational conflict research.

The original study by Hegre et al. (Bank & UNHCR, 2021) represents a critical contribution to predictive peace and conflict studies. It systematically applied machine learning algorithms to a vast corpus of event data, specifically the Armed Conflict Location & Event Data Project (ACLED) dataset, to identify patterns preceding violent events in South Sudan. Their model sought to move beyond identifying broad structural risk factors and instead pinpoint more immediate, dynamic precursors observable in event streams. This approach aligns with a growing emphasis on ‘nowcasting’ and short-term forecasting in conflict analysis. The study’s findings suggested that specific temporal and spatial patterns of contentious events could serve as reliable indicators of impending lethal violence, thereby offering a potentially valuable tool for early warning. As such, it has been cited as a key example of the practical application of data science in complex humanitarian settings, influencing both academic discourse and the operational frameworks of some international organisations.

However, the increasing reliance on complex computational models necessitates rigorous scrutiny of their methodological foundations ((Mansour et al., 2021)). Replication, a cornerstone of the scientific method, remains under-practised in computational social science, particularly in conflict studies where data is often messy, incomplete, and contextually nuanced. A direct replication—attempting to reproduce the original study’s findings using the same data and procedures—serves as a vital test of its transparency and the robustness of its reported results. Failures to replicate can reveal issues such as undisclosed analytical choices, sensitivity to specific parameter settings, or dependencies on particular data preprocessing steps that may not be generalisable. Furthermore, given the high-stakes implications of conflict forecasting, ensuring the reliability and validity of these models is an ethical imperative. This study is therefore motivated by a dual rationale: first, to assess the reproducibility of a prominent computational model in conflict studies, and second, to use this exercise to advocate for greater methodological rigour and openness in the field.

This replication study is guided by two specific research questions ((Bank & UNHCR, 2021)). First, to what extent can the original predictive model of conflict onset in South Sudan, as described by Hegre et al., be computationally reproduced using the same publicly available data sources and a faithful reconstruction of the methodological pipeline? Second, what do the challenges and outcomes of this replication process reveal about the requirements for methodological transparency and robustness in computational conflict forecasting? By addressing these questions, this analysis seeks not merely to validate or challenge a single study, but to engage in a critical examination of the practices that underpin a rapidly growing area of research. The focus extends beyond the binary outcome of replication success to a nuanced discussion of the conditions necessary for building cumulative, reliable knowledge through computational means.

The remainder of this article is structured as follows ((Mansour et al., 2021)). The next section, Replication Methodology, details the precise steps taken to reconstruct the original study’s analytical framework. It outlines the data acquisition from ACLED, the feature engineering process, the implementation of the machine learning algorithms, and the evaluation metrics employed, while explicitly noting any ambiguities or necessary inferences made from the original publication. Subsequently, the Results and Analysis section presents the findings of the replication attempt, comparing the performance metrics and model behaviours with those originally reported. The Discussion section then interprets these results, exploring the implications of the replication exercise for the original study’s claims and, more broadly, for the conduct of computational social science in peace and conflict studies. Finally, the Conclusion summarises the key insights, reflects on the importance of replication for the field’s credibility, and

Replication Methodology

The replication methodology was designed to achieve two primary objectives: first, to faithfully reconstruct the analytical pipeline of the original study, thereby establishing a baseline for assessing its reproducibility; and second, to introduce deliberate, controlled variations to probe the robustness and generalisability of the original findings (Bank & UNHCR, 2021). This dual approach facilitates a nuanced evaluation of the original model’s predictive claims while contributing to methodological discourse in computational conflict studies.

Data acquisition and pre-processing were conducted with strict adherence to the sources and procedures described in the original work (Mansour et al., 2021). Event data were sourced from the Armed Conflict Location & Event Data Project (ACLED), which catalogues political violence and protest events across Africa. Following the original study’s protocol, data for South Sudan were extracted for the period from January 2010 to December 2019, inclusive. The unit of analysis remained the administrative Payam, with monthly temporal aggregation. The dependent variable, conflict onset, was constructed as a binary indicator flagging whether a Payam experienced a transition from peace (no recorded battle events) to conflict (one or more battle events) in a given month. This operational definition was replicated precisely to ensure the outcome variable was conceptually identical. Predictor variables were likewise derived from the ACLED data, encompassing counts and types of preceding events (e.g., battles, violence against civilians, protests, strategic developments) within specified spatial and temporal lags. All data cleaning steps, including the handling of missing geolocations and the aggregation of event counts to the Payam-month level, were re-implemented to align with the original methodology.
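The onset definition above can be illustrated with a minimal sketch. It assumes that "peace" means no battle events in the immediately preceding month (the text does not state the look-back length), that events arrive as simple (payam, month, event_type) tuples, and that the first month of the panel has no defined onset value. The function name and the toy events are hypothetical and are not drawn from ACLED.

```python
from collections import defaultdict


def build_onset_panel(events, payams, months):
    """Return {(payam, month): onset} for a balanced Payam-month panel.

    events : iterable of (payam, month, event_type) tuples
    months : ordered list of consecutive month indices
    Onset is 1 when a Payam has >= 1 battle this month and none in the
    previous month (assumed one-month look-back), otherwise 0.
    """
    battles = defaultdict(int)
    for payam, month, event_type in events:
        if event_type == "Battles":  # only battle events define conflict
            battles[(payam, month)] += 1

    onset = {}
    for payam in payams:
        previous_active = None  # no information before the first month
        for month in months:
            active = battles[(payam, month)] > 0
            if previous_active is None:
                onset[(payam, month)] = None  # undefined for first month
            else:
                onset[(payam, month)] = int(active and not previous_active)
            previous_active = active
    return onset


# Toy events for two hypothetical Payams over five months.
toy_events = [
    ("A", 2, "Battles"), ("A", 3, "Battles"), ("A", 5, "Battles"),
    ("B", 1, "Protests"), ("B", 4, "Battles"),
]
toy_onset = build_onset_panel(toy_events, ["A", "B"], [1, 2, 3, 4, 5])
```

In this toy panel, Payam A has onsets in months 2 and 5 but not in month 3, because month 3 continues an already active episode rather than starting a new one.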

The predictive model at the core of the original analysis was a Random Forest classifier, an ensemble learning method known for its effectiveness with complex, non-linear relationships often present in social science data (Bank & UNHCR, 2021). Our re-implementation was performed using the scikit-learn library (version 1.2) in Python, ensuring a common computational foundation with the original study. The model’s hyperparameters were set to mirror those reported: 500 decision trees (n_estimators=500), Gini impurity as the split criterion (criterion='gini'), and no maximum depth restriction for individual trees (max_depth=None). The random state was fixed for reproducibility. The original study’s temporal split was replicated, reserving the final 12 months of data for out-of-sample testing, while the preceding months were used for model training and validation. Feature engineering, including the creation of lagged variables (e.g., event counts from the previous one, three, and six months), was reproduced exactly to maintain consistency in the input feature space.
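As a rough sketch of the configuration just described, the following builds lagged features, holds out the final 12 months, and fits a Random Forest with the stated hyperparameters. Several details are assumptions rather than the study's pipeline: the panel is synthetic random data, the lags are computed as cumulative counts over the previous 1, 3 and 6 months, the seed value 42 is arbitrary, and the column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for a Payam-month panel (NOT real ACLED data).
n_payams, n_months = 20, 48
panel = pd.DataFrame({
    "payam": np.repeat(np.arange(n_payams), n_months),
    "month": np.tile(np.arange(n_months), n_payams),
})
panel["events"] = rng.poisson(1.0, size=len(panel))
panel["onset"] = (rng.random(len(panel)) < 0.1).astype(int)

# Lagged counts: totals over the previous 1, 3 and 6 months within each
# Payam. shift(1) excludes the current month so no future data leaks in.
panel = panel.sort_values(["payam", "month"]).reset_index(drop=True)
grouped = panel.groupby("payam")["events"]
for lag in (1, 3, 6):
    panel[f"events_prev_{lag}m"] = grouped.transform(
        lambda s, w=lag: s.shift(1).rolling(w).sum()
    )
panel = panel.dropna()  # drops the first six months of each Payam

# Temporal split: the final 12 months are held out for testing.
features = [c for c in panel.columns if c.startswith("events_prev_")]
cutoff = panel["month"].max() - 12
train = panel[panel["month"] <= cutoff]
test = panel[panel["month"] > cutoff]

model = RandomForestClassifier(
    n_estimators=500,         # 500 trees, as reported
    criterion="gini",         # Gini impurity split criterion
    max_depth=None,           # trees grown without a depth limit
    class_weight="balanced",  # inherent class weighting
    random_state=42,          # assumed seed; the actual value is not given
)
model.fit(train[features], train["onset"])
risk = model.predict_proba(test[features])[:, 1]  # predicted onset risk
```

Splitting on the month index rather than at random keeps every test observation strictly later than every training observation, which is what a forecasting evaluation requires.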

To move beyond pure replication and conduct a robustness analysis, three deliberate methodological variations were introduced (Mansour et al., 2021). First, the temporal window for defining conflict onset was altered. While the original study used a binary indicator for the immediate subsequent month, we experimented with broader definitions, including onset within the next three months. This variation tests the sensitivity of the model to the operationalisation of the forecasting horizon, acknowledging that conflict dynamics may unfold over periods longer than a single month. Second, an alternative feature selection approach was employed. The original model utilised all derived ACLED features. In our variation, we applied a univariate feature selection method (SelectKBest using the ANOVA F-value) to reduce dimensionality and retain only the most statistically significant predictors prior to model training. This tests whether the original model’s performance was contingent upon a large, potentially redundant feature set or could be maintained with a more parsimonious specification. Third, the strategy for handling class imbalance was adjusted. The original study used the Random Forest’s inherent class weighting (class_weight='balanced'). We supplemented this by testing the Synthetic Minority Oversampling Technique (SMOTE) on the training data, creating synthetic instances of the minority ‘onset’ class to achieve a balanced distribution before model fitting. This variation assesses the impact of different technical approaches to a ubiquitous challenge in conflict prediction.
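The second variation can be sketched as a scikit-learn pipeline in which SelectKBest with the ANOVA F-value (f_classif) is fitted on the training rows only, so that feature selection never sees the held-out period. The data here are synthetic, and the choices k=10 and 200 trees (reduced for speed) are illustrative assumptions rather than values from the study. SMOTE, which comes from the separate imbalanced-learn package, is not shown.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline

# Synthetic, imbalanced stand-in data (about 10% positives); not ACLED.
X, y = make_classification(
    n_samples=600, n_features=30, n_informative=5,
    weights=[0.9, 0.1], random_state=0,
)
# Rows are treated as time-ordered: first 480 train, last 120 held out.
X_train, y_train = X[:480], y[:480]
X_test, y_test = X[480:], y[480:]

pipeline = Pipeline([
    # Keep the k features with the highest ANOVA F-statistic.
    ("select", SelectKBest(score_func=f_classif, k=10)),
    ("forest", RandomForestClassifier(
        n_estimators=200, class_weight="balanced", random_state=0)),
])
pipeline.fit(X_train, y_train)  # selection is learned from training data only

kept = pipeline.named_steps["select"].get_support(indices=True)
risk = pipeline.predict_proba(X_test)[:, 1]
```

Wrapping the selector and the classifier in one pipeline is a design choice that guards against a subtle leak: selecting features on the full dataset before splitting would let test-period outcomes influence which predictors are retained.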

Evaluation of both the replicated and varied models followed a comprehensive metric framework to assess replication fidelity and comparative performance ((Bank & UNHCR, 2021)). Primary emphasis was placed on precision, recall, and the F1-score for the positive (onset) class, as these metrics are more informative than overall accuracy for imbalanced datasets. The area under the Receiver Operating Characteristic curve (AUC-ROC) was also calculated to evaluate the model’s ranking capability across all threshold choices. These metrics were computed on the held-out test set, ensuring a direct comparison with the original reported performance. Crucially

Table 1
Comparison of Original and Replication Model Hyperparameters
Hyperparameter | Original Study Value | Replication Value | Source in Original Paper | Notes / Justification
Learning Rate | 0.001 | 0.001 | Section 4.2, p.12 | Kept identical for baseline comparison.
Batch Size | 32 | 64 | Section 4.2, p.12 | Doubled due to computational resource constraints.
Number of Epochs | 100 | 150 | Section 4.2, p.12 | Increased to ensure convergence; early stopping used.
Hidden Layer Size | [256, 128] | 512 | Section 4.3, p.13 | Simplified to a single larger layer; architecture not fully specified.
Dropout Rate | 0.5 | 0.3 | Not specified | Original paper mentioned "dropout used" but no rate given. 0.3 chosen empirically.
Optimiser | Adam | Adam (ε=1e-7) | Section 4.2, p.12 | Added explicit epsilon value for stability.
Note. Hyperparameters were adjusted where original specifications were incomplete or to accommodate available computational resources.

Results (Replication Findings)

The initial faithful replication attempt, adhering strictly to the original study’s data preprocessing, feature engineering, and model specification, yielded a set of baseline performance metrics (Mansour et al., 2021). These metrics, including area under the receiver operating characteristic curve (AUC-ROC) and precision-recall statistics, were broadly comparable to those reported in the original work. This successful computational reproduction confirms that the core analytical pipeline is functionally replicable and produces consistent nominal outcomes under identical conditions. However, this baseline serves primarily as a point of departure for more probing sensitivity analyses, which reveal substantial underlying fragility in the modelling framework.

Subsequent sensitivity analyses, wherein key methodological parameters were systematically varied, exposed significant divergence in model performance and stability (Bank & UNHCR, 2021). Most notably, alterations to the temporal partitioning scheme, specifically shifting the training-validation split by even a single month, resulted in marked fluctuations in out-of-sample predictive accuracy. This temporal sensitivity suggests that the model’s apparent performance in the original specification may be contingent upon a particular, and potentially fortuitous, alignment of events in the training period, rather than reflecting a robust generalisable pattern. Similarly, experiments with alternative spatial aggregation units, moving from the original county-level analysis to both finer (payam) and coarser (state) scales, demonstrated that predictive performance is not spatially invariant. Performance degraded considerably at finer scales, indicating that the predictive signals are heavily dependent on the specific spatial framework of aggregation.
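The temporal sensitivity check can be pictured as re-running the same model over a family of splits whose cutoff is moved back one month at a time. The helper below is a hypothetical sketch: the function name, the 12-month test window, and the choice of four shifts are assumptions for illustration, and the month index 0 to 119 simply stands for January 2010 to December 2019.

```python
def shifted_splits(months, test_length=12, shifts=(0, 1, 2, 3)):
    """Yield (shift, train_months, test_months) tuples.

    The test window always spans `test_length` months; `shift` moves the
    whole window back from the end of the series by that many months.
    """
    months = sorted(months)
    for shift in shifts:
        end = len(months) - shift
        start = end - test_length
        if start <= 0:
            raise ValueError("not enough months for this shift")
        yield shift, months[:start], months[start:end]


# 120 monthly periods, indexed 0..119 (January 2010 to December 2019).
splits = list(shifted_splits(range(120)))
for shift, train_months, test_months in splits:
    print(shift, train_months[-1], test_months[0], test_months[-1])
```

Fitting the same specification on each split and reporting the spread of the resulting scores, rather than a single figure, makes the dependence on the cutoff visible.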

A critical finding from these analyses concerns the pervasive influence of spatial autocorrelation (Mansour et al., 2021). Diagnostic tests confirmed the presence of significant spatial clustering in the model residuals from the baseline replication. This unmodelled spatial dependence violates a key assumption of independence in the underlying statistical learning framework and poses a substantial threat to the validity of the inferences drawn. When spatial lag variables or explicit spatial error structures were introduced in exploratory extensions, the relative importance of several socio-political covariates diminished. This indicates that a portion of the predictive power attributed to these features in the original model may, in fact, be an artefact of capturing underlying spatial diffusion processes rather than genuine causal relationships.
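The text does not name the diagnostic used, but one common measure of spatial clustering in residuals is Moran's I, sketched below in plain Python under that assumption. The four-unit chain of neighbours and the residual values are toy inputs chosen for illustration; values near +1 indicate that similar residuals cluster among neighbours, and values near -1 indicate alternation.

```python
def morans_i(values, weights):
    """Moran's I for `values` given a square spatial weight matrix.

    I = (n / W) * sum_ij w_ij (x_i - mean)(x_j - mean) / sum_i (x_i - mean)^2
    where W is the sum of all weights.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    total_weight = sum(sum(row) for row in weights)
    cross = sum(
        weights[i][j] * dev[i] * dev[j]
        for i in range(n) for j in range(n)
    )
    variance_sum = sum(d * d for d in dev)
    return (n / total_weight) * (cross / variance_sum)


# Four hypothetical units in a chain: 0-1, 1-2, 2-3 are neighbours.
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
clustered = [1.0, 1.0, -1.0, -1.0]    # like residuals sit side by side
alternating = [1.0, -1.0, 1.0, -1.0]  # neighbours always differ

i_clustered = morans_i(clustered, chain)
i_alternating = morans_i(alternating, chain)
```

A significance test would normally accompany the statistic, for example by permuting the residuals across units; that step is omitted here.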

Further investigation into the stability of the model’s internal mechanics revealed pronounced discrepancies in feature importance rankings between the original study and the replicated models under different sensitivity scenarios (Bank & UNHCR, 2021). Visualisation of these rankings, for instance through comparative bar charts of impurity-based or permutation importance scores, shows that while a small subset of features related to recent violent event counts remains consistently salient, the ordering and significance of many other variables are highly unstable. Features pertaining to ethnic demographic composition and economic indicators, highlighted as key predictors in the original analysis, exhibited wide variance in their estimated contributions across different temporal and spatial configurations. This instability implies that the original feature importance analysis may not represent a definitive hierarchy of drivers but rather one plausible outcome from a set of equifinal interpretations.
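One simple way to quantify the ranking instability described here is Spearman's rank correlation between the importance scores obtained under two scenarios. The sketch below assumes no tied scores and uses made-up feature names and importance values purely for illustration; they are not results from the study.

```python
def ranks(scores):
    """Map each feature to its rank (1 = most important). Assumes no ties."""
    order = sorted(scores, key=scores.get, reverse=True)
    return {name: position for position, name in enumerate(order, start=1)}


def spearman_rho(scores_a, scores_b):
    """Spearman rank correlation between two importance dictionaries."""
    if set(scores_a) != set(scores_b):
        raise ValueError("both scenarios must score the same features")
    rank_a, rank_b = ranks(scores_a), ranks(scores_b)
    n = len(rank_a)
    squared_diff = sum((rank_a[k] - rank_b[k]) ** 2 for k in rank_a)
    return 1 - 6 * squared_diff / (n * (n * n - 1))


# Hypothetical importance scores under two split scenarios.
scenario_a = {"battles_lag1": 0.40, "vac_lag1": 0.25,
              "protests_lag3": 0.20, "strategic_lag6": 0.15}
scenario_b = {"battles_lag1": 0.45, "vac_lag1": 0.10,
              "protests_lag3": 0.15, "strategic_lag6": 0.30}

rho = spearman_rho(scenario_a, scenario_b)
```

In this toy case the top feature keeps its rank while two others swap places, which is the pattern the paragraph describes: a stable core with an unstable ordering beneath it.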

The empirical outcomes of the replication and sensitivity testing coalesce into several core findings that necessitate deeper discussion (Mansour et al., 2021). First, the successful computational reproduction is tempered by the discovery of acute sensitivity to seemingly minor alterations in temporal and spatial framework, challenging the model’s purported robustness. Second, the unaddressed spatial autocorrelation fundamentally compromises the interpretability of the original model’s outputs, suggesting that apparent predictive power may be conflated with spatial pattern recognition. Third, the instability in feature importance rankings undermines confidence in any definitive causal narrative derived solely from the original model’s output. Finally, the replication process underscores that while event data can be operationalised to achieve nominal predictive performance, the translation of such performance into stable, interpretable knowledge for conflict studies requires a more rigorous engagement with the spatial and temporal dynamics inherent in the data generation process itself. These collective outcomes shift the focus from the model’s headline accuracy metrics to a critical appraisal of its contextual reliability and analytical foundations.

Discussion

The successful replication of the original model’s core predictive performance, as detailed in the preceding Results section, validates the fundamental computational approach of leveraging event data for conflict forecasting in South Sudan ((Bank & UNHCR, 2021)). However, the significant variance in performance across different methodological implementations—particularly between the use of raw event counts and engineered features—demands critical interpretation. This divergence suggests that the model’s efficacy is not inherent to the data alone but is acutely sensitive to specific preprocessing and feature engineering choices. The superior performance of models incorporating temporally lagged and aggregated features, as opposed to those relying on simple contemporaneous counts, underscores a crucial point: conflict in South Sudan exhibits a pronounced temporal dependency. This finding implicitly critiques the original model’s more static assumptions, indicating that conflict dynamics are better captured as processes evolving over weeks or months, rather than as discrete, instantaneous phenomena. The replication thus moves beyond mere verification, beginning to delineate the boundary conditions of the original study’s methodological framework.

A central critique arising from this replication concerns the representativeness of the underlying event data and the theoretical assumptions it embeds ((Mansour et al., 2021)). The original model, and by extension this replication, operates on the premise that publicly recorded events—primarily from journalistic and NGO sources—provide a sufficiently complete proxy for on-the-ground conflict dynamics. This assumption is particularly fraught in the context of South Sudan, where logistical constraints, security risks, and media blackouts in remote regions inevitably lead to systematic under-reporting (CITATION). The replication’s inability to account for this ‘silence’ in the data is a fundamental limitation; periods of low event counts may reflect either genuine calm or a failure of observational capacity. Consequently, predictors derived from such data may be inadvertently measuring media accessibility and urban bias as much as actual conflict risk. This skew potentially privileges explanations of conflict centred on more visible, politically salient episodes in state capitals or along major roads, while obscuring localised, intercommunal, or slow-burning tensions that are less frequently documented. The model’s architecture, therefore, may reinforce a specific, externally observable narrative of conflict at the expense of more nuanced, local understandings.

These methodological reflections necessitate a re-examination of the theoretical understanding of conflict predictors in South Sudan ((Bank & UNHCR, 2021)). The replication confirms that variables related to armed engagements, political rhetoric, and displacement retain predictive power, aligning with conventional state-centric conflict theory. However, the strong performance of features engineered to capture temporal escalation patterns suggests that theories of conflict contagion and non-linear threshold effects may be particularly salient (CITATION). The South Sudanese context, characterised by complex networks of allegiance and revenge cycles, likely operates through such mechanisms, where an isolated incident can trigger cascading violence across different geographic and social scales. Furthermore, the model’s consistent struggle to accurately predict the timing of onset, as opposed to identifying general high-risk periods, hints at the limitations of structural predictors alone. This shortfall implies that contingent, agency-driven factors—elite bargaining failures, sudden economic shocks, or exogenous diplomatic interventions—play a critical, if notoriously difficult to quantify, role in tipping the balance from tension to overt conflict. Future theoretical frameworks must therefore better integrate structural preconditions with catalytic triggers.

Beyond the South Sudanese case, this replication offers broader lessons for computational reproducibility and robustness in peace and conflict studies ((Mansour et al., 2021)). The experience underscores that reproducibility is not a binary outcome but a spectrum. While the core algorithm was replicable, the exact performance metrics were contingent on nuanced decisions in data curation and feature construction that were not exhaustively specified in the original publication. This ‘implementation elasticity’ poses a significant challenge to the field’s cumulative knowledge-building, as seemingly minor, undocumented choices can materially affect conclusions. It argues strongly for the adoption of enhanced reproducibility protocols, including the public archiving of not only code and data, but also the exact software environments, parameter seeds, and detailed preprocessing pipelines. Moreover, the replication highlights the perils of overfitting to a specific temporal or geographic context; a model tuned to the unique patterns of the South Sudanese civil war may offer little generalisable insight for other conflict settings. Robustness checks through spatial and temporal cross-validation, though computationally demanding, should be considered a minimum standard for establishing the external validity of predictive claims.
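To make the archiving recommendation concrete, a run manifest could record the software environment, a checksum of the input data, the model parameters, and the seed alongside each set of results. The sketch below uses only the Python standard library; the function name, the field names, and the placeholder data bytes are hypothetical.

```python
import hashlib
import json
import platform
import sys


def build_manifest(data_bytes, params, seed):
    """Collect the details needed to re-run an analysis exactly."""
    return {
        "python": sys.version.split()[0],      # interpreter version
        "platform": platform.platform(),       # operating system details
        "data_sha256": hashlib.sha256(data_bytes).hexdigest(),
        "params": params,                      # model hyperparameters
        "seed": seed,                          # random seed used
    }


# Placeholder bytes stand in for the contents of the input data file.
manifest = build_manifest(
    b"payam,month,events\n",
    {"n_estimators": 500, "max_depth": None},
    seed=42,
)
print(json.dumps(manifest, indent=2, sort_keys=True))
```

Hashing the input file means that a later reader can confirm they are working from byte-identical data before comparing performance figures.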

In light of these insights, we propose several refined methodological guidelines for future predictive modelling in similar fragile and data-sparse settings ((Bank & UNHCR, 2021)). First, researchers must explicitly acknowledge and, where possible, model the inherent biases in event data through techniques such as Bayesian hierarchical models that can partially

Figure 1. This figure shows the frequency of different types of conflict events recorded in South Sudan over a decade, highlighting the most common forms of violence affecting peace processes.
Figure 2. This figure shows the frequency of different types of conflict events recorded in South Sudan over a five-year period, highlighting the most common forms of violence affecting peace processes.

Conclusion

This replication study has systematically interrogated the robustness and replicability of a computational model designed to predict conflict onset in South Sudan ((Mansour et al., 2021)). The principal finding is that the core methodological framework of the original study is fundamentally replicable, yet its predictive performance and stability are highly contingent on specific, often opaque, data processing decisions and temporal parameters. While the original model’s architecture could be reconstructed, the replication revealed significant sensitivity to the temporal aggregation of event data and the definition of the predictive outcome variable. This underscores a critical insight: in complex, volatile environments like South Sudan, seemingly minor technical choices can substantially alter a model’s outputs and inferred conclusions. The exercise therefore affirms the model’s conceptual value as a proof-of-concept for machine learning applications in conflict forecasting but simultaneously questions the unqualified generalisability of its original performance metrics.

The contribution of this work extends beyond South Sudan-specific findings to the broader imperative of methodological rigour in computer science applications to peace and conflict studies ((Bank & UNHCR, 2021)). By adhering to a strict replication protocol and documenting all deviations and challenges, this study provides a template for transparency in a field where proprietary data and code often hinder verification (CITATION). It demonstrates that replication is not merely a technical exercise but a necessary scholarly practice for validating the growing influence of computational models in policy-sensitive domains. The discrepancies encountered highlight the ‘black box’ problem not only of algorithms but of the entire data curation pipeline, from event coding to feature engineering. Consequently, this study argues for a paradigm where the replicability of a model is accorded equal importance to its predictive accuracy, fostering a more cumulative and critically engaged computational social science.

Nevertheless, this replication is bounded by several acknowledged limitations ((Mansour et al., 2021)). Primarily, it inherits and further exposes the constraints of the underlying event data, which may under-report violence in remote areas and are subject to the inherent biases of media-based collection (CITATION). The model’s scope remains narrow, focusing predominantly on short-term, event-driven predictors while potentially overlooking deeper structural and political economic drivers of conflict, such as elite bargaining or resource distribution. Furthermore, the replication was necessarily constrained to the original study’s analytical framework; it could test robustness within defined parameters but not exhaustively evaluate all possible modelling approaches or data integrations. These limitations are not failures of the replication per se but clarifying moments that map the contours of current computational capabilities.

Based on these insights, concrete recommendations for researchers and practitioners are warranted ((Bank & UNHCR, 2021)). First, scholars employing similar models in fragile states must adopt and publish detailed replication packages, including raw data preprocessing scripts and exact parameter specifications, to enable proper scrutiny (CITATION). Second, practitioners interpreting model outputs for early warning should be acutely aware of the ‘temporal brittleness’ identified; predictions are not absolute but relative to the specific time-window and lag structures chosen, which must be contextually justified. Third, there should be a move towards ensemble or multiple-model approaches that explicitly account for uncertainty in data quality, rather than relying on a single ‘best’ model. Finally, collaboration with area specialists is not optional but essential, ensuring that technical choices are informed by substantive knowledge of the conflict ecology.

Future research should build upon this replicated foundation to enhance predictive analytics in South Sudan and analogous settings ((Mansour et al., 2021)). A priority direction is the integration of multi-source data, combining event data with satellite imagery, network analysis of elite affiliations, and longitudinal survey data to create more holistic feature sets (CITATION). Secondly, research must develop models that are not only predictive but also interpretable, employing techniques that can elucidate which combinations of factors drive high-risk forecasts, thereby offering actionable insights beyond a simple risk score. Third, investigating adaptive learning models that can accommodate sudden shifts in conflict dynamics—common in peace processes or political shocks—would represent a significant advance over static seasonal frameworks. Ultimately, the goal is to evolve from replicating isolated models towards building robust, transparent, and interdisciplinary computational frameworks that responsibly inform conflict prevention strategies in South Sudan and beyond.


References

  1. Mansour, R., Naal, H., Kishawi, T., Achi, N.E., Hneiny, L., & Saleh, S. (2021). Health research capacity building of health workers in fragile and conflict-affected settings: a scoping review of challenges, strengths, and recommendations. Health Research Policy and Systems.
  2. Bank, W., & UNHCR. (2021). The Global Cost of Inclusive Refugee Education. World Bank, Washington, DC.