liwc_df <- liwc_df %>%
  mutate(sovereignty = ifelse(str_detect(text, "sovereignty"), 1, 0),
         intervention = ifelse(str_detect(text, "intervention"), 1, 0),
         human_rights = ifelse(str_detect(text, "human rights"), 1, 0)) %>%
  mutate(sovereignty_count = str_count(text, "sovereignty"),
         intervention_count = str_count(text, "intervention"),
         human_rights_count = str_count(text, "human rights"))
liwc_modeling
Introduction
In this project, I use linguistic features of state representatives' speech transcripts from the United Nations General Debate Corpus (UNGDC) to predict regime type. The goal is twofold. First, I run a hard test of the hypothesis that countries with distinct regime types show different linguistic styles: if the speaker's regime type can be predicted from linguistic features alone, that is strong evidence that linguistic styles differ across regime types. Second, this project identifies the key linguistic features that act as strong signals of a state's regime type. I further interpret the substantive implications of strong coefficients and check how consistent their significance is across models. This script takes the LIWC feature scores and merges them with country-year level metadata. "data/raw/controls/controls.csv" contains a battery of country-year level variables that might confound the statistical modeling. With the resulting liwc_meta dataset, at the country-year level, I run a series of statistical models that probe the relationship between linguistic features and regime type. To preview, LIWC features alone have strong predictive power for regime type, even without the help of metadata.
I test whether there is a correlation between a country's invocation of international legal norms and its regime type. Among many candidates, I select three key legal principles that have been prominent throughout the history of international politics: the principle of sovereignty, the principle of non-intervention, and the principle of human rights. Binary variables capture whether each principle was invoked, and count variables measure the number of times it was mentioned within one speech.
Variables on legal norms
- Three binary variables: sovereignty, intervention, and human_rights. These are coded 1 when the keyword is detected at least once during the speech, 0 otherwise. (A case-insensitive variant is sketched after this list.)
- Three count variables: sovereignty_count, intervention_count, and human_rights_count. These record the number of times each keyword appears during the speech.
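Note that str_detect() and str_count() are case-sensitive by default, so a sentence-initial "Sovereignty" would be missed. Below is a minimal sketch of a case-insensitive variant using stringr's regex() modifier, shown for one keyword only:

library(dplyr)
library(stringr)

# Hedged sketch: case-insensitive detection and counting for one keyword.
liwc_df <- liwc_df %>%
  mutate(
    sovereignty = ifelse(
      str_detect(text, regex("sovereignty", ignore_case = TRUE)), 1, 0),
    sovereignty_count = str_count(text, regex("sovereignty", ignore_case = TRUE)))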
Statistical Modeling
A random effects model is appropriate only when the unit-specific intercepts are uncorrelated with the input variables; otherwise it leads to omitted variable bias. I suspect this assumption may fail here, as linguistic features (the inputs) often correlate highly with time-invariant characteristics of countries. However, there is no consistent theoretical conjecture on the correlation between the two. To accommodate this uncertainty, I also supplement the results with a fixed effects model.
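One conventional way to adjudicate between the two specifications is a Hausman test, available as phtest() in the plm package. A minimal sketch, assuming the `data` frame and panel index constructed below, with two illustrative LIWC features standing in for the full set:

library(plm)

# Hedged sketch: Hausman test on a toy specification; a small p-value
# favors the fixed effects (within) model over random effects.
fe <- plm(dd_democracy ~ Analytic + Clout, data = data,
          index = c("ccode_iso", "year"), model = "within")
re <- plm(dd_democracy ~ Analytic + Clout, data = data,
          index = c("ccode_iso", "year"), model = "random")
phtest(fe, re)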
Model 0: pooled model with LIWC features
The plm package does not allow the dot shorthand (y ~ .), which selects all columns except the specified dependent variable. To avoid manually entering all the LIWC features, I use a function called expand_formula, adapted from StackOverflow.
"
Input: dependent variable in a character form, names of the input features
Output: formula ready to be used for plm package.
Example: expand_formula(output ~ .,names(data))
output ~ x1 + x2 + ... + xn
"
[1] "\n Input: dependent variable in a character form, names of the input features\n Output: formula ready to be used for plm package. \n Example: expand_formula(output ~ .,names(data)) \n output ~ x1 + x2 + ... + xn\n "
<-
expand_formula function(form="A ~.",varNames=c("A","B","C")){
<- any(grepl('.',form,fixed=TRUE))
has_dot if(has_dot){
<- intersect(as.character(as.formula(form)),
ii
varNames)<- varNames[!grepl(paste0(ii,collapse='|'),varNames)]
varNames
<- paste0(varNames,collapse='+')
exp as.formula(gsub('.',exp,form,fixed=TRUE))
}else as.formula(form)
}
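A quick check that the helper behaves as documented, on toy variable names:

# The dot expands to every name except the dependent variable:
expand_formula("y ~ .", c("y", "x1", "x2", "x3"))
# y ~ x1 + x2 + x3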
- I create separate blocks of variables so that the inclusion of metadata can be toggled in later analyses.
liwc_df <- liwc_df %>%
  rename("function_feature" = "function")

# Carve out 2021+ as the validation set before trimming liwc_df,
# so that the validation split is not empty.
validation <- liwc_df[liwc_df$year >= 2021, ]
liwc_df <- liwc_df[liwc_df$year < 2021, ]
print(dim(liwc_df))
[1] 10181 191
liwc_inputs <- liwc_df[, 5:122]
id <- liwc_df[, 1:3]
y <- liwc_df[, 149]
controls <- liwc_df[, c(126:148, 150:191)]

data <- cbind(y, id, liwc_inputs)
data_controls <- cbind(y, id, liwc_inputs, controls)
controls_only <- cbind(y, id, controls)
Results of the statistical analysis
To supplement the machine-learning approach, I present a set of conventional statistical tests accounting for error correlation.
model <- plm(expand_formula("dd_democracy ~. ", names(data)),
             data = data,
             index = c("ccode_iso", "year"),
             model = "within")

# standard error clustered by both time and group.
model_cluster_twoway <- kable(
  tidy(coeftest(model, vcov = vcovDC(model, type = "sss"))),
  caption = "Pooled model with cluster robust standard errors")
Warning in sqrt(diag(se)): NaNs produced
# standard error clustered by time.
model_cluster_time <- kable(
  tidy(coeftest(model, vcov = vcovHC(model, type = "sss", cluster = "time"))),
  caption = "Pooled model with clustering around time")

# standard error clustered by country
model_cluster_country <- kable(
  tidy(coeftest(model, vcov = vcovHC(model, type = "sss", cluster = "group"))),
  caption = "Pooled model with clustering around country")

model_summary <- modelsummary(
  model,
  stars = TRUE,
  vcov = list(
    vcovDC(model, type = "sss"),
    vcovHC(model, type = "sss", cluster = "time"),
    vcovHC(model, type = "sss", cluster = "group")
  ),
  caption = "Pooled model with different types of standard errors",
  output = "kableExtra",
  escape = FALSE
)

model_summary %>%
  kable_classic("hover",
                full_width = F,
                html_font = "Cambria") %>%
  scroll_box(width = "100%", height = "400px")
| | (1) | (2) | (3) |
|---|---|---|---|
year1951 | −0.045 | −0.045 | −0.045 |
(0.059) | (0.059) | (0.059) | |
year1952 | 0.041 | 0.041 | 0.041 |
(0.062) | (0.062) | (0.062) | |
year1953 | 0.010 | 0.010 | 0.010 |
(0.061) | (0.061) | (0.061) | |
year1954 | 0.003 | 0.003 | 0.003 |
(0.062) | (0.062) | (0.062) | |
year1955 | 0.002 | 0.002 | 0.002 |
(0.062) | (0.062) | (0.062) | |
year1956 | −0.039 | −0.039 | −0.039 |
(0.056) | (0.056) | (0.056) | |
year1957 | −0.072 | −0.072 | −0.072 |
(0.055) | (0.055) | (0.055) | |
year1958 | −0.045 | −0.045 | −0.045 |
(0.055) | (0.055) | (0.055) | |
year1959 | −0.040 | −0.040 | −0.040 |
(0.055) | (0.055) | (0.055) | |
year1960 | −0.009 | −0.009 | −0.009 |
(0.055) | (0.055) | (0.055) | |
year1961 | 0.017 | 0.017 | 0.017 |
(0.055) | (0.055) | (0.055) | |
year1962 | −0.014 | −0.014 | −0.014 |
(0.053) | (0.053) | (0.053) | |
year1963 | −0.061 | −0.061 | −0.061 |
(0.053) | (0.053) | (0.053) | |
year1964 | −0.046 | −0.046 | −0.046 |
(0.053) | (0.053) | (0.053) | |
year1965 | −0.050 | −0.050 | −0.050 |
(0.053) | (0.053) | (0.053) | |
year1966 | −0.067 | −0.067 | −0.067 |
(0.054) | (0.054) | (0.054) | |
year1967 | −0.077 | −0.077 | −0.077 |
(0.053) | (0.053) | (0.053) | |
year1968 | −0.107* | −0.107* | −0.107* |
(0.053) | (0.053) | (0.053) | |
year1969 | −0.099+ | −0.099+ | −0.099+ |
(0.053) | (0.053) | (0.053) | |
year1970 | −0.061 | −0.061 | −0.061 |
(0.056) | (0.056) | (0.056) | |
year1971 | −0.078 | −0.078 | −0.078 |
(0.052) | (0.052) | (0.052) | |
year1972 | −0.080 | −0.080 | −0.080 |
(0.051) | (0.051) | (0.051) | |
year1973 | −0.098+ | −0.098+ | −0.098+ |
(0.052) | (0.052) | (0.052) | |
year1974 | −0.081 | −0.081 | −0.081 |
(0.052) | (0.052) | (0.052) | |
year1975 | −0.078 | −0.078 | −0.078 |
(0.051) | (0.051) | (0.051) | |
year1976 | −0.084 | −0.084 | −0.084 |
(0.051) | (0.051) | (0.051) | |
year1977 | −0.090+ | −0.090+ | −0.090+ |
(0.051) | (0.051) | (0.051) | |
year1978 | −0.074 | −0.074 | −0.074 |
(0.051) | (0.051) | (0.051) | |
year1979 | −0.065 | −0.065 | −0.065 |
(0.051) | (0.051) | (0.051) | |
year1980 | −0.048 | −0.048 | −0.048 |
(0.051) | (0.051) | (0.051) | |
year1981 | −0.044 | −0.044 | −0.044 |
(0.051) | (0.051) | (0.051) | |
year1982 | −0.062 | −0.062 | −0.062 |
(0.051) | (0.051) | (0.051) | |
year1983 | −0.016 | −0.016 | −0.016 |
(0.051) | (0.051) | (0.051) | |
year1984 | −0.015 | −0.015 | −0.015 |
(0.051) | (0.051) | (0.051) | |
year1985 | −0.052 | −0.052 | −0.052 |
(0.052) | (0.052) | (0.052) | |
year1986 | −0.025 | −0.025 | −0.025 |
(0.052) | (0.052) | (0.052) | |
year1987 | −0.031 | −0.031 | −0.031 |
(0.052) | (0.052) | (0.052) | |
year1988 | −0.016 | −0.016 | −0.016 |
(0.052) | (0.052) | (0.052) | |
year1989 | 0.007 | 0.007 | 0.007 |
(0.052) | (0.052) | (0.052) | |
year1990 | 0.054 | 0.054 | 0.054 |
(0.052) | (0.052) | (0.052) | |
year1991 | 0.052 | 0.052 | 0.052 |
(0.052) | (0.052) | (0.052) | |
year1992 | 0.086+ | 0.086+ | 0.086+ |
(0.051) | (0.051) | (0.051) | |
year1993 | 0.102* | 0.102* | 0.102* |
(0.051) | (0.051) | (0.051) | |
year1994 | 0.112* | 0.112* | 0.112* |
(0.051) | (0.051) | (0.051) | |
year1995 | 0.112* | 0.112* | 0.112* |
(0.051) | (0.051) | (0.051) | |
year1996 | 0.098+ | 0.098+ | 0.098+ |
(0.052) | (0.052) | (0.052) | |
year1997 | 0.079 | 0.079 | 0.079 |
(0.052) | (0.052) | (0.052) | |
year1998 | 0.065 | 0.065 | 0.065 |
(0.052) | (0.052) | (0.052) | |
year1999 | 0.080 | 0.080 | 0.080 |
(0.052) | (0.052) | (0.052) | |
year2000 | 0.082 | 0.082 | 0.082 |
(0.052) | (0.052) | (0.052) | |
year2001 | 0.107* | 0.107* | 0.107* |
(0.053) | (0.053) | (0.053) | |
year2002 | 0.094+ | 0.094+ | 0.094+ |
(0.053) | (0.053) | (0.053) | |
year2003 | 0.086+ | 0.086+ | 0.086+ |
(0.052) | (0.052) | (0.052) | |
year2004 | 0.089+ | 0.089+ | 0.089+ |
(0.053) | (0.053) | (0.053) | |
year2005 | 0.084 | 0.084 | 0.084 |
(0.052) | (0.052) | (0.052) | |
year2006 | 0.090+ | 0.090+ | 0.090+ |
(0.052) | (0.052) | (0.052) | |
year2007 | 0.089+ | 0.089+ | 0.089+ |
(0.053) | (0.053) | (0.053) | |
year2008 | 0.106* | 0.106* | 0.106* |
(0.053) | (0.053) | (0.053) | |
year2009 | 0.124* | 0.124* | 0.124* |
(0.053) | (0.053) | (0.053) | |
year2010 | 0.098+ | 0.098+ | 0.098+ |
(0.053) | (0.053) | (0.053) | |
year2011 | 0.112* | 0.112* | 0.112* |
(0.053) | (0.053) | (0.053) | |
year2012 | 0.144** | 0.144** | 0.144** |
(0.053) | (0.053) | (0.053) | |
year2013 | 0.120* | 0.120* | 0.120* |
(0.053) | (0.053) | (0.053) | |
year2014 | 0.135* | 0.135* | 0.135* |
(0.053) | (0.053) | (0.053) | |
year2015 | 0.137* | 0.137* | 0.137* |
(0.054) | (0.054) | (0.054) | |
year2016 | 0.118* | 0.118* | 0.118* |
(0.053) | (0.053) | (0.053) | |
year2017 | 0.147** | 0.147** | 0.147** |
(0.053) | (0.053) | (0.053) | |
year2018 | 0.109* | 0.109* | 0.109* |
(0.053) | (0.053) | (0.053) | |
year2019 | 0.116* | 0.116* | 0.116* |
(0.054) | (0.054) | (0.054) | |
year2020 | 0.119* | 0.119* | 0.119* |
(0.060) | (0.060) | (0.060) | |
WC | 0.000+ | 0.000+ | 0.000+ |
(0.000) | (0.000) | (0.000) | |
Analytic | 0.006*** | 0.006*** | 0.006*** |
(0.002) | (0.002) | (0.002) | |
Clout | 0.005* | 0.005* | 0.005* |
(0.002) | (0.002) | (0.002) | |
Authentic | 0.002 | 0.002 | 0.002 |
(0.003) | (0.003) | (0.003) | |
Tone | 0.001 | 0.001 | 0.001 |
(0.001) | (0.001) | (0.001) | |
WPS | −0.002*** | −0.002*** | −0.002*** |
(0.000) | (0.000) | (0.000) | |
BigWords | −0.004 | −0.004 | −0.004 |
(0.003) | (0.003) | (0.003) | |
Dic | −0.004 | −0.004 | −0.004 |
(0.005) | (0.005) | (0.005) | |
Linguistic | 0.024* | 0.024* | 0.024* |
(0.010) | (0.010) | (0.010) | |
function_feature | −0.037*** | −0.037*** | −0.037*** |
(0.011) | (0.011) | (0.011) | |
pronoun | 0.156 | 0.156 | 0.156 |
(0.557) | (0.557) | (0.557) | |
ppron | −0.220 | −0.220 | −0.220 |
(0.558) | (0.558) | (0.558) | |
i | 0.160*** | 0.160*** | 0.160*** |
(0.039) | (0.039) | (0.039) | |
we | 0.043 | 0.043 | 0.043 |
(0.034) | (0.034) | (0.034) | |
you | 0.040 | 0.040 | 0.040 |
(0.039) | (0.039) | (0.039) | |
shehe | 0.092* | 0.092* | 0.092* |
(0.046) | (0.046) | (0.046) | |
they | 0.040 | 0.040 | 0.040 |
(0.032) | (0.032) | (0.032) | |
ipron | −0.121 | −0.121 | −0.121 |
(0.557) | (0.557) | (0.557) | |
det | 0.004 | 0.004 | 0.004 |
(0.007) | (0.007) | (0.007) | |
article | 0.004 | 0.004 | 0.004 |
(0.010) | (0.010) | (0.010) | |
number | 0.002 | 0.002 | 0.002 |
(0.009) | (0.009) | (0.009) | |
prep | 0.010 | 0.010 | 0.010 |
(0.009) | (0.009) | (0.009) | |
auxverb | 0.038** | 0.038** | 0.038** |
(0.014) | (0.014) | (0.014) | |
adverb | 0.017+ | 0.017+ | 0.017+ |
(0.009) | (0.009) | (0.009) | |
conj | 0.026** | 0.026** | 0.026** |
(0.009) | (0.009) | (0.009) | |
negate | 0.055* | 0.055* | 0.055* |
(0.022) | (0.022) | (0.022) | |
verb | −0.012 | −0.012 | −0.012 |
(0.012) | (0.012) | (0.012) | |
adj | −0.024** | −0.024** | −0.024** |
(0.009) | (0.009) | (0.009) | |
quantity | 0.024** | 0.024** | 0.024** |
(0.009) | (0.009) | (0.009) | |
Drives | −0.099*** | −0.099*** | −0.099*** |
(0.027) | (0.027) | (0.027) | |
affiliation | 0.108*** | 0.108*** | 0.108*** |
(0.028) | (0.028) | (0.028) | |
achieve | 0.110*** | 0.110*** | 0.110*** |
(0.026) | (0.026) | (0.026) | |
power | 0.099*** | 0.099*** | 0.099*** |
(0.027) | (0.027) | (0.027) | |
Cognition | −0.036 | −0.036 | −0.036 |
(0.049) | (0.049) | (0.049) | |
allnone | −0.015 | −0.015 | −0.015 |
(0.050) | (0.050) | (0.050) | |
cogproc | 0.032 | 0.032 | 0.032 |
(0.050) | (0.050) | (0.050) | |
insight | 0.031 | 0.031 | 0.031 |
(0.020) | (0.020) | (0.020) | |
cause | 0.003 | 0.003 | 0.003 |
(0.013) | (0.013) | (0.013) | |
discrep | 0.008 | 0.008 | 0.008 |
(0.021) | (0.021) | (0.021) | |
tentat | 0.018 | 0.018 | 0.018 |
(0.013) | (0.013) | (0.013) | |
certitude | 0.003 | 0.003 | 0.003 |
(0.015) | (0.015) | (0.015) | |
differ | 0.037 | 0.037 | 0.037 |
(0.024) | (0.024) | (0.024) | |
memory | 0.038 | 0.038 | 0.038 |
(0.058) | (0.058) | (0.058) | |
Affect | −0.074 | −0.074 | −0.074 |
(0.126) | (0.126) | (0.126) | |
tone_pos | 0.050 | 0.050 | 0.050 |
(0.128) | (0.128) | (0.128) | |
tone_neg | 0.078 | 0.078 | 0.078 |
(0.128) | (0.128) | (0.128) | |
emotion | 0.095 | 0.095 | 0.095 |
(0.120) | (0.120) | (0.120) | |
emo_pos | −0.065 | −0.065 | −0.065 |
(0.121) | (0.121) | (0.121) | |
emo_neg | −0.143 | −0.143 | −0.143 |
(0.126) | (0.126) | (0.126) | |
emo_anx | 0.038 | 0.038 | 0.038 |
(0.054) | (0.054) | (0.054) | |
emo_anger | 0.040 | 0.040 | 0.040 |
(0.048) | (0.048) | (0.048) | |
emo_sad | 0.111+ | 0.111+ | 0.111+ |
(0.057) | (0.057) | (0.057) | |
swear | −0.077 | −0.077 | −0.077 |
(0.207) | (0.207) | (0.207) | |
Social | −0.017 | −0.017 | −0.017 |
(0.012) | (0.012) | (0.012) | |
socbehav | 0.031+ | 0.031+ | 0.031+ |
(0.016) | (0.016) | (0.016) | |
prosocial | 0.011 | 0.011 | 0.011 |
(0.014) | (0.014) | (0.014) | |
polite | −0.027 | −0.027 | −0.027 |
(0.021) | (0.021) | (0.021) | |
conflict | 0.012 | 0.012 | 0.012 |
(0.021) | (0.021) | (0.021) | |
moral | 0.014 | 0.014 | 0.014 |
(0.015) | (0.015) | (0.015) | |
comm | −0.023 | −0.023 | −0.023 |
(0.015) | (0.015) | (0.015) | |
socrefs | 0.009 | 0.009 | 0.009 |
(0.020) | (0.020) | (0.020) | |
family | −0.034 | −0.034 | −0.034 |
(0.038) | (0.038) | (0.038) | |
friend | −0.135+ | −0.135+ | −0.135+ |
(0.077) | (0.077) | (0.077) | |
female | 0.025 | 0.025 | 0.025 |
(0.027) | (0.027) | (0.027) | |
male | −0.025 | −0.025 | −0.025 |
(0.026) | (0.026) | (0.026) | |
Culture | −0.396+ | −0.396+ | −0.396+ |
(0.230) | (0.230) | (0.230) | |
politic | 0.407+ | 0.407+ | 0.407+ |
(0.230) | (0.230) | (0.230) | |
ethnicity | 0.435+ | 0.435+ | 0.435+ |
(0.229) | (0.229) | (0.229) | |
tech | 0.387+ | 0.387+ | 0.387+ |
(0.231) | (0.231) | (0.231) | |
Lifestyle | −0.001 | −0.001 | −0.001 |
(0.030) | (0.030) | (0.030) | |
leisure | −0.068 | −0.068 | −0.068 |
(0.043) | (0.043) | (0.043) | |
home | 0.077 | 0.077 | 0.077 |
(0.054) | (0.054) | (0.054) | |
work | 0.015 | 0.015 | 0.015 |
(0.029) | (0.029) | (0.029) | |
money | 0.017 | 0.017 | 0.017 |
(0.026) | (0.026) | (0.026) | |
relig | 0.008 | 0.008 | 0.008 |
(0.035) | (0.035) | (0.035) | |
Physical | 0.025 | 0.025 | 0.025 |
(0.022) | (0.022) | (0.022) | |
health | 0.015 | 0.015 | 0.015 |
(0.034) | (0.034) | (0.034) | |
illness | −0.019 | −0.019 | −0.019 |
(0.043) | (0.043) | (0.043) | |
wellness | −0.006 | −0.006 | −0.006 |
(0.066) | (0.066) | (0.066) | |
mental | −0.008 | −0.008 | −0.008 |
(0.127) | (0.127) | (0.127) | |
substances | −0.017 | −0.017 | −0.017 |
(0.096) | (0.096) | (0.096) | |
sexual | −0.002 | −0.002 | −0.002 |
(0.051) | (0.051) | (0.051) | |
food | 0.033 | 0.033 | 0.033 |
(0.034) | (0.034) | (0.034) | |
death | −0.042 | −0.042 | −0.042 |
(0.042) | (0.042) | (0.042) | |
need | −0.017 | −0.017 | −0.017 |
(0.012) | (0.012) | (0.012) | |
want | −0.014 | −0.014 | −0.014 |
(0.026) | (0.026) | (0.026) | |
acquire | −0.047* | −0.047* | −0.047* |
(0.024) | (0.024) | (0.024) | |
lack | 0.076** | 0.076** | 0.076** |
(0.025) | (0.025) | (0.025) | |
fulfill | 0.020 | 0.020 | 0.020 |
(0.027) | (0.027) | (0.027) | |
fatigue | 0.488* | 0.488* | 0.488* |
(0.201) | (0.201) | (0.201) | |
reward | 0.053* | 0.053* | 0.053* |
(0.023) | (0.023) | (0.023) | |
risk | −0.005 | −0.005 | −0.005 |
(0.012) | (0.012) | (0.012) | |
curiosity | −0.053* | −0.053* | −0.053* |
(0.024) | (0.024) | (0.024) | |
allure | 0.006 | 0.006 | 0.006 |
(0.007) | (0.007) | (0.007) | |
Perception | 0.065* | 0.065* | 0.065* |
(0.028) | (0.028) | (0.028) | |
attention | −0.070* | −0.070* | −0.070* |
(0.032) | (0.032) | (0.032) | |
motion | −0.057+ | −0.057+ | −0.057+ |
(0.030) | (0.030) | (0.030) | |
space | −0.062+ | −0.062+ | −0.062+ |
(0.032) | (0.032) | (0.032) | |
visual | −0.049 | −0.049 | −0.049 |
(0.032) | (0.032) | (0.032) | |
auditory | −0.016 | −0.016 | −0.016 |
(0.054) | (0.054) | (0.054) | |
feeling | −0.072+ | −0.072+ | −0.072+ |
(0.041) | (0.041) | (0.041) | |
time | −0.012 | −0.012 | −0.012 |
(0.017) | (0.017) | (0.017) | |
focuspast | −0.006 | −0.006 | −0.006 |
(0.009) | (0.009) | (0.009) | |
focuspresent | −0.010 | −0.010 | −0.010 |
(0.009) | (0.009) | (0.009) | |
focusfuture | 0.009 | 0.009 | 0.009 |
(0.011) | (0.011) | (0.011) | |
Conversation | 0.627 | 0.627 | 0.627 |
(0.631) | (0.631) | (0.631) | |
netspeak | −0.726 | −0.726 | −0.726 |
(0.605) | (0.605) | (0.605) | |
assent | −0.641 | −0.641 | −0.641 |
(0.626) | (0.626) | (0.626) | |
nonflu | 1.015 | 1.015 | 1.015 |
(0.910) | (0.910) | (0.910) | |
filler | −5.699 | −5.699 | −5.699 |
(3.623) | (3.623) | (3.623) | |
AllPunc | 0.551 | 0.551 | 0.551 |
(0.405) | (0.405) | (0.405) | |
Period | −0.542 | −0.542 | −0.542 |
(0.406) | (0.406) | (0.406) | |
Comma | −0.556 | −0.556 | −0.556 |
(0.406) | (0.406) | (0.406) | |
QMark | −0.509 | −0.509 | −0.509 |
(0.407) | (0.407) | (0.407) | |
Exclam | −0.749+ | −0.749+ | −0.749+ |
(0.422) | (0.422) | (0.422) | |
Apostro | −0.611 | −0.611 | −0.611 |
(0.406) | (0.406) | (0.406) | |
OtherP | −0.553 | −0.553 | −0.553 |
(0.406) | (0.406) | (0.406) | |
Num.Obs. | 9646 | 9646 | 9646 |
R2 | 0.191 | 0.191 | 0.191 |
R2 Adj. | 0.159 | 0.159 | 0.159 |
AIC | 2036.9 | 2036.9 | 2036.9 |
BIC | 3385.7 | 3385.7 | 3385.7 |
RMSE | 0.26 | 0.26 | 0.26 |
Std.Errors | Custom | Custom | Custom |
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
Prediction regression and cross validation
Separate the test data from the validation set.
First, take out the test set (just a few years), then split the remainder for validation. In my dataset, I carve out observations from two years (2021 and 2022) as my test data.
Within the remaining data, I split the observations into two groups, pre- and post-Cold War, with 1990 as the threshold. I build several models on the pre-Cold War era and generate model evaluation metrics by applying them to the post-Cold War era; a consolidated sketch of this partition follows.
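A minimal sketch of the era split, assuming `liwc_df` as trimmed above (2021-2022 already held out as `validation`):

# Hedged sketch: the pre/post Cold War partition used in Experiment 4 below.
pre_cw  <- liwc_df[liwc_df$year <  1990, ]  # training era
post_cw <- liwc_df[liwc_df$year >= 1990, ]  # evaluation era
c(pre = nrow(pre_cw), post = nrow(post_cw))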
Experiment 1: Random CV
This experiment includes both country-level metadata and LIWC features. I split the data into training and testing sets randomly.
data_controls$ccode_iso <- as.factor(data_controls$ccode_iso)

# Remove the country variable from the training data
data_controls <- data_controls[, !(names(data_controls) %in% c("ccode_iso", "year", "session", "dd_regime"))]
data_controls <- data_controls[, !(names(data_controls) %in% c("democracy"))]
data_controls <- data_controls %>% dplyr::select(-Segment)

set.seed(1)
data_controls$dd_democracy <- as.factor(data_controls$dd_democracy)

split <- rsample::initial_split(data_controls, prop = 0.7)
trainN <- rsample::training(split)
testN <- rsample::testing(split)

model0 <- glm(
  formula = dd_democracy ~ .,
  family = "binomial",
  data = trainN)

model0_pred <- predict(model0, testN, type = "response")
model0_diagnosis <- as.factor(ifelse(model0_pred > 0.5, 1, 0))

# Compute confusion matrix
caret::confusionMatrix(as.factor(model0_diagnosis), as.factor(testN$dd_democracy))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 583 36
1 34 571
Accuracy : 0.9428
95% CI : (0.9283, 0.9552)
No Information Rate : 0.5041
P-Value [Acc > NIR] : <2e-16
Kappa : 0.8856
Mcnemar's Test P-Value : 0.9049
Sensitivity : 0.9449
Specificity : 0.9407
Pos Pred Value : 0.9418
Neg Pred Value : 0.9438
Prevalence : 0.5041
Detection Rate : 0.4763
Detection Prevalence : 0.5057
Balanced Accuracy : 0.9428
'Positive' Class : 0
Interpretation of the model
The summary table of the estimation results highlights the features that play an important role in predicting regime type. The plot following the table below displays the fitted model of how the linguistic feature "focusfuture" relates to the regime type outcome. There appears to be a weak but consistent positive correlation between the linguistic tendency to focus on the future and the regime type. This pattern is consistent regardless of a country's history of being a former colony.
model_summary2 <- modelsummary(model0,
                               stars = TRUE,
                               output = "kableExtra",
                               escape = FALSE)

model_summary2 %>%
  kable_classic(full_width = F, html_font = "Cambria") %>%
  scroll_box(width = "100%", height = "600px")
| | (1) |
|---|---|
(Intercept) | 56.129** |
(19.600) | |
WC | 0.000 |
(0.000) | |
Analytic | −0.177* |
(0.082) | |
Clout | 0.073 |
(0.102) | |
Authentic | 0.208 |
(0.146) | |
Tone | 0.048 |
(0.073) | |
WPS | 0.032 |
(0.021) | |
BigWords | 0.002 |
(0.136) | |
Dic | −0.356+ |
(0.214) | |
Linguistic | −0.383 |
(0.467) | |
function_feature | 0.212 |
(0.541) | |
pronoun | −23.295 |
(25.957) | |
ppron | 20.785 |
(25.931) | |
i | 1.727 |
(1.848) | |
we | 0.977 |
(1.540) | |
you | 1.116 |
(1.819) | |
shehe | 4.889* |
(2.388) | |
they | 0.385 |
(1.558) | |
ipron | 23.246 |
(25.984) | |
det | 0.856* |
(0.410) | |
article | −0.558 |
(0.531) | |
number | −0.578 |
(0.383) | |
prep | 0.188 |
(0.436) | |
auxverb | −0.518 |
(0.674) | |
adverb | −0.545 |
(0.454) | |
conj | 0.038 |
(0.435) | |
negate | −0.553 |
(0.945) | |
verb | 0.287 |
(0.540) | |
adj | 0.015 |
(0.447) | |
quantity | −0.104 |
(0.436) | |
Drives | 0.142 |
(1.353) | |
affiliation | −0.450 |
(1.378) | |
achieve | 0.702 |
(1.336) | |
power | 0.369 |
(1.361) | |
Cognition | 5.500* |
(2.315) | |
allnone | −6.289** |
(2.383) | |
cogproc | −4.793* |
(2.373) | |
insight | −1.021 |
(0.985) | |
cause | −1.302* |
(0.630) | |
discrep | 1.417 |
(1.001) | |
tentat | −0.535 |
(0.667) | |
certitude | −1.121 |
(0.703) | |
differ | −0.609 |
(1.165) | |
memory | −6.128* |
(2.688) | |
Affect | −0.022 |
(5.690) | |
tone_pos | −0.041 |
(5.839) | |
tone_neg | 0.831 |
(5.769) | |
emotion | −0.774 |
(5.514) | |
emo_pos | 0.844 |
(5.602) | |
emo_neg | 0.821 |
(5.795) | |
emo_anx | 1.537 |
(2.562) | |
emo_anger | −1.882 |
(2.438) | |
emo_sad | 0.102 |
(2.483) | |
swear | −5.161 |
(7.616) | |
Social | 0.380 |
(0.576) | |
socbehav | −0.103 |
(0.707) | |
prosocial | −1.245* |
(0.578) | |
polite | −0.159 |
(0.885) | |
conflict | −0.067 |
(0.969) | |
moral | −0.400 |
(0.737) | |
comm | −0.899 |
(0.674) | |
socrefs | 0.513 |
(0.883) | |
family | 1.600 |
(1.729) | |
friend | −1.118 |
(4.068) | |
female | −3.629*** |
(0.893) | |
male | −2.186 |
(1.438) | |
Culture | −13.922 |
(12.687) | |
politic | 13.677 |
(12.684) | |
ethnicity | 13.891 |
(12.669) | |
tech | 14.631 |
(12.729) | |
Lifestyle | −0.439 |
(1.383) | |
leisure | −0.848 |
(1.967) | |
home | 1.860 |
(2.424) | |
work | 0.167 |
(1.329) | |
money | 1.549 |
(1.183) | |
relig | 0.291 |
(1.483) | |
Physical | −0.310 |
(1.005) | |
health | 1.032 |
(1.550) | |
illness | −1.617 |
(1.698) | |
wellness | 2.602 |
(2.940) | |
mental | −7.307 |
(6.068) | |
substances | 7.125 |
(4.889) | |
sexual | 0.010 |
(1.714) | |
food | 1.069 |
(1.163) | |
death | 2.108 |
(1.767) | |
need | 0.204 |
(0.543) | |
want | 0.082 |
(1.283) | |
acquire | −0.196 |
(1.053) | |
lack | −2.447* |
(0.981) | |
fulfill | 3.340** |
(1.235) | |
fatigue | 5.842 |
(9.757) | |
reward | −1.613 |
(0.991) | |
risk | −0.493 |
(0.529) | |
curiosity | −0.879 |
(1.112) | |
allure | 0.618* |
(0.305) | |
Perception | −1.966 |
(1.222) | |
attention | 2.146 |
(1.483) | |
motion | 0.450 |
(1.414) | |
space | 0.839 |
(1.507) | |
visual | 2.029 |
(1.462) | |
auditory | −0.259 |
(2.508) | |
feeling | 1.492 |
(2.047) | |
time | −0.885 |
(0.846) | |
focuspast | 0.479 |
(0.399) | |
focuspresent | 0.187 |
(0.395) | |
focusfuture | 0.205 |
(0.534) | |
Conversation | 5.188 |
(22.259) | |
netspeak | 4.449 |
(20.407) | |
assent | −6.620 |
(21.828) | |
nonflu | 32.286 |
(48.233) | |
filler | −236.128 |
(13385.385) | |
AllPunc | 14.746 |
(20.193) | |
Period | −13.898 |
(20.179) | |
Comma | −14.805 |
(20.192) | |
QMark | −13.590 |
(20.066) | |
Exclam | −30.759 |
(22.656) | |
Apostro | −12.302 |
(20.174) | |
OtherP | −14.791 |
(20.203) | |
mid_dispute | 0.021 |
(0.159) | |
wdi_gdpcapcon2015 | 0.000*** |
(0.000) | |
wdi_gdpcapgr | 0.012 |
(0.031) | |
wdi_pop | 0.000 |
(0.000) | |
pts_ptss | −0.285 |
(0.212) | |
bmr_dem | 3.787*** |
(0.473) | |
kofgi_dr_eg | −0.156 |
(0.307) | |
kofgi_dr_ig | 0.285 |
(0.911) | |
kofgi_dr_pg | 0.010 |
(0.309) | |
kofgi_dr_sg | −0.206 |
(0.300) | |
wdi_log_gdpcapcon2015 | −0.931* |
(0.373) | |
wdi_log_pop | −1.365*** |
(0.284) | |
polity | 0.396 |
(0.358) | |
polity2 | 0.109 |
(0.287) | |
plty_xrcomp | 1.619*** |
(0.403) | |
plty_xropen | −0.274 |
(0.210) | |
plty_xconst | −0.989*** |
(0.261) | |
plty_parreg | 0.297 |
(0.279) | |
plty_parcomp | −0.994*** |
(0.219) | |
navco_num_campaign | 0.646 |
(0.472) | |
navco_campaign | −1.011 |
(0.658) | |
up_num_conflict | 0.144+ |
(0.079) | |
up_conflict | 0.090 |
(0.383) | |
up_num_war | −0.205 |
(0.188) | |
up_war | 0.589 |
(0.542) | |
v2x_polyarchy | −8.736 |
(6.993) | |
v2x_libdem | 17.852+ |
(9.112) | |
v2x_freexp_altinf | −10.215 |
(22.267) | |
v2x_frassoc_thick | 20.023*** |
(5.353) | |
v2xel_locelec | −2.467*** |
(0.678) | |
v2xel_regelec | 2.273*** |
(0.636) | |
v2mecenefm | 3.073*** |
(0.535) | |
v2mecrit | −0.454 |
(0.498) | |
v2mefemjrn | 0.092*** |
(0.023) | |
v2meharjrn | 0.944** |
(0.361) | |
v2mebias | 0.374 |
(0.460) | |
v2mecorrpt | −0.272 |
(0.261) | |
v2meslfcen | −0.201 |
(0.369) | |
v2x_accountability | 4.663** |
(1.714) | |
v2x_horacc | −0.724 |
(0.681) | |
v2x_diagacc | −6.909** |
(2.274) | |
v2xnp_regcorr | −10.914** |
(3.705) | |
v2x_civlib | 25.843 |
(83.332) | |
v2x_clphy | −8.202 |
(27.502) | |
v2x_clpol | −42.879 |
(28.508) | |
v2x_clpriv | −4.228 |
(27.149) | |
v2x_corr | 31.146*** |
(6.097) | |
v2x_pubcorr | −11.666*** |
(2.679) | |
v2jucorrdc | 1.927*** |
(0.471) | |
v2x_rule | −7.827* |
(3.936) | |
v2xcl_acjst | −2.272 |
(1.535) | |
v2xcs_ccsi | 1.947 |
(3.536) | |
v2x_freexp | 23.233+ |
(13.869) | |
v2xme_altinf | 9.918 |
(10.429) | |
v2xedvd_me_cent | 9.479** |
(3.408) | |
ht_colonial | −1.653** |
(0.518) | |
sovereignty | 1.113* |
(0.461) | |
intervention | 0.373 |
(0.496) | |
human_rights | 0.545 |
(0.352) | |
sovereignty_count | −0.300+ |
(0.178) | |
intervention_count | 0.201 |
(0.212) | |
human_rights_count | 0.086 |
(0.091) | |
Num.Obs. | 2828 |
AIC | 909.2 |
BIC | 1979.7 |
Log.Lik. | −274.593 |
RMSE | 0.16 |
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
visreg(model0, xvar = "focusfuture", by = "ht_colonial", scale = "linear")
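Since model0 is a binomial GLM, the same partial relationship may be easier to read on the probability scale; a hedged variant using visreg's scale argument:

# Plot the fitted relationship as a predicted probability instead of
# on the linear predictor scale.
visreg(model0, xvar = "focusfuture", by = "ht_colonial", scale = "response")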
Experiment 2: LIWC features only
I include linguistic style features as well as word count (WC), the raw number of words within each speech. Word count can change the distribution of linguistic features, since the pool from which words are drawn affects the count of dictionary words. Another measure is words per sentence (WPS), the average number of words per sentence in each document. Big Words (BigWords) captures the percentage of words that are 7 letters or longer, and the Dictionary (Dic) variable refers to the percentage of words that were captured by LIWC. A toy illustration of these quantities follows.
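To make these definitions concrete, here is a rough illustration (not LIWC's internal implementation) of WC, WPS, and BigWords on a toy text, using stringr:

library(stringr)

# Illustrative only: crude analogues of three LIWC descriptor variables.
toy   <- "We reaffirm our commitment. Sovereignty matters."
words <- str_split(toy, "\\s+")[[1]]
wc  <- length(words)                     # WC: raw word count
wps <- wc / str_count(toy, "[.!?]")      # WPS: average words per sentence
big <- mean(str_length(str_remove_all(words, "[[:punct:]]")) >= 7)  # BigWords share
c(WC = wc, WPS = wps, BigWords = big)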
There are four summary variables: Analytical Thinking, Clout, Authenticity, and Emotional Tone. Each is constructed from LIWC features as well as the broader psychological literature (boyd2015; cohn2004; kacewicz2014; newman2003). After standardization, these summary variables range from 1 to 99. The Analytical Thinking score increases when the speech exhibits formal, logical, and hierarchical patterns. Clout is intended to capture the degree to which the speaker displays confidence, status, and leadership. The Authenticity score increases when the speaker does not regulate their self-presentation and displays a spontaneous style. Emotional Tone reflects the overall tone of the speech, intended to show how positive or negative the speaker is.
Note that one word can be captured by several dictionaries at once. For example, the word "we" is counted toward the "affiliation", "social references", and pronoun dictionaries.
LIWC-22 has additional features such as "determiners", which includes "this", "first", and "three". It also has a "Cognition" feature that reflects different ways in which people process their thoughts, captured by dichotomous logical words, memory words, and words that reveal certainty. LIWC-22 also includes "political" as one of the three domains within the overarching "Culture" dimension; example words for the "political" construct are congress, parliament, president, democratic, law, and court. A quick comparison of the summary variables across regime types is sketched below.
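As a descriptive check before modeling, the four summary variables can be compared across regime types; a hedged sketch, assuming the `data` frame assembled in the next chunk:

library(dplyr)

# Mean summary scores by regime type (0 = non-democracy, 1 = democracy).
data %>%
  group_by(dd_democracy) %>%
  summarise(across(c(Analytic, Clout, Authentic, Tone),
                   \(x) mean(x, na.rm = TRUE)))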
data <- cbind(y, id, liwc_inputs)
data <- data[, !(names(data) %in% c("ccode_iso", "year"))]

split <- rsample::initial_split(data, prop = 0.7, strata = "dd_democracy")
trainN <- rsample::training(split)
testN <- rsample::testing(split)

model1 <- glm(
  formula = dd_democracy ~ . - Segment,
  family = "binomial",
  data = trainN)

model_summary3 <- modelsummary(model1,
                               stars = TRUE,
                               output = "kableExtra",
                               escape = FALSE)

model_summary3 %>%
  kable_classic(full_width = F, html_font = "Cambria") %>%
  scroll_box(width = "100%", height = "600px")
| | (1) |
|---|---|
(Intercept) | 6.619 |
(4.114) | |
session | −0.010* |
(0.004) | |
WC | 0.000 |
(0.000) | |
Analytic | 0.038* |
(0.017) | |
Clout | 0.065* |
(0.028) | |
Authentic | 0.053+ |
(0.031) | |
Tone | 0.071*** |
(0.017) | |
WPS | −0.011** |
(0.004) | |
BigWords | −0.128*** |
(0.029) | |
Dic | −0.060 |
(0.052) | |
Linguistic | 0.289** |
(0.102) | |
function_feature | −0.562*** |
(0.121) | |
pronoun | 11.820+ |
(6.091) | |
ppron | −12.310* |
(6.098) | |
i | 1.766*** |
(0.423) | |
we | 0.033 |
(0.372) | |
you | −1.026* |
(0.426) | |
shehe | 0.568 |
(0.495) | |
they | −0.683+ |
(0.352) | |
ipron | −11.448+ |
(6.089) | |
det | −0.110 |
(0.079) | |
article | 0.205+ |
(0.110) | |
number | −0.195* |
(0.087) | |
prep | 0.119 |
(0.102) | |
auxverb | 0.404** |
(0.154) | |
adverb | 0.290** |
(0.101) | |
conj | 0.058 |
(0.098) | |
negate | 0.799*** |
(0.239) | |
verb | −0.167 |
(0.124) | |
adj | −0.229* |
(0.097) | |
quantity | 0.363*** |
(0.092) | |
Drives | −1.338*** |
(0.293) | |
affiliation | 1.255*** |
(0.300) | |
achieve | 1.382*** |
(0.285) | |
power | 0.996*** |
(0.291) | |
Cognition | −0.635 |
(0.540) | |
allnone | −0.367 |
(0.555) | |
cogproc | 0.303 |
(0.555) | |
insight | 0.016 |
(0.224) | |
cause | 0.373** |
(0.139) | |
discrep | 0.763** |
(0.232) | |
tentat | 0.223 |
(0.136) | |
certitude | 0.359* |
(0.158) | |
differ | 0.731** |
(0.267) | |
memory | 1.591* |
(0.640) | |
Affect | −2.393+ |
(1.348) | |
tone_pos | 1.397 |
(1.370) | |
tone_neg | 2.970* |
(1.374) | |
emotion | 1.919 |
(1.284) | |
emo_pos | −2.717* |
(1.296) | |
emo_neg | −2.312+ |
(1.355) | |
emo_anx | 1.713** |
(0.587) | |
emo_anger | −0.844 |
(0.533) | |
emo_sad | 0.968 |
(0.615) | |
swear | 0.868 |
(2.201) | |
Social | −0.032 |
(0.129) | |
socbehav | 0.638*** |
(0.174) | |
prosocial | −0.378* |
(0.148) | |
polite | −1.031*** |
(0.216) | |
conflict | 0.020 |
(0.230) | |
moral | 0.200 |
(0.163) | |
comm | −0.117 |
(0.166) | |
socrefs | 0.385+ |
(0.220) | |
family | −0.813+ |
(0.429) | |
friend | −3.284*** |
(0.871) | |
female | 0.702* |
(0.331) | |
male | −1.222*** |
(0.278) | |
Culture | −8.330** |
(2.900) | |
politic | 8.428** |
(2.898) | |
ethnicity | 7.820** |
(2.893) | |
tech | 9.105** |
(2.907) | |
Lifestyle | −1.138*** |
(0.325) | |
leisure | 0.290 |
(0.464) | |
home | 1.943*** |
(0.573) | |
work | 1.174*** |
(0.315) | |
money | 1.082*** |
(0.283) | |
relig | 1.132** |
(0.362) | |
Physical | 0.905*** |
(0.235) | |
health | 0.041 |
(0.376) | |
illness | −1.123** |
(0.422) | |
wellness | 1.069 |
(0.721) | |
mental | −0.337 |
(1.410) | |
substances | −0.242 |
(1.090) | |
sexual | −0.057 |
(0.522) | |
food | −1.244** |
(0.381) | |
death | 0.553 |
(0.452) | |
need | 0.378** |
(0.132) | |
want | −0.242 |
(0.283) | |
acquire | −0.708** |
(0.261) | |
lack | −0.274 |
(0.267) | |
fulfill | 0.456 |
(0.289) | |
fatigue | 7.515** |
(2.285) | |
reward | −0.151 |
(0.246) | |
risk | 0.558*** |
(0.130) | |
curiosity | 0.081 |
(0.260) | |
allure | −0.121+ |
(0.071) | |
Perception | 0.390 |
(0.304) | |
attention | −0.892* |
(0.349) | |
motion | −0.646+ |
(0.339) | |
space | −0.717* |
(0.355) | |
visual | −0.274 |
(0.352) | |
auditory | −0.725 |
(0.588) | |
feeling | −0.853+ |
(0.440) | |
time | −0.161 |
(0.193) | |
focuspast | 0.078 |
(0.092) | |
focuspresent | −0.037 |
(0.093) | |
focusfuture | 0.622*** |
(0.121) | |
Conversation | 0.859 |
(6.528) | |
netspeak | 0.807 |
(6.310) | |
assent | −0.668 |
(6.462) | |
nonflu | −4.715 |
(10.425) | |
filler | 30.012 |
(36.872) | |
AllPunc | −2.234 |
(4.438) | |
Period | 2.421 |
(4.438) | |
Comma | 2.105 |
(4.439) | |
QMark | 1.770 |
(4.449) | |
Exclam | 0.084 |
(4.595) | |
Apostro | 1.554 |
(4.440) | |
OtherP | 2.193 |
(4.438) | |
Num.Obs. | 6753 |
AIC | 6770.1 |
BIC | 7581.4 |
Log.Lik. | −3266.039 |
F | 13.352 |
RMSE | 0.40 |
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
model1_pred <- predict(model1, testN, type = "response")
model1_diagnosis <- as.factor(ifelse(model1_pred > 0.5, 1, 0))

# Compute confusion matrix
caret::confusionMatrix(as.factor(model1_diagnosis), as.factor(testN$dd_democracy))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 1102 374
1 358 1059
Accuracy : 0.747
95% CI : (0.7307, 0.7627)
No Information Rate : 0.5047
P-Value [Acc > NIR] : <2e-16
Kappa : 0.4939
Mcnemar's Test P-Value : 0.5793
Sensitivity : 0.7548
Specificity : 0.7390
Pos Pred Value : 0.7466
Neg Pred Value : 0.7474
Prevalence : 0.5047
Detection Rate : 0.3809
Detection Prevalence : 0.5102
Balanced Accuracy : 0.7469
'Positive' Class : 0
# sensitivity is the true positive rate
# specificity is the true negative rate
ROC <- roc(response = testN$dd_democracy,
           predictor = model1_pred,
           levels = c(1, 0))
Setting direction: controls > cases
plot(ROC, col = "blue", main = "ROC Curve for Democracy Classifier")
aucValue <- auc(ROC)
print(paste("AUC:", aucValue))
[1] "AUC: 0.824289975049948"
text(x = 0.6, y = 0.3, label = paste("AUC =", round(aucValue, 3)),
cex = 1.2, col = "blue")
Experiment 2b: LIWC features only with a smaller dataset
When I train the model on only a small training set (100 observations) and test it against the much larger test set, the model performs poorly, with about 54% accuracy, as expected.
small_trainN <- data %>%
  sample_n(size = 100, replace = FALSE)

model1b <- glm(
  formula = dd_democracy ~ .,
  family = "binomial",
  data = small_trainN)

model1b_pred <- predict(model1b, testN, type = "response")
Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases
model1b_diagnosis <- as.factor(ifelse(model1b_pred > 0.5, 1, 0))

# Compute confusion matrix
caret::confusionMatrix(as.factor(model1b_diagnosis), as.factor(testN$dd_democracy))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 725 584
1 735 849
Accuracy : 0.5441
95% CI : (0.5257, 0.5623)
No Information Rate : 0.5047
P-Value [Acc > NIR] : 1.202e-05
Kappa : 0.089
Mcnemar's Test P-Value : 3.625e-05
Sensitivity : 0.4966
Specificity : 0.5925
Pos Pred Value : 0.5539
Neg Pred Value : 0.5360
Prevalence : 0.5047
Detection Rate : 0.2506
Detection Prevalence : 0.4525
Balanced Accuracy : 0.5445
'Positive' Class : 0
ROC <- roc(response = testN$dd_democracy,
           predictor = model1b_pred,
           levels = c(1, 0))
Setting direction: controls < cases
plot(ROC, col = "blue", main = "ROC Curve for Democracy Classifier")
aucValue <- auc(ROC)
print(paste("AUC:", aucValue))
[1] "AUC: 0.459668384173446"
text(x = 0.6, y = 0.3, label = paste("AUC =", round(aucValue, 3)),
cex = 1.2, col = "blue")
Experiment 3: Country-level metadata only
I train the model using country-year level metadata alone. A caveat here is that some of the input features from the V-Dem dataset are highly correlated with the output variable, dd_democracy. Accordingly, this model shows a very high accuracy of 99.6%; a quick correlation check is sketched below.
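A hedged sketch of that caveat: correlating a few of the V-Dem and regime inputs (column names taken from the coefficient table in Experiment 1) with the outcome directly:

# Near-perfect correlations here would make the very high accuracy unsurprising.
sapply(c("v2x_polyarchy", "v2x_libdem", "bmr_dem"), function(v)
  cor(controls_only[[v]],
      as.numeric(as.character(controls_only$dd_democracy)),
      use = "complete.obs"))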
# Remove the country variable from the training data
controls_only <- controls_only[, !(names(controls_only) %in% c("ccode_iso", "year"))]

set.seed(2)
split <- rsample::initial_split(controls_only, prop = 0.7, strata = "dd_democracy")
trainN <- rsample::training(split)
testN <- rsample::testing(split)

model3 <- glm(
  formula = dd_democracy ~ .,
  family = "binomial",
  data = trainN)
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
model3_pred <- predict(model3, testN, type = "response")
model3_diagnosis <- as.factor(ifelse(model3_pred > 0.5, 1, 0))

# Compute confusion matrix
caret::confusionMatrix(as.factor(model3_diagnosis), as.factor(testN$dd_democracy))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 601 4
1 1 581
Accuracy : 0.9958
95% CI : (0.9902, 0.9986)
No Information Rate : 0.5072
P-Value [Acc > NIR] : <2e-16
Kappa : 0.9916
Mcnemar's Test P-Value : 0.3711
Sensitivity : 0.9983
Specificity : 0.9932
Pos Pred Value : 0.9934
Neg Pred Value : 0.9983
Prevalence : 0.5072
Detection Rate : 0.5063
Detection Prevalence : 0.5097
Balanced Accuracy : 0.9958
'Positive' Class : 0
# sensitivity is the true positive rate
# specificity is the true negative rate
ROC <- roc(response = testN$dd_democracy,
           predictor = model3_pred,
           levels = c(1, 0))
Setting direction: controls > cases
plot(ROC, col = "blue", main = "ROC Curve for Democracy Classifier")
aucValue <- auc(ROC)
print(paste("AUC:", aucValue))
[1] "AUC: 0.999900616179686"
text(x = 0.6, y = 0.3, label = paste("AUC =", round(aucValue, 3)),
cex = 1.2, col = "blue")
Experiment 4: Training on pre-Cold War, testing on post-Cold War
This is a harder test. I train the model using LIWC features alone on the pre-Cold War era and test its performance on the post-Cold War era.
Accuracy : 0.6989
No Information Rate : 0.5867
P-Value [Acc > NIR] : < 2.2e-16
Sensitivity : 0.5752
Specificity : 0.7860
The number of democracies changed after the Cold War: the model under-predicts democracy because it has not seen that many democracies in training. The class-balance shift is sketched below.
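A hedged sketch of that shift, assuming `data` as rebuilt in the next chunk (with `year` still attached):

# Share of democracies before vs. after 1990.
prop.table(table(data$dd_democracy[data$year < 1990]))
prop.table(table(data$dd_democracy[data$year >= 1990]))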
data <- cbind(y, id, liwc_inputs)
pre <- data[data$year < 1990, ]    # dimension 4430 x 121
post <- data[data$year >= 1990, ]  # dimension 6138 x 121

# Remove the country variable from the training data
pre <- pre[, !(names(pre) %in% c("ccode_iso", "year", "democracy"))]
post <- post[, !(names(post) %in% c("ccode_iso", "year", "democracy"))]

model4 <- glm(
  formula = dd_democracy ~ .,
  family = "binomial",
  data = pre)

model4_pred <- predict(model4, post, type = "response")
Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases
model4_diagnosis <- as.factor(ifelse(model4_pred > 0.5, 1, 0))

# Compute confusion matrix
caret::confusionMatrix(as.factor(model4_diagnosis), as.factor(post$dd_democracy))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 1336 676
1 985 2619
Accuracy : 0.7042
95% CI : (0.6921, 0.7162)
No Information Rate : 0.5867
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.3779
Mcnemar's Test P-Value : 4.116e-14
Sensitivity : 0.5756
Specificity : 0.7948
Pos Pred Value : 0.6640
Neg Pred Value : 0.7267
Prevalence : 0.4133
Detection Rate : 0.2379
Detection Prevalence : 0.3583
Balanced Accuracy : 0.6852
'Positive' Class : 0
# sensitivity is the true positive rate
# specificity is the true negative rate
ROC <- roc(response = post$dd_democracy,
           predictor = model4_pred,
           levels = c(1, 0))
Setting direction: controls > cases
plot(ROC, col = "blue", main = "ROC Curve for Democracy Classifier")
aucValue <- auc(ROC)
print(paste("AUC:", aucValue))
[1] "AUC: 0.753212961552468"
text(x = 0.6, y = 0.3, label = paste("AUC =", round(aucValue, 3)),
cex = 1.2, col = "blue")
Experiment 4b: Training on post-Cold War, testing on pre-Cold War
This is another hard test, the reverse of Experiment 4. Instead of predicting later outcomes from earlier observations, I train the model using LIWC features alone on the post-Cold War era and see if it can correctly predict the past, the pre-Cold War era.
Accuracy : 0.599
95% CI : (0.5837, 0.6142)
No Information Rate : 0.6357
P-Value [Acc > NIR] : 1
Sensitivity : 0.4602
Specificity : 0.8413
model4b <- glm(
  formula = dd_democracy ~ .,
  family = "binomial",
  data = post)

model4b_pred <- predict(model4b, pre, type = "response")
Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases
model4b_diagnosis <- as.factor(ifelse(model4b_pred > 0.5, 1, 0))

# Compute confusion matrix
caret::confusionMatrix(as.factor(model4b_diagnosis), as.factor(pre$dd_democracy))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 1169 231
1 1393 1237
Accuracy : 0.597
95% CI : (0.5817, 0.6122)
No Information Rate : 0.6357
P-Value [Acc > NIR] : 1
Kappa : 0.2557
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.4563
Specificity : 0.8426
Pos Pred Value : 0.8350
Neg Pred Value : 0.4703
Prevalence : 0.6357
Detection Rate : 0.2901
Detection Prevalence : 0.3474
Balanced Accuracy : 0.6495
'Positive' Class : 0
# sensitivity is the true positive rate
# specificity is the true negative rate
ROC <- roc(response = pre$dd_democracy,
           predictor = model4b_pred,
           levels = c(1, 0))
Setting direction: controls > cases
plot(ROC, col = "blue", main = "ROC Curve for Democracy Classifier")
aucValue <- auc(ROC)
print(paste("AUC:", aucValue))
[1] "AUC: 0.718717761370864"
text(x = 0.6, y = 0.3, label = paste("AUC =", round(aucValue, 3)),
cex = 1.2, col = "blue")
Experiment 5: Splitting based on English-speaking countries
I separate the countries into two groups, English-speaking vs. non-English-speaking, since translation may act as a confounder.
- First, I load the metadata on which language each leader chose. "Speeches are typically delivered in the native language. Based on the rules of the Assembly, all statements are then translated by UN staff into the six official languages of the UN. If a speech was delivered in a language other than English, Baturo et al. (2017) used the official English version provided by the UN. Therefore, all of the speeches in the UNGDC are in English."
language <- read_excel("data/raw/language.xlsx")

language <- language %>%
  select(year = "Year", ccode_iso = "ISO Code", lang = "Language")

language <- language %>%
  mutate(eng = ifelse(
    is.na(lang), NA, ifelse(
      lang %in% c("English"), 1, 0))) %>%
  select(-lang) %>%
  filter(!is.na(eng))

# baseline data includes liwc_inputs only
data <- cbind(y, id, liwc_inputs)

data <- data %>%
  inner_join(language, by = c("ccode_iso", "year"))
Train on non-English speeches and test against English speeches
- Among 7270 speeches from 1970 until 2014, 99 were delivered in English, 3628 in other languages, and 3543 had missing language information. Given the small number of samples with complete language information after splitting into training and testing datasets, model5 performs poorly compared to the previous tests.
eng <- data[data$eng == 1, ]
non_eng <- data[data$eng == 0, ]

# Remove the country variable from the training data
non_eng <- non_eng[, !(names(non_eng) %in% c("ccode_iso", "year", "democracy"))]
eng <- eng[, !(names(eng) %in% c("ccode_iso", "year", "democracy"))]

model5 <- glm(
  formula = dd_democracy ~ .,
  family = "binomial",
  data = non_eng)

model5_pred <- predict(model5, eng, type = "response")
Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases
model5_diagnosis <- as.factor(ifelse(model5_pred > 0.5, 1, 0))

# Compute confusion matrix
caret::confusionMatrix(as.factor(model5_diagnosis), as.factor(eng$dd_democracy))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 16 20
1 14 48
Accuracy : 0.6531
95% CI : (0.5502, 0.7464)
No Information Rate : 0.6939
P-Value [Acc > NIR] : 0.8382
Kappa : 0.2266
Mcnemar's Test P-Value : 0.3912
Sensitivity : 0.5333
Specificity : 0.7059
Pos Pred Value : 0.4444
Neg Pred Value : 0.7742
Prevalence : 0.3061
Detection Rate : 0.1633
Detection Prevalence : 0.3673
Balanced Accuracy : 0.6196
'Positive' Class : 0
# sensitivity is the true positive rate
# specificity is the true negative rate
ROC <- roc(response = eng$dd_democracy,
           predictor = model5_pred,
           levels = c(1, 0))
Setting direction: controls > cases
plot(ROC, col = "blue", main = "ROC Curve for Democracy Classifier")
aucValue <- auc(ROC)
print(paste("AUC:", aucValue))
[1] "AUC: 0.672058823529412"
text(x = 0.6, y = 0.3, label = paste("AUC =", round(aucValue, 3)),
cex = 1.2, col = "blue")
Experiment 5b: Train on English speeches and test on non-English speeches
Model 5b performs even worse than model5, due to the small number of samples on which it is trained.
model5b <- glm(
  formula = dd_democracy ~ .,
  family = "binomial",
  data = eng)

model5b_pred <- predict(model5b, non_eng, type = "response")
Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases
model5b_diagnosis <- as.factor(ifelse(model5b_pred > 0.5, 1, 0))

# Compute confusion matrix
caret::confusionMatrix(as.factor(model5b_diagnosis), as.factor(non_eng$dd_democracy))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 1195 801
1 975 492
Accuracy : 0.4871
95% CI : (0.4704, 0.5039)
No Information Rate : 0.6266
P-Value [Acc > NIR] : 1
Kappa : -0.067
Mcnemar's Test P-Value : 4.041e-05
Sensitivity : 0.5507
Specificity : 0.3805
Pos Pred Value : 0.5987
Neg Pred Value : 0.3354
Prevalence : 0.6266
Detection Rate : 0.3451
Detection Prevalence : 0.5764
Balanced Accuracy : 0.4656
'Positive' Class : 0
# sensitivity is the true positive rate
# specificity is the true negative rate
ROC <- roc(response = non_eng$dd_democracy,
           predictor = model5b_pred,
           levels = c(1, 0))
Setting direction: controls < cases
plot(ROC, col = "blue", main = "ROC Curve for Democracy Classifier")
aucValue <- auc(ROC)
print(paste("AUC:", aucValue))
[1] "AUC: 0.534557222335083"
text(x = 0.6, y = 0.3, label = paste("AUC =", round(aucValue, 3)),
cex = 1.2, col = "blue")
#save.image(file='exp5.RData')
Experiment 6: Training across different continents.
I split the data along geographical groups. There are two options, region and continent, both defined by the World Bank Development Indicators. Experiment 6 is based on the continent categories: Africa, Americas, Asia, Europe, and Oceania.
I use a leave-one-out validation method: each continent is tested with a model trained on the other four. The tradeoff between specificity and sensitivity scores is contingent on the proportion of democracies and non-democracies observed in the testing dataset. For example, sensitivity is 0.44 while specificity is 0.98 when testing against Europe after training the model on the other four continents: the proportion of democracies relative to non-democracies is higher in Europe than in the other continents, leading to a high number of false negatives. The same pattern of higher specificity than sensitivity appears when testing on the Americas. Testing on Africa showed a balance between specificity and sensitivity, while testing on Asia showed higher sensitivity than specificity.
model_summaries <- list()

data_controls <- cbind(y, id, liwc_inputs, controls)

data_controls <- data_controls %>%
  mutate(region = countrycode(ccode_iso, origin = "iso3c", destination = "region")) %>%
  mutate(continent = countrycode(ccode_iso, origin = "iso3c", destination = "continent"))
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `region = countrycode(ccode_iso, origin = "iso3c", destination =
"region")`.
Caused by warning:
! Some values were not matched unambiguously: CSK, CZK, DDR, EU, YMD, YUG
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `continent = countrycode(ccode_iso, origin = "iso3c",
destination = "continent")`.
Caused by warning:
! Some values were not matched unambiguously: CSK, CZK, DDR, EU, YMD, YUG
data_controls$continent <- as.factor(data_controls$continent)
data_controls <- data_controls[!is.na(data_controls$continent), ]

# Split data into 5 continent categories
continent_data <- list()

# Iterate through each continent
for (cont in unique(data_controls$continent)) {
  continent_subset <- data_controls[data_controls$continent == cont, ]
  continent_subset <- continent_subset[!is.na(continent_subset$dd_democracy), ]

  continent_data[[cont]] <- continent_subset

  positive_cases <- sum(continent_subset$dd_democracy == 1)
  negative_cases <- sum(continent_subset$dd_democracy == 0)
  cat("Continent:", cont, "\n")
  cat("Positive Cases:", positive_cases, "\n")
  cat("Negative Cases:", negative_cases, "\n\n")
}
Continent: Asia
Positive Cases: 728
Negative Cases: 1684
Continent: Africa
Positive Cases: 548
Negative Cases: 2227
Continent: Europe
Positive Cases: 1574
Negative Cases: 351
Continent: Americas
Positive Cases: 1495
Negative Cases: 535
Continent: Oceania
Positive Cases: 418
Negative Cases: 86
# Iterate through each continent
for (cont in unique(data_controls$continent)) {
  # Create training and testing data based on the current continent
  train_data <- data_controls[data_controls$continent != cont, ]
  test_data <- data_controls[data_controls$continent == cont, ]
  train_data <- train_data[, !(names(train_data) %in% c("ccode_iso", "continent", "region"))]
  test_data <- test_data[, !(names(test_data) %in% c("ccode_iso", "continent", "region"))]

  # Train model
  model <- glm(
    formula = dd_democracy ~ .,
    family = "binomial",
    data = train_data
  )

  # Make predictions on test data
  model_pred <- predict(model, newdata = test_data, type = "response")

  # Compute evaluation metrics
  confusion_matrix <- caret::confusionMatrix(
    as.factor(ifelse(model_pred > 0.5, 1, 0)),
    as.factor(test_data$dd_democracy)
  )

  cat("Continent:", cont, "\n")

  # Print confusion matrix
  print(confusion_matrix)

  # Save model summary
  model_summary <- summary(model)

  # Add continent information to the model summary
  model_summary$continent <- cont

  # Append the model summary to the list
  model_summaries[[cont]] <- model_summary
}
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Continent: Asia
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 579 43
1 54 440
Accuracy : 0.9131
95% CI : (0.895, 0.929)
No Information Rate : 0.5672
P-Value [Acc > NIR] : <2e-16
Kappa : 0.8234
Mcnemar's Test P-Value : 0.3099
Sensitivity : 0.9147
Specificity : 0.9110
Pos Pred Value : 0.9309
Neg Pred Value : 0.8907
Prevalence : 0.5672
Detection Rate : 0.5188
Detection Prevalence : 0.5573
Balanced Accuracy : 0.9128
'Positive' Class : 0
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Continent: Africa
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 1127 45
1 37 269
Accuracy : 0.9445
95% CI : (0.9316, 0.9556)
No Information Rate : 0.7876
P-Value [Acc > NIR] : <2e-16
Kappa : 0.8326
Mcnemar's Test P-Value : 0.4395
Sensitivity : 0.9682
Specificity : 0.8567
Pos Pred Value : 0.9616
Neg Pred Value : 0.8791
Prevalence : 0.7876
Detection Rate : 0.7625
Detection Prevalence : 0.7930
Balanced Accuracy : 0.9125
'Positive' Class : 0
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Continent: Europe
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 59 1
1 3 568
Accuracy : 0.9937
95% CI : (0.9838, 0.9983)
No Information Rate : 0.9017
P-Value [Acc > NIR] : <2e-16
Kappa : 0.9637
Mcnemar's Test P-Value : 0.6171
Sensitivity : 0.95161
Specificity : 0.99824
Pos Pred Value : 0.98333
Neg Pred Value : 0.99475
Prevalence : 0.09826
Detection Rate : 0.09350
Detection Prevalence : 0.09509
Balanced Accuracy : 0.97493
'Positive' Class : 0
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Continent: Americas
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 174 44
1 3 566
Accuracy : 0.9403
95% CI : (0.9214, 0.9558)
No Information Rate : 0.7751
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.8417
Mcnemar's Test P-Value : 5.392e-09
Sensitivity : 0.9831
Specificity : 0.9279
Pos Pred Value : 0.7982
Neg Pred Value : 0.9947
Prevalence : 0.2249
Detection Rate : 0.2211
Detection Prevalence : 0.2770
Balanced Accuracy : 0.9555
'Positive' Class : 0
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning in confusionMatrix.default(as.factor(ifelse(model_pred > 0.5, 1, :
Levels are not in the same order for reference and data. Refactoring data to
match.
Continent: Oceania
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 0 0
1 0 40
Accuracy : 1
95% CI : (0.9119, 1)
No Information Rate : 1
P-Value [Acc > NIR] : 1
Kappa : NaN
Mcnemar's Test P-Value : NA
Sensitivity : NA
Specificity : 1
Pos Pred Value : NA
Neg Pred Value : NA
Prevalence : 0
Detection Rate : 0
Detection Prevalence : 0
Balanced Accuracy : NA
'Positive' Class : 0
Experiment 7: Training across different continents with LIWC features alone.
I rerun the same experiment excluding country-level metadata to check the performance of LIWC features as standalone predictors.
model_summaries <- list()

data_controls <- cbind(y, id, liwc_inputs)

data_controls <- data_controls %>%
  mutate(region = countrycode(ccode_iso, origin = "iso3c", destination = "region")) %>%
  mutate(continent = countrycode(ccode_iso, origin = "iso3c", destination = "continent"))
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `region = countrycode(ccode_iso, origin = "iso3c", destination =
"region")`.
Caused by warning:
! Some values were not matched unambiguously: CSK, CZK, DDR, EU, YMD, YUG
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `continent = countrycode(ccode_iso, origin = "iso3c",
destination = "continent")`.
Caused by warning:
! Some values were not matched unambiguously: CSK, CZK, DDR, EU, YMD, YUG
data_controls$continent <- as.factor(data_controls$continent)
data_controls <- data_controls[!is.na(data_controls$continent), ]

# Split data into 5 continent categories
continent_data <- list()

# Iterate through each continent
for (cont in unique(data_controls$continent)) {
  continent_subset <- data_controls[data_controls$continent == cont, ]
  continent_subset <- continent_subset[!is.na(continent_subset$dd_democracy), ]

  continent_data[[cont]] <- continent_subset

  positive_cases <- sum(continent_subset$dd_democracy == 1)
  negative_cases <- sum(continent_subset$dd_democracy == 0)
  cat("Continent:", cont, "\n")
  cat("Positive Cases:", positive_cases, "\n")
  cat("Negative Cases:", negative_cases, "\n\n")
}
Continent: Asia
Positive Cases: 728
Negative Cases: 1684
Continent: Africa
Positive Cases: 548
Negative Cases: 2227
Continent: Europe
Positive Cases: 1574
Negative Cases: 351
Continent: Americas
Positive Cases: 1495
Negative Cases: 535
Continent: Oceania
Positive Cases: 418
Negative Cases: 86
# Iterate through each continent
for (cont in unique(data_controls$continent)) {
  # Create training and testing data based on the current continent
  train_data <- data_controls[data_controls$continent != cont, ]
  test_data <- data_controls[data_controls$continent == cont, ]
  train_data <- train_data[, !(names(train_data) %in% c("ccode_iso", "continent", "region"))]
  test_data <- test_data[, !(names(test_data) %in% c("ccode_iso", "continent", "region"))]

  # Train model
  model <- glm(
    formula = dd_democracy ~ .,
    family = "binomial",
    data = train_data
  )

  # Make predictions on test data
  model_pred <- predict(model, newdata = test_data, type = "response")

  # Compute evaluation metrics
  confusion_matrix <- caret::confusionMatrix(
    as.factor(ifelse(model_pred > 0.5, 1, 0)),
    as.factor(test_data$dd_democracy)
  )

  cat("Continent:", cont, "\n")

  # Print confusion matrix
  print(confusion_matrix)

  # Save model summary
  model_summary <- summary(model)

  # Add continent information to the model summary
  model_summary$continent <- cont

  # Append the model summary to the list
  model_summaries[[cont]] <- model_summary
}
Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases
Continent: Asia
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 1090 201
1 594 527
Accuracy : 0.6704
95% CI : (0.6512, 0.6891)
No Information Rate : 0.6982
P-Value [Acc > NIR] : 0.9985
Kappa : 0.3219
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.6473
Specificity : 0.7239
Pos Pred Value : 0.8443
Neg Pred Value : 0.4701
Prevalence : 0.6982
Detection Rate : 0.4519
Detection Prevalence : 0.5352
Balanced Accuracy : 0.6856
'Positive' Class : 0
Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases
Continent: Africa
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 1207 146
1 1020 402
Accuracy : 0.5798
95% CI : (0.5612, 0.5983)
No Information Rate : 0.8025
P-Value [Acc > NIR] : 1
Kappa : 0.1721
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.5420
Specificity : 0.7336
Pos Pred Value : 0.8921
Neg Pred Value : 0.2827
Prevalence : 0.8025
Detection Rate : 0.4350
Detection Prevalence : 0.4876
Balanced Accuracy : 0.6378
'Positive' Class : 0
Continent: Europe
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 301 638
1 50 936
Accuracy : 0.6426
95% CI : (0.6207, 0.664)
No Information Rate : 0.8177
P-Value [Acc > NIR] : 1
Kappa : 0.2739
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.8575
Specificity : 0.5947
Pos Pred Value : 0.3206
Neg Pred Value : 0.9493
Prevalence : 0.1823
Detection Rate : 0.1564
Detection Prevalence : 0.4878
Balanced Accuracy : 0.7261
'Positive' Class : 0
Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases
Continent: Americas
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 445 837
1 90 658
Accuracy : 0.5433
95% CI : (0.5214, 0.5652)
No Information Rate : 0.7365
P-Value [Acc > NIR] : 1
Kappa : 0.1877
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.8318
Specificity : 0.4401
Pos Pred Value : 0.3471
Neg Pred Value : 0.8797
Prevalence : 0.2635
Detection Rate : 0.2192
Detection Prevalence : 0.6315
Balanced Accuracy : 0.6360
'Positive' Class : 0
Continent: Oceania
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 25 61
1 61 357
Accuracy : 0.7579
95% CI : (0.7181, 0.7947)
No Information Rate : 0.8294
P-Value [Acc > NIR] : 1
Kappa : 0.1448
Mcnemar's Test P-Value : 1
Sensitivity : 0.2907
Specificity : 0.8541
Pos Pred Value : 0.2907
Neg Pred Value : 0.8541
Prevalence : 0.1706
Detection Rate : 0.0496
Detection Prevalence : 0.1706
Balanced Accuracy : 0.5724
'Positive' Class : 0
LIWC on no_stop_words
"~/Desktop/UNGDC/data/liwc/no_stop_words_analyzed.csv"
# Logit Boost
library(caTools)

# Separate the label from the features; LogitBoost() takes them as two
# arguments. (Non-numeric columns should be dropped before boosting.)
Label <- data_controls$dd_democracy
Data  <- data_controls[, names(data_controls) != "dd_democracy"]

model <- LogitBoost(Data, Label, nIter = 20)
Lab   <- predict(model, Data)
Prob  <- predict(model, Data, type = "raw")
t     <- cbind(Lab, Prob)
t[1:10, ]

# two alternative call syntaxes
p <- predict(model, Data)
q <- predict.LogitBoost(model, Data)
pp <- p[!is.na(p)]; qq <- q[!is.na(q)]
stopifnot(pp == qq)
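The calls above score LogitBoost on its own training data; a minimal out-of-sample sketch using the same `Data` and `Label` objects (the split fraction and seed are arbitrary):

# Hedged sketch: hold out 30% of rows and score accuracy out of sample.
set.seed(3)
idx       <- sample(nrow(Data), size = floor(0.7 * nrow(Data)))
model_oos <- LogitBoost(Data[idx, ], Label[idx], nIter = 20)
pred      <- predict(model_oos, Data[-idx, ])
mean(pred == Label[-idx], na.rm = TRUE)  # out-of-sample accuracy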