Love, Language, and Linear Algebra: Linguistic Modeling of Personality and Mate Preference

This study used Latent Semantic Analysis (Landauer & Dumais, 1997) to determine whether similarities in personality predicted similarities in responses to a romantic writing prompt. From participants’ writing samples, we calculated thematic cosines (a measure of relatedness) between each male and female participant. Participants also completed the Big Five Personality Trait Short Questionnaire (Morizot, 2014). Extraversion, agreeableness, and conscientiousness were related to cosines, which suggested small-to-medium relationships between personality traits and written responses. This relationship was consistent with previous studies; therefore, Latent Semantic Analysis may be useful in quantifying mate preference, especially when used alongside traditional survey methods. We conclude with a discussion of the compatibility of ordinal measures (survey data) and continuous measures in examining complex phenomena in the behavioral sciences.


Love, Language, and Linear Algebra: Linguistic Modeling of Personality and Mate Preference
Sexual and romantic desirability are vital in forming a basic unit of human culture, the mated pair. Through natural selection, general preferences for certain traits, such as intelligence and physique, have shaped our evolution as a species. Mate preference, an individual's abstract set of desirable traits in a mate, defines many cultural phenomena. In many ways, our similarities shape our behavioral and biological identities. As an example, Thornhill and Gangestad (1994) showed that fluctuating asymmetry (deviations from left-right symmetry across the body) was negatively correlated with the number of sexual partners, which is related to mate preference and mating choices. Also, general evolutionary theories, such as runaway selection (strong preferences for expression traits, such as the coloring of male peacocks, that override natural selection of adaptive traits), are often used to explain cognitive advancements in our hominid ancestors (Miller, 2000).
Mate preference also influences our social roles and environments. For example, the literature has suggested that men value attractiveness more than women do in survey-based research paradigms. Feingold (1990) observed this specific sex difference in a meta-analysis of 28 separate samples of American females and males. Feingold also compared these meta-analytic results with linguistic analyses of personal ads and billboards targeted specifically at men or women and found similar differences, which suggests that survey-based results often correspond to real-world sex differences in mate preference. Interestingly, he noted that advertisements targeting men focused on attractive female partners more than advertisements targeting women did.
In a far-reaching cross-cultural study, Buss (1989) examined sex differences in mate preference across 37 samples from 33 countries. To compare mate preference and sex differences across cultures, Buss administered a three-part survey. The first section asked for participants' demographic information (age, sex, religion, and family background). The second section asked participants for their ideal age to marry, their preferred age difference from a potential spouse, and how many children they desired. The final section asked participants to rate 18 characteristics (e.g., sociability, intelligence, chastity) on how important they were in choosing a potential romantic partner. Strikingly, Buss found that sex differences in mate preference were almost entirely homogeneous across cultures. Examples included a higher preference among women for financially stable partners and a higher preference among men for younger female partners.
Within the same study, Buss (1989) also checked census data from each country to determine how mate preference influenced mate choices. As an example, in every culture studied, census data showed an age gap of approximately three years between older men and younger women. This dovetailed neatly with the second survey section, which assessed participants' ideal age difference from a potential mate. Yet age differences are easily measured, external variables. Moreover, as stated by Buss, age differences were the most statistically reliable findings in his study, while other variables, such as previous sexual experience, showed weaker effects across different cultures. Taken together, the research of Buss (1989) and Feingold (1990) suggests that mate preference is a valid cognitive construct in multiple cultures and paradigms. Moreover, certain sex differences in preference, such as those for physical attractiveness and age, are apparent in census and environmental data.
Yet, the relationship of other traits, such as personality or intelligence, to concrete mate choice is more complex. In survey-based research with Brazilian college students, Castro, Hattori, and Lopez (2012) found that preferences for non-physical traits (e.g., humor, intelligence) did not always correlate with concrete perceptions of current or recent mates. Their results show how mate preference may differ significantly across sex within a sample without necessarily predicting individuals' perceptions of real-world romantic partners (Castro et al., 2012). These findings also illustrate the difference between mate preference and mate choice. Mate preference is the set of traits a given individual would find desirable in a mate. In contrast, mate choice refers to the actual sexual or romantic choices an individual makes in the real world. For example, a given person may claim to find brown eyes more attractive than other eye colors. However, this person may choose to date someone with grey eyes. Thus, while their mate preference is for brown eyes, their actual mate choice differed from that preference. Much of the research discussed so far deals with mate preference, which is what we chose to examine in this study, as opposed to concrete mate choices.
Toro-Morn and Sprecher (2003) further examined Buss's cross-cultural findings by distributing preferred mate characteristic surveys to university students in the United States of America and the People's Republic of China (PRC). They asked participants to rate features such as "Honest and trustworthy," "Intelligent," "Sexy looking," and "Wealthy." Toro-Morn and Sprecher found that, although both US and PRC students valued relational attributes that contributed to long-term stability, such as honesty and health, the samples also differed in ways that were attributed to culture (for example, US participants rated "Wants Children" as more important). Interestingly, significant gender differences were observed within both the US and PRC samples. For example, men focused on physical attractiveness, while women tended to desire status, whether financial or social. For individuals, desirable personality traits in a mate are often those which mirror their own (Botwin, Buss, & Shackelford, 1997). The Five Factor Model used by Botwin et al. (1997) provides an understandable method of explaining these differences for personality research, which relies on measuring differences in personality in a large sample.
These scales have been tested across multiple demographics (e.g., age, gender, nationality) and were originally developed by examining personality research from the 1960s and 1970s across these same demographics. We refer the interested reader to McCrae and John's (1992) survey article on the Five Factor Model, which is heavily cited in this and the preceding paragraphs. Botwin et al. (1997) found that, in relationships which had lasted longer than a year, personality differences across the Five Factors were predictive of relational unhappiness. Long-term partners were likely to exhibit similar personality traits, showing a distinct connection between personality preferences in potential mates and successful long-term romantic relationships. Furthermore, among all participants, Botwin et al. (1997) found that certain personality traits were unappealing in a potential mate. Specifically, they found that low agreeableness, low emotional stability, and unequal openness to experience were universally undesirable for both men and women. Botwin et al.'s (1997) results suggest that personality, as measured in the Five Factor Model, has a strong influence on mate preference and the long-term outcomes of concrete mate choices. Yet, Castro et al. (2012) suggest that personality plays a lesser role in mate preference, especially among males. These studies tell us several things. First, there is some general effect of personality on mate preference. Second, beyond this general effect, there are certain dimensions of personality, such as agreeableness or openness, which seem to be stronger predictors (and possibly more desirable) for mate preference. Finally, although an effect has been observed, there is no exact consensus on the size or specific nature of this effect across multiple studies with differing hypotheses and research designs.
This discrepancy justifies confirmatory research with novel methodology focusing on the Big Five and mate preference to determine the size and reliability of personality's effect in the larger population's mating preferences.
If we assume that the findings of Botwin et al. (1997) and Buss (1989) are representative of the larger population, we should expect similar results in other studies, including those which use non-survey-based measurements, such as textual analysis of participants' writing. This kind of convergent validity is essential for multiple reasons. The most obvious is that it establishes the presence of a meaningful effect of personality in the population's mate preferences. Further, it enriches our understanding of the exact function of personality as an influencer of mate preference. Also, from the standpoint of meta-analysis, multiple methodologies give a clearer picture of the population effect size, which is prone to misinterpretation or uncertainty in individual studies (Stukas & Cumming, 2014). Finally, because participants are free to respond to written prompts however they choose, textual analysis represents a truly continuous measurement of an effect, strengthening the generalizability of results obtained from survey research.
This study examined the effect of personality differences in each of the Five Factors on mate preference among males and females. However, unlike previously mentioned research, we measured participants' mate preference through written responses to a prompt. We hypothesized that, like previous non-linguistic research, similarity in participants' personality scores would predict similar mate preferences as recorded through responses to a written prompt. To test this hypothesis, we needed to define what similarity in mate preference meant in the context of our written prompt, and provide a method for quantitatively measuring said similarity.

Latent Semantic Analysis
To this end, we utilized Latent Semantic Analysis (LSA), an algebraic technique which converts word frequency and co-occurrence into thematic cosines (Landauer, Foltz, & Laham, 1998). Currently, there are several common methods for textual analysis in quantitative psychological research, such as LSA and Linguistic Inquiry and Word Count (LIWC; Pennebaker, Boyd, Jordan, & Blackburn, 2015). LIWC is a text analysis program which counts the occurrence of words with implicit psychological meanings and has been utilized to detect meaning in varied areas of empirical psychological research (Tausczik & Pennebaker, 2010). However, LSA is fundamentally different from LIWC in its input, mathematical structure, and quantitative output.
LSA measures all individual word occurrences across an input corpus without sorting words into distinct categories. Moreover, this input corpus may be composed of arbitrarily many distinct documents, ranging from a handful to hundreds of thousands of individual texts. Researchers then create a sample space from the input documents, whose unique linguistic qualities are determined by individual word co-occurrence. Based on this word co-occurrence, each document is then assigned a position in the sample space. This sample space allows us to calculate a similarity score, called a thematic cosine, between each pair of documents. Like a correlation, higher scores represent more similarity, and lower scores represent less similarity, as determined by position in the larger sample space (Landauer & Dumais, 1997).
Intuitively, we can think of LSA as a social media network for documents. Imagine a group of close friends on Facebook: many of their experiences, language, and references will be similar. If we then view this group of friends as a smaller portion of a larger social network, we should be able to notice (at least qualitatively) a higher level of similarity between profiles and posts of our original group of friends when compared to the larger sample. In making these connections, we have established a sort of relationship between the members of our social network, which is exactly the aim of Latent Semantic Analysis when examining a collection of writing samples. The motivation for Latent Semantic Analysis is simply to quantify the relationship between a set of writing samples, which provides a framework for quantitative analysis of qualitative (text) data.
Mathematically, a base measure of similarity between documents is constructed through word co-occurrence, which can then be extrapolated to the entire sample. Word co-occurrence is how often two writing samples share the same word choices. This measurement is encoded into a matrix structure. Each vocabulary word is represented by a row, and each document is represented by a column. Within each cell, the number of occurrences of a given word in a specific document is recorded. For example, if the fourth row represents the word "alpine" and document three used the word "alpine" seven times, we would expect the third column of the fourth row of our matrix to have a seven as its entry. Next, these row and column vectors are compared to construct a thematic cosine, in a similar fashion to correlation analysis. This thematic cosine is then our measurement of similarity between two documents. Because each document has a column and each word has a row, each individual word in each document is accounted for in constructing our thematic cosine. As a mathematical model of thematic similarity, Latent Semantic Analysis has been extremely useful in demonstrating patterns within linguistic corpora and has been cited thousands of times. For a recent example, Gefen et al. (2018) applied LSA to medical records, accurately pairing keywords with medical conditions across all records. LSA has also been utilized to model personality traits (Kwantes, Derbentseva, Lam, Vartanian, & Marmurek, 2016), to model topics in political debates (Valdez, Pickett, & Goodson, 2018), and to grade essays automatically (Williams, 2006). The demonstrated use and applicability of LSA in measuring between-document similarity make it an ideal choice for measuring similarity in participants' writing.
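To make the matrix construction above concrete, the following minimal Python sketch builds a word-by-document count matrix and compares document columns with a cosine. It is an illustration only: the three short documents, the whitespace tokenizer, and the function names are assumptions introduced here, not materials from the study. The sketch also computes cosines on raw counts, whereas a full LSA would first reduce the matrix with a singular value decomposition.

```python
import math
from collections import Counter


def term_document_matrix(docs):
    """Return (vocab, matrix): rows are words, columns are documents, cells are counts."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[word] for c in counts] for word in vocab]
    return vocab, matrix


def column(matrix, j):
    """Extract document j as a vector of word counts."""
    return [row[j] for row in matrix]


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy documents, for illustration only.
docs = [
    "alpine hike then dinner",
    "dinner then a movie",
    "alpine alpine hike",
]
vocab, tdm = term_document_matrix(docs)
sim_01 = cosine(column(tdm, 0), column(tdm, 1))
sim_02 = cosine(column(tdm, 0), column(tdm, 2))
```

In this toy example, the first and third documents share "alpine" and "hike" and therefore receive a higher cosine than the first and second documents, which share only "dinner" and "then".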
Motivated by the positive findings of Buss (1989) and Botwin et al. (1997) regarding personality and mate preference, we designed our study to measure a similar effect in participants' writing. We hypothesized that, between heterosexual males and females, similar scores on each of the Five Factors would predict similar responses to a romantic writing prompt. Here, similarity in writing is defined as a higher thematic cosine score. Since this prompt asked participants to define an ideal romantic situation, we posited that the thematic cosine between participants would measure similarity in mate preferences. So, while we did not directly ask participants whether they favored certain traits (e.g., intelligence, physical fitness), we did measure their mate preferences in a romantic scenario (i.e., a first date). Having calculated thematic cosines and personality difference scores, we then utilized a multilevel model, with each of the Five Factors being examined as an effect. This analysis allowed us to determine which personality dimensions had significant effects on similarity in written responses, as well as the size of these effects.

Method

Participants
A sample of undergraduate students (N = 105) was recruited from a large Midwestern university. All participants were enrolled in an introductory psychology course and received two research-participation credits for completing the study. Relatively even samples of male (n = 54) and female (n = 51) participants were recruited. The average participant was around 19 years old (M = 18.75, SD = 1.60), and the majority were white (96.15%), with the remainder not answering (3.85%). Sample collection occurred over a two-month period from October through early December. As described below, participants were required to provide a writing sample of at least 2200 characters. Several participants did not meet this criterion and filled in random symbols to finish the study: n = 5 female, n = 10 male. Therefore, data from N = 90 participants were analyzed in the results.

Materials and Procedure
All participants received online survey materials through Qualtrics, an internet survey platform. After reporting demographic information (e.g., gender, age, academic major, ethnicity), participants completed the Big Five Personality Trait Short Questionnaire (Morizot, 2014), which assessed openness, extraversion, agreeableness, conscientiousness and emotional stability.
Finally, in random order, participants responded to a pair of writing prompts. One concerned their interests and hobbies ("Describe your interests and/or hobbies"), while the other asked them to describe an ideal date ("Describe an ideal date with your perfect romantic partner"). The order of prompts was counterbalanced, and each response had to reach a minimum of 2200 characters before participants could move on with the study. This requirement was intended to ensure enough information density in the writing samples to yield usable latent semantic data. For this specific study, we did not utilize the interests-and-hobbies written data; we report this step in the interest of transparency. Therefore, in this study, we only tested the relationship between similarity on each personality measure and similarity in romantic writing.

Results
Data analysis was conducted in two major steps: Latent Semantic Analysis to create the dependent thematic cosine variable, and several multilevel models (MLM) examining the influence of individual participants' personality differences on romantic writing similarity as measured by thematic cosines.

Latent Semantic Analysis
Raw written data were marked with a participant number, gender, and prompt number. LSA was conducted in R using the lsa package (Wild, 2015). Initially, LSA encodes the word frequency and co-occurrence of each participant's written response in a text-frequency matrix. This matrix was normalized using log weighting to control for the sparsity and skew of text frequencies, that is, the differences in the number of very frequently versus infrequently used words. We also removed common English stop words (e.g., "the") to reduce the number of meaningless co-occurrences across writing samples (see Rajaraman and Ullman [2011] for justification). LSA was then performed, which created a matrix of concepts by documents, with values in this matrix representing the relationship of each concept to a document. Cosine values between each male-female participant combination were calculated; therefore, the final dependent variable dataset included 2,024 cosine values (44 male participants × 46 female participants; e.g., Male Participant 1 to Female Participant 1). The complete scripts and data set can be found at: https://osf.io/5qw67/.
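The preprocessing and pairing steps described above can be sketched in a few lines of Python. This is not the study's R script (which is available at the OSF link); it is a hedged illustration in which the stop-word list is a tiny placeholder, the weighting is assumed to take the common log(count + 1) form, and all function names are hypothetical.

```python
import math
from itertools import product

# Tiny illustrative stop-word list; an assumption, not the list used in the study.
STOP_WORDS = {"the", "a", "and", "of", "to"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def log_weight(count):
    """Dampen large counts; assumes the log(count + 1) form of log weighting."""
    return math.log(count + 1)


def pair_ids(n_male, n_female):
    """Every (male, female) participant combination, numbered from 1."""
    return list(product(range(1, n_male + 1), range(1, n_female + 1)))


tokens = tokenize("The walk to the lake and the picnic")
pairs = pair_ids(44, 46)  # 44 male and 46 female participants after exclusions
```

With 44 male and 46 female participants, `pairs` contains 2,024 combinations, matching the number of cosine values reported above.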

Data Screening
Next, the independent variables were added to the cosine values. Difference scores were calculated by subtracting each male participant's score from each female participant's score on each personality variable. Following this subtraction, we took the absolute value so that the difference score did not depend on the order of subtraction. Next, the data were screened for the assumptions of parametric regression. Mahalanobis distance was calculated on the cosine scores and personality responses (Tabachnick & Fidell, 2012). Six participant-pairs were flagged as exceeding the Mahalanobis cutoff score (χ²(6) = 22.46 at p < .001) and were excluded. Data were then screened for accuracy, additivity, normality, linearity, and homoscedasticity. The data were slightly skewed and heteroscedastic; however, with the large sample size of participant-pairs, the analysis should be robust to these violations.
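As a small illustration of the difference scores described above, the sketch below computes the absolute difference on each trait for one male-female pair. The trait keys, the function name, and the example scores are hypothetical and are not taken from the study's data.

```python
TRAITS = (
    "openness",
    "extraversion",
    "agreeableness",
    "conscientiousness",
    "emotional_stability",
)


def difference_scores(male, female):
    """Absolute female-minus-male difference on each trait (order of subtraction is irrelevant)."""
    return {trait: abs(female[trait] - male[trait]) for trait in TRAITS}


# Hypothetical scores on the 10-50 scale, for illustration only.
male_1 = {"openness": 32, "extraversion": 41, "agreeableness": 28,
          "conscientiousness": 35, "emotional_stability": 30}
female_1 = {"openness": 36, "extraversion": 29, "agreeableness": 28,
            "conscientiousness": 44, "emotional_stability": 25}

diffs = difference_scores(male_1, female_1)
```

Because of the absolute value, swapping the two participants yields the same scores, and a value of 0 indicates a perfect match on that trait.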

Multilevel Model Analysis
Following data screening, descriptive statistics were calculated for romantic cosines and personality measures across both males and females. The average romantic cosine (M = .17, SD = .16) was relatively small and showed a comparatively large standard deviation. Personality scores ranged from 10 to 50 on an interval scale, although we utilized a difference score in our MLM. The difference scores could range from 0 (perfect match in personality scores) to 40 (maximal mismatch, 50 - 10). However, for convenience, Table 1 shows the original personality scores as means, standard deviations, and Cohen's ds (Lakens, 2013) across both males and females.
In our analysis, each personality variable was analyzed in a separate MLM. We chose this design to control for the correlated error introduced by pairing each participant with every opposite-gender participant (e.g., Male Participant 1 being represented in the data multiple times across female participants). We compared three distinct models: an intercept-only model, which estimates the y-intercept as the same across all participants; a random-intercept model, which allows estimation of the y-intercept controlling for multiple instances of the same participant, thus handling correlated error; and a random-intercept model with personality differences as a predictor, which controls for repeated measures for each participant and estimates the relationship between the IV and the DV (Field, Miles, & Field, 2012).
Except for the MLM examining openness, the random-intercept model with predictors was the best fit for our data in each MLM. However, because of the repeated-measures structure of the data, we retained the random intercept in all models, as we wished to control for correlated error. Model significance was evaluated using a chi-square difference test, in which each model is compared to the previous model to determine how adding random intercepts or predictors improves the model; however, to determine the best fit for our data, we utilized the Akaike Information Criterion (AIC). A lower AIC corresponds to less information lost; hence, models with lower AIC scores are better fits for our data. Each model's degrees of freedom and intercepts, as well as significance among all models, can be found in Table 2.
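For readers unfamiliar with these two comparison tools, the sketch below shows how they are typically computed from model log-likelihoods. The log-likelihood values and parameter counts are hypothetical, the function names are introduced here for illustration, and the closed-form p-value applies only to a chi-square difference test with one degree of freedom (one added parameter).

```python
import math


def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln(L); lower values indicate a better fit."""
    return 2 * n_params - 2 * log_likelihood


def chi_square_difference(ll_reduced, ll_full):
    """Likelihood-ratio statistic for two nested models."""
    return 2 * (ll_full - ll_reduced)


def p_value_df1(statistic):
    """Upper-tail chi-square probability for exactly 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2))


# Hypothetical values: a random-intercept model (3 parameters) versus the
# same model with one personality difference score added (4 parameters).
ll_reduced, k_reduced = -100.0, 3
ll_full, k_full = -97.0, 4

aic_reduced = aic(ll_reduced, k_reduced)
aic_full = aic(ll_full, k_full)
statistic = chi_square_difference(ll_reduced, ll_full)
p = p_value_df1(statistic)
```

In this made-up example the model with the predictor has the lower AIC and a chi-square difference that is significant at the .05 level, which is the pattern that would favor keeping the predictor.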
We found that differences in extraversion, agreeableness, and conscientiousness were predictors of similarities in thematic cosines across romantic writing. Because the slopes were negative, this finding suggests that smaller differences in personality predicted larger thematic cosines. In other words, the more similar two participants' personality scores were (small differences, closer to zero), the larger the overlap between their romantic writing. Differences in emotional stability and openness were not predictors of similarity in thematic cosines. For convenience, see Table 3 for predictors, intercepts, standard errors, and p-values for each predictor.

Discussion
Our results show that similarity in extraversion, agreeableness, and conscientiousness predicted similarity in writing about a romantic partner. Agreeableness, which had the largest predictor b-value, aligns with existing findings by Botwin et al. (1997), who suggested that agreeableness was the strongest personality predictor for high mate value and relational satisfaction in concrete mate choices. Since our study examined mate preference specifically, we cannot draw conclusions related to mate choice. However, our results show that similar levels of agreeableness predict similarities in written responses. This finding suggests that further research in mate preference and personality may uncover a relationship of agreeableness to mate preference similar to the one found in Botwin et al.'s studies on mate choice. Importantly, this consistency with previous research on personality and mate preference suggests that written measurements can return results similar to those of survey-based research. LSA has already been shown to be an adaptable tool, with applications in areas such as medical research (Gefen et al., 2018), personality (Kwantes et al., 2016), and education (Williams, 2006). However, this research suggests that LSA may provide new insight into the exact relationship of personality and mate preference. In the literature, most research examining mate preference utilizes questions concerning observed constructs related to mate preference (such as socio-economic status or personality), and usually measures these variables on a Likert-style scale (Buss, 1989; Castro et al., 2012). This method of analysis has several benefits, including generalizability of results from study to study, ease of drawing meaningful conclusions from data, and simplification of replicability. However, it also suffers from drawbacks similar to those of survey data. For example, in our study, our sample consisted largely of white undergraduate psychology students between the ages of eighteen and twenty.
Naturally, this represents a challenge in generalizing our research to the larger population. What, then, justifies the future use of written measurements and LSA?
In the context of examining personality and mate preference, written measurement has many strengths. Written prompts allow participants to respond in a unique way before any data transformation takes place. For any single item on a Likert-style survey, there will always be identical responses. With written measurements, we see the exact opposite: barring experimental error, no two participants will ever contribute an identical writing sample. While we did not examine the effects of individual differences in this study, this area would be a reasonable next step in research.
While LSA is a valuable tool in many areas of research, it also presents several challenges, both theoretical and pragmatic. Foremost is the interpretability of results. Often when working with ordinal measurements, such as age (measured in years) or Likert scales, descriptive statistics of a sample are easily interpreted and explained. That does not mean a specific sample's mean is the correct or ideal measurement of central tendency. However, it is easier to understand a statement such as, "Our sample had a mean age of 23 with a standard deviation of 2.5 years," than one like, "Our sample had a mean thematic cosine of .35 with a standard deviation of .25." Superficially, thematic cosines may be more difficult to interpret than a standard correlation, such as Pearson's r (Pearson, 1896). This difficulty arises because, while thematic cosines and correlations both measure similarity, there are no traditional small, medium, or large benchmarks for thematic cosines. However, the direction and magnitude interpretations for correlations and cosines are the same.
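One way to see why cosines and correlations are read in the same way is that Pearson's r is itself a cosine: it is the cosine between two vectors after each has been centered on its own mean. The short sketch below illustrates this with two hypothetical vectors; the function names and values are assumptions introduced for illustration.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def center(u):
    """Subtract the mean from every element."""
    mean = sum(u) / len(u)
    return [a - mean for a in u]


def pearson_r(u, v):
    """Pearson correlation expressed as the cosine of the mean-centered vectors."""
    return cosine(center(u), center(v))


# Hypothetical vectors, for illustration only.
u = [1, 2, 3, 4]
v = [2, 1, 4, 3]

raw_cosine = cosine(u, v)
r = pearson_r(u, v)
```

The two values differ because the raw cosine is not mean-centered, but both are bounded in magnitude by 1 and both increase as the vectors point in more similar directions.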
In this study, thematic cosines derived through LSA provided a continuous measurement of mate preference, which was utilized to model the hypothesized effect of personality on mate preference. In this context, having a continuous variable is incredibly valuable. Continuous measures usually lead to a broader understanding of variance in a sample while avoiding common statistical problems associated with ordinal measurements. For example, Likert-style data with few response options (e.g., where responses range from 1 to 5) are more susceptible to Type I and Type II errors in parametric statistical tests than continuous measurements are (Gregoire & Driver, 1987). Variable selection is a complicated issue, with many professional psychologists disagreeing on the use of Likert-style data in parametric statistical tests (see Rasmussen [1987] for a contrasting opinion to Gregoire and Driver [1987]).
Rather than replacing survey methods, we see Latent Semantic Analysis as a complementary tool in modelling mate preference. Moreover, in situations where ordinal data are either statistically inappropriate or cumbersome, Latent Semantic Analysis provides a broad and continuous measure for parametric statistical tests. This motivated our usage of a multilevel model analysis and creates potential tools for future research beyond the initial findings of Buss (1989) and Botwin et al. (1997). Thus, in future studies on mate preference, when the hypothesis assumes an underlying continuous population distribution, LSA represents a useful method of modelling this distribution.
Of course, in an ideal situation, every hypothesis would be measured with several unique and contrasting measures. Since we ourselves only utilized Latent Semantic Analysis in this study, and did not present any complementary surveys, we naturally understand that resources and time are usually limited. Fortunately, Latent Semantic Analysis is relatively time- and cost-effective and can be executed using the lsa package (Wild, 2015) in R. For those interested in trying Latent Semantic Analysis for their next project (or just for fun), feel free to download the scripts and data utilized in this study from our OSF page: https://osf.io/5qw67/. In conclusion, we look forward to seeing the unique insight Latent Semantic Analysis can provide in many

The intercept-only model and random-intercept model are identical for each IV and hence are listed only once. Each personality factor model was compared to the random-intercept model for the change statistics (χ²(1) and p).

Note. df = 1979.