Demographic data

Variable name | Description | Details
userid | Unique user identifier
gender | Gender of the user | 1 = female, 0 = male
relationship_status | Relationship status | Values: 1 = 'Single'; 2 = 'In a Relationship'; 3 = 'Married'; 4 = 'Engaged'; 5 = 'It's Complicated'; 6 = 'In an Open Relationship'; 7 = 'Widowed'; 8 = 'Divorced'; 9 = 'Separated'; 10 = 'In a domestic partnership'; 11 = 'In a Civil Union'; 12 = 'Hooked'
interested_in | Interested In | 1 = Male, 2 = Female, 3 = Female, Male
mf_relationship | Meeting others for relationship
mf_dating | Meeting others for dating
mf_random | Meeting others for random play
mf_friendship | Meeting others for friendship
mf_whatever | Meeting others for whatever I can get
mf_networking | Meeting others for networking
locale | Language version of Facebook interface | e.g. en_US, en_UK, or pl_PL
network_size | Number of friends | Note that this tends to get outdated quickly – gives you an idea though!
timezone | User's timezone

Facebook Activity

Variable name | Description | Details
userid | Unique user identifier
n_like | Number of likes for this user
n_status | Number of status updates for this user
n_event | Number of events for this user
n_concentration | Number of concentrations for this user
n_group | Number of group memberships for this user
n_work | Number of work places for this user
n_education | Number of education entries for this user
n_tags | Number of photo tags in the user-photo_tags table
n_diads | Number of diads in the friendship diads table

Personality Cross-Ratings

Personality as rated by user's friends. Results were obtained by asking users to respond to a 10-item version of the Five Factor Model IPIP questionnaire and describe the chosen friend. The 10 items were picked randomly from the 100 available. Raters' answers to individual questions (ranging 0-4) were added up to trait scores (ranging 0-8).

To estimate the average rating, you have to divide the score by the number of raters (the scores should range from 0 to 8).

Variable name | Description | Details
userid | Unique user identifier
friend_ope | Openness as rated by Facebook friends
friend_con | Conscientiousness as rated by Facebook friends
friend_ext | Extroversion as rated by Facebook friends
friend_agr | Agreeableness as rated by Facebook friends
friend_neu | Neuroticism as rated by Facebook friends
friend_total_raters | Number of friends that rated user's personality
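
The averaging step described above can be sketched as follows. This is a minimal illustration, assuming that the friend_* columns hold the trait scores summed over all raters (as the text implies); the function name and the example row are hypothetical.

```python
def average_friend_rating(trait_sum, n_raters):
    """Return the mean friend rating on the 0-8 scale, or None if nobody rated the user."""
    if n_raters <= 0:
        return None
    return trait_sum / n_raters


# Hypothetical row: three friends whose openness ratings add up to 15.
row = {"friend_ope": 15, "friend_total_raters": 3}
avg_ope = average_friend_rating(row["friend_ope"], row["friend_total_raters"])
print(avg_ope)
```

An average outside the 0 to 8 range would suggest that the assumption about summed scores does not hold for that row.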

Rated Users and Who Rated Them

Contains a list of pairs of users: the person who was rated, and the person who did the rating. Note that we did not store the individual ratings from each person, only the aggregate of the ratings (in the personality cross-ratings table above). In cases where there is only one friend rater, it will be the user ID shown below. Basically, this table was used in myPersonality to check that a person was only rated by each friend once.

Variable name | Description | Details
rated_user_id | Unique user identifier for the person who got rated
rater_friend_id | Unique user identifier for the friend who did the rating

Egocentric Networks Stats

Network parameters were calculated using R and the sna package. To ensure maximum compatibility with other studies, the values were compared with those calculated by one of the most popular network analysis packages, UCInet.

Egocentric networks are defined as networks containing a single actor (ego), all the actors (alters) that an ego is connected to, and all the links between the alters.

Variable name | Description | Details
userid | Unique user identifier
Network_size | Network size | Network size is the total number of people in the egocentric network, including ego
Network_size_inc | Network size from other database | Don't worry about this one
betweenness | Ego betweenness | Betweenness centrality of an ego can be defined as the extent to which an ego lies between alters within the network (Freeman, 1979). Ego betweenness is high when alters are not well interconnected, and thus many of the shortest paths run through ego. Computed as betweenness(d, gmode="digraph", nodes=ego)/2, where betweenness() is a function from the sna package for R and d is the egocentric network digraph.
n_betweenness | Normalized ego betweenness | As ego betweenness is related to the size of the network, it should be normalized in order to allow for comparisons between egocentric networks of different sizes. The normalization used here divides betweenness by the number of all possible pairs between alters (this method is also employed in the UCInet package). (Add graph showing the relation between betweenness and size and normalized betweenness and size.) Computed as (betweenness*2)/(n*(n-1)), where n is the number of alters in the egocentric network without the ego (network_size-1).
Density | Density | Density indicates how many connections (edges) there are between alters, as compared to the maximum possible number of edges. For an undirected egocentric graph, it is calculated by dividing the total number of edges by the maximum possible number of edges. The density score here can differ slightly from the one provided by UCInet, as it is calculated for the whole ego network including ego (whereas UCInet calculates the density of the egocentric network with ego removed). Computed as gden(d, mode="digraph", ignore.eval=TRUE), where gden() is a function from the sna package for R and d is the egocentric network.
brokerage | Brokerage | This is the number of pairs of alters that are not directly connected. Computed as n*(n-1)-(n_diads-n), where n is the number of alters in the egocentric network without the ego (network_size-1); n*(n-1) is the number of all possible diads that do not include ego; and n_diads is the actual number of diads in the network.
nbrokerage | Normalized brokerage | As brokerage also depends on the size of the network, it is normalized by dividing it by the number of all possible pairs between alters: brokerage/(n*(n-1)), where n is the number of alters in the egocentric network without the ego (network_size-1).
transitivity | Transitivity | Estimated using the gtrans() function from the sna R package: gtrans(network, mode="digraph", measure="weak")
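
The arithmetic behind the normalized and brokerage variables above can be sketched in a few lines. This is only an illustration of the formulas as written in the table: the raw betweenness value is taken as given (in the original it comes from the sna package and is not recomputed here), and the function name and example numbers are hypothetical.

```python
def ego_network_metrics(network_size, betweenness, n_diads):
    """Apply the table's formulas to one egocentric network.

    network_size: number of people in the network, including ego
    betweenness:  raw ego betweenness (already computed elsewhere)
    n_diads:      actual number of diads in the network, including ego's ties
    """
    n = network_size - 1          # number of alters (ego removed)
    pairs = n * (n - 1)           # all possible diads that do not include ego
    if pairs == 0:
        return None               # normalization is undefined for fewer than two alters
    brokerage = pairs - (n_diads - n)
    return {
        "n_betweenness": (betweenness * 2) / pairs,
        "brokerage": brokerage,
        "nbrokerage": brokerage / pairs,
    }


# Hypothetical network: ego plus 4 alters, 4 ego-alter ties and 2 alter-alter ties.
metrics = ego_network_metrics(network_size=5, betweenness=4.0, n_diads=6)
print(metrics)
```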

Note: you can also use the n_diads variable from the Facebook Activity table to estimate the size of the social network. Please note, however, that these values were recorded even for people observed within the networks of others (i.e. we do not have all the information about their social networks).

When using network variables in regressions, it is worth log-transforming them, e.g. log(degree) and log(density).

Personality Scores

Collected using:

  • 20–100-item IPIP proxy for Costa and McCrae's NEO-PI-R domains (Five Factor Model)
    • The most popular and widely accepted personality questionnaire at the moment.
    • Eight length versions, from 20 to 100 items (around 40% of the users take the 100-item version). Users either decide the length of the questionnaire that they want to take in advance, or they can take extra questions in blocks of 10 until they have finished all 100 items.
    • Respondents are taking the test to get feedback, so they are quite well motivated, which results in high accuracy (reliability >0.8, better than in most supervised pen-and-paper applications of the same measure).
  • 336-item IPIP proxy for Costa and McCrae's NEO-PI-R facets (FFM)
    • Very detailed personality questionnaire (IPIP version of the famous NEO-PI-R).
    • 336 items, 30 scales (the test is composed of 300 IPIP-NEO facets scales items and 100 IPIP-NEO domains scales items; hence those scales share 64 items, and the whole test contains 336 items).
    • Very high accuracy (reliability around 0.7–0.9).
    • Highly motivated respondents – in order to take the test, respondents must earn “credits” by participating in some other (and less attractive) experiments, or by paying a small sum of money (around $4).
    • Around 7,000 scores.

More details about IPIP domains.

Data pre-processing: People's responses may contain some noise and unreliable information – e.g. users who selected “Agree” for all 100 questions still get a score from the system, although it is quite clear that this score is not valid. Hence, we remove around 3% of the protocols (questionnaires filled in by individuals) before we publish them here. You can read more about how unreliable protocols are removed in this report by Michal Kosinski.

List of questions.

BIG5 Scores only

Variable name | Description | Details
userid | Unique user identifier
ope | IPIP-NEO Openness
con | IPIP-NEO Conscientiousness
ext | IPIP-NEO Extroversion
agr | IPIP-NEO Agreeableness
neu | IPIP-NEO Neuroticism
item_level | 0/1 whether we have item-level data for the IPIP-NEO domains personality
blocks | Length of the IPIP-NEO domains questionnaire (20-336)
date | Date taken

BIG5 Facet scores

Variable name | Description | Details
userid | Unique user identifier
Eneofacets | Extroversion | Calculated with the items from the 100-item IPIP NEO-PI-R Measure; question 69 is recoded, as it is used as a facet item for N4
Nneofacets | Neuroticism | Calculated with the items from the 100-item IPIP NEO-PI-R Measure
Oneofacets | Openness | Calculated with the items from the 100-item IPIP NEO-PI-R Measure
Aneofacets | Agreeableness | Calculated with the items from the 100-item IPIP NEO-PI-R Measure
Cneofacets | Conscientiousness | Calculated with the items from the 100-item IPIP NEO-PI-R Measure

Facet (in brackets: reliability reported on the IPIP website) | Our Cronbach's Alpha reliability
A1 TRUST (.82) | 0.90
A2 MORALITY (.75) | 0.79
A3 ALTRUISM (.77) | 0.85
A4 COOPERATION (.73) | 0.71
A5 MODESTY (.77) | 0.79
A6 SYMPATHY (.75) | 0.81
C1 SELF-EFFICACY (.78) | 0.84
C2 ORDERLINESS (.82) | 0.84
C3 DUTIFULNESS (.71) | 0.77
N1 ANXIETY (.83) | 0.87
N2 ANGER (.88) | 0.92
N3 DEPRESSION (.88) | 0.91
O1 IMAGINATION (.83) | 0.84
O5 INTELLECT (.86) | 0.83
O6 LIBERALISM (.86) | 0.80

Reliability coefficients calculated using 10.2010 snapshot.

Facebook Status Updates

The LIWC dataset should meet the needs of many researchers. The Linguistic Inquiry and Word Count (LIWC) program measures 64 linguistic and psychological processes, personal concerns, and spoken categories. Each of the variables in this dataset is listed below. Additional information (including alphas) is available from the LIWC website and the LIWC manual.

This analysis was conducted by splitting the user_status.csv file into a separate text file for each user, containing all of that user's status updates recorded by myPersonality. LIWC was then run on each file. The resultant CSV file contains a userID field, which can be used to merge the data with other myPersonality databases. All other variables correspond to the categories listed in the table below.

The value of each cell in the database represents the percentage of words in all of each user's status updates that fall into each category. NOTE: Because some words fall into multiple categories, percentages will often sum to more than 100%.
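
To make the meaning of the cell values concrete, here is a toy sketch of how a category percentage can be computed. It is not the LIWC program: the three-category dictionary is a hypothetical stand-in built from example words in the table below, and real LIWC dictionaries are far larger and also match word stems.

```python
import re

# Hypothetical mini-dictionary; "affect" deliberately overlaps with the other two,
# which is why percentages can add up to more than 100%.
TOY_DICTIONARY = {
    "posemo": {"love", "nice", "sweet"},
    "negemo": {"hurt", "ugly", "nasty"},
    "affect": {"love", "nice", "sweet", "hurt", "ugly", "nasty"},
}


def category_percentages(text, dictionary):
    """Return, per category, the percentage of words in text that belong to it."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return {category: 0.0 for category in dictionary}
    return {
        category: 100.0 * sum(word in vocab for word in words) / len(words)
        for category, vocab in dictionary.items()
    }


# All status updates of one (hypothetical) user are pooled before counting.
statuses = ["Love this nice day", "That hurt"]
scores = category_percentages(" ".join(statuses), TOY_DICTIONARY)
print(scores)
```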

Feel free to direct any questions about this database to Sean Rife.

Category | Variable Name | Examples
Linguistic Processes
Total function words | funct
Total pronouns | pronoun | I, them, itself
Personal pronouns | ppron | I, them, her
1st pers singular | i | I, me, mine
1st pers plural | we | We, us, our
2nd person | you | You, your, thou
3rd pers singular | shehe | She, her, him
3rd pers plural | they | They, their, they’d
Impersonal pronouns | ipron | It, it’s, those
Articles | article | A, an, the
Common verbs | verb | Walk, went, see
Auxiliary verbs | auxverb | Am, will, have
Past tense | past | Went, ran, had
Present tense | present | Is, does, hear
Future tense | future | Will, gonna
Adverbs | adverb | Very, really, quickly
Prepositions | prep | To, with, above
Conjunctions | conj | And, but, whereas
Negations | negate | No, not, never
Quantifiers | quant | Few, many, much
Numbers | number | Second, thousand
Swear words | swear | Damn
Psychological Processes
Social processes | social | Mate, talk, they, child
Family | family | Daughter, husband, aunt
Friends | friend | Buddy, friend, neighbor
Humans | human | Adult, baby, boy
Affective processes | affect | Happy, cried, abandon
Positive emotion | posemo | Love, nice, sweet
Negative emotion | negemo | Hurt, ugly, nasty
Anxiety | anx | Worried, fearful, nervous
Anger | anger | Hate, kill, annoyed
Sadness | sad | Crying, grief, sad
Cognitive processes | cogmech | Cause, know, ought
Insight | insight | Think, know, consider
Causation | cause | Because, effect, hence
Discrepancy | discrep | Should, would, could
Tentative | tentat | Maybe, perhaps, guess
Certainty | certain | Always, never
Inhibition | inhib | Block, constrain, stop
Inclusive | incl | And, with, include
Exclusive | excl | But, without, exclude
Perceptual processes | percept | Observing, heard, feeling
See | see | View, saw, seen
Hear | hear | Listen, hearing
Feel | feel | Feels, touch
Biological processes | bio | Eat, blood, pain
Body | body | Cheek, hands, spit
Health | health | Clinic, flu, pill
Ingestion | ingest | Dish, eat, pizza
Relativity | relativ | Area, bend, exit, stop
Motion | motion | Arrive, car, go
Space | space | Down, in, thin
Time | time | End, until, season
Personal Concerns
Work | work | Job, majors, xerox
Achievement | achieve | Earn, hero, win
Leisure | leisure | Cook, chat, movie
Home | home | Apartment, kitchen, family
Money | money | Audit, cash, owe
Religion | relig | Altar, church, mosque
Death | death | Bury, coffin, kill
Spoken Categories
Assent | assent | Agree, OK, yes
Nonfluencies | nonflu | Er, hm, umm
Fillers | filler | Blah, Imean, youknow


Here's a beautiful database of couples, featuring the userIDs of the two partners (merge with other tables if you need to), their locations, the distance between them in km, and their overlap in photo_tags, likes, groups and, most importantly, the number of shared friends.

Variable name | Description | Details
userid | partner1's unique user identifier
significant_other_id | partner2's userID
relationship_status | Relationship status | Values: 1 = 'Single'; 2 = 'In a Relationship'; 3 = 'Married'; 4 = 'Engaged'; 5 = 'It's Complicated'; 6 = 'In an Open Relationship'; 7 = 'Widowed'; 8 = 'Divorced'; 9 = 'Separated'; 10 = 'In a domestic partnership'; 11 = 'In a civil union'; 12 = 'Hooked'
interested_in | Interested In | 1 = Male, 2 = Female, 3 = Female, Male
lat1 | latitude of the user's location
lon1 | longitude of the user's location
lat2 | latitude of partner's location
lon2 | longitude of partner's location
distance | distance in km | Latitude_and_longitude
s_friends | number of shared friends | Computed using Facebook friendship DIADS database
n_diads1 | number of partner1's friends that we have in the db | Useful for interpreting the number of shared friends
s_group | number of shared facebook groups
n_group1 | number of all the groups of partner1
s_like | number of shared facebook likes
n_like1 | number of all the likes of partner1
s_tags | number of all the photos on which partner1 is tagged
n_tags1 | all tags of partner1
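
The page does not say how the distance column was computed from the coordinates. One common way to get a distance in km from two latitude/longitude pairs is the haversine (great-circle) formula, sketched below as an assumption rather than as the method actually used; the function name and the Earth radius constant are illustrative choices.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Hypothetical couple located one degree of longitude apart on the equator.
print(round(haversine_km(0.0, 0.0, 0.0, 1.0), 1))
```

Comparing such a recomputed value with the stored distance for a few rows is a quick way to check whether the assumption is reasonable for this table.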

FIXME To be added in the future: the correlation between like-dimensions.

mySQL table definition for couples table


This unique database of more than 3.5 million triads contains detailed descriptions of the nodes and edges, and can be easily combined with all the other databases that are available on the myPersonality project website.

Note that the set can also be used to study dyads by ignoring the third actor. Also note that you can easily add more variables by simply combining the triads database with our other databases.

TRANSITIVITY: Apart from studying the properties of nodes and dyads, transitivity and brokerage are some of the most interesting facets at the triad level. Transitivity refers to the “closedness” of the triad. In transitive triads, all three actors are Facebook friends, whereas in intransitive triads, one of these friendships is absent. The individual who is connected to both alters that have not formed a link between themselves holds the brokerage position, a powerful position that can bridge social groups and boundaries.

The database is specifically structured to allow for the study of transitivity. For all intransitive triads, it is clear who occupies the broker position. This individual is marked as broker, with both alters marked as friend1 and friend2. Transitive triads were “rotated” 2 times, such that each of the 3 nodes occupies the broker's position once and every transitive triad appears in the table 3 times. Consider selecting only one of those rows per triad in your studies.
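
A minimal sketch of keeping a single row per triad is shown below. It assumes rows are available as dictionaries with the broker, f1 and f2 columns of the triads table, and it identifies a triad by its set of three members rather than by the rotated column, whose coding is not documented here. The function name and sample rows are hypothetical.

```python
def one_row_per_triad(rows):
    """Keep the first row seen for each unordered set of three users."""
    seen = set()
    kept = []
    for row in rows:
        members = frozenset((row["broker"], row["f1"], row["f2"]))
        if members in seen:
            continue  # a rotation of a triad that was already kept
        seen.add(members)
        kept.append(row)
    return kept


# Hypothetical sample: one transitive triad in its three rotations, one intransitive triad.
rows = [
    {"id": 1, "transitive": 1, "broker": "a", "f1": "b", "f2": "c"},
    {"id": 2, "transitive": 1, "broker": "b", "f1": "a", "f2": "c"},
    {"id": 3, "transitive": 1, "broker": "c", "f1": "a", "f2": "b"},
    {"id": 4, "transitive": 0, "broker": "a", "f1": "b", "f2": "d"},
]
unique_triads = one_row_per_triad(rows)
print([row["id"] for row in unique_triads])
```

Intransitive triads have a single possible broker, so they should appear only once and pass through unchanged.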

The triads database contains the following data:

    • Whether the triad is transitive or not
    • Big Five Personality scores
    • Geographical location: Latitude and longitude
    • Demographics: self-reported gender and age
    • Ego-network properties and size
    • Proximity measures (note that you can calculate additional measures, e.g. using photo-tags or likes):
      • Number of shared friends as % of the network size
      • Number of shared friends
      • Geographical distance
    • Contrasts between nodes:
      • Correlation between personalities
      • Age difference

Variable name | Description | Details
id | Triad ID
transitive | Is the triad transitive?
broker | Broker userid
f1 | Friend 1 userid
f2 | Friend 2 userid
b1 | shared friends between broker and f1
b2 | shared friends between broker and f2
f1f2 | shared friends between f1 and f2
nof_broker | number of broker's friends
nof_f1 | number of f1's friends
nof_f2 | number of f2's friends
rotated | rotation of the transitive triad | Each transitive triad is present in the table 3 times, such that each of the nodes occupies broker's position.
sex_b | sex broker
age_b | age broker (in 2010)
O_b | openness broker
C_b | conscientiousness broker
E_b | extroversion broker
A_b | agreeableness broker
N_b | neuroticism broker
lat_b | latitude broker in decimal degrees (wgs84)
lon_b | longitude broker in decimal degrees (wgs84)
sex_1 | sex f1
age_1 | age f1 (in 2010)
O_1 | openness f1
C_1 | conscientiousness f1
E_1 | extroversion f1
A_1 | agreeableness f1
N_1 | neuroticism f1
lat_1 | latitude f1 in decimal degrees (wgs84)
lon_1 | longitude f1 in decimal degrees (wgs84)
sex_2 | sex f2
age_2 | age f2 (in 2010)
O_2 | openness f2
C_2 | conscientiousness f2
E_2 | extroversion f2
A_2 | agreeableness f2
N_2 | neuroticism f2
lat_2 | latitude f2 in decimal degrees (wgs84)
lon_2 | longitude f2 in decimal degrees (wgs84)
distance_bf1 | distance in km between broker and f1
distance_bf2 | distance in km between broker and f2
distance_f1f2 | distance in km between f1 and f2
score_b | IQ score broker
score_f1 | IQ score f1
score_f2 | IQ score f2
r_sum_f1f2 | Ratio of shared friends between f1 and f2
r_sum_b1 | Ratio of shared friends between b and f1
r_sum_b2 | Ratio of shared friends between b and f2
genders | Genders of the actors f1 f2
big5 | Correlation between f1 and f2 personalities

Syntax: Some suggestions for computations: adding variables from the MPcooked database, selecting only certain triads, etc. Feel free to add your bits as well…

Snyder's Self-Monitoring, 25 items

Data is cleaned/preprocessed as described here.

Our reliability: .69

Variable name | Description | Details
userid | Unique user identifier
sm | Self-monitoring score

Item-level data file

Variable name | Description | Details
selfMonitoring_taken | Date of the test
q1 to q25 | Responses to individual questions | Should add the actual questions here!
sm | Self-Monitoring Score (sum score)
selfmon_no_ans | Number of missing answers
q1raw to q25raw | raw item-level data
PrimaryFirst | Is this the user's first attempt at the test?

Rust's Sense-of-Fairness and Impression Management, 36 items

These are the Fair-mindedness (fm_score) and Disclosure (sd_score) minor scales from the Orpheus questionnaire. Together, they assess integrity in relation to your work life:

  • (Our reliability: .75) Fair Mindedness (or Sense-of-fairness) → how balanced and impartial you are in your decision-making
  • (Our reliability: .61) Impression Management (or reversed Self Disclosure) → to what extent you conduct your life transparently
  • Please note that Impression Management is Self Disclosure reversed (i.e. 1 = 9, 2 = 8, …, 9 = 1)
  • Reference: Rust, J. & Golombok, S. (2009). Psychometric assessment of personality in occupational settings. In Modern Psychometrics, Third Edition: The Science of Psychological Assessment (pp. 165-182). New York, NY: Routledge

Variable name | Description | Details
userid | Unique user identifier
sdfm_work | Are you currently in work or have worked in the past (including a part-time job)? | 1 = yes, 2 = no, 0 = Missing
sd_score | Self Disclosure (which is reversed Impression Management) | To what extent you conduct your life transparently
fm_score | Fair Mindedness (or Sense-of-fairness) score | How balanced and impartial you are in your decision-making
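
The reversal noted above (1 = 9, 2 = 8, …, 9 = 1) amounts to subtracting the score from 10. A small sketch, assuming sd_score is stored on the 1 to 9 scale implied by that mapping; the function name is hypothetical.

```python
def impression_management(sd_score):
    """Convert a Self Disclosure score (1-9) into an Impression Management score (9-1)."""
    if not 1 <= sd_score <= 9:
        raise ValueError("expected a Self Disclosure score between 1 and 9")
    return 10 - sd_score


print([impression_management(score) for score in (1, 2, 9)])
```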

List of questions and scoring details

Satisfaction With Life Scale

  • Our reliability: .86
  • The SWLS was developed over 20 years ago to measure life satisfaction among the general population.
  • The Satisfaction With Life Scale is from Diener (1985).
  • Throughout the period that this questionnaire has been available on myPersonality, the same 5 satisfaction-with-life questions from Diener (1985) have been asked. During that period, however, two different projects relating SWL to other factors were run (with Dr Richard Tunney):
    • Between the beginning (~August 2008) and August 2010, there were additional questions concerning the friends that people have and the main activity that they do together. During this period, if a user retook the SWL questionnaire, their new results would replace their old results in the database. It is therefore possible to see how many times the user took the test, as this is stored in another table (personality.“swl_times_taken”); however, it's not possible to see the various scores that the user got.
    • From August 2010, a new project included additional questions concerning the significant life events that people have experienced; how happy or unhappy each event made them; and how kind/generous they consider themselves to be (IPIP VIA Kindness/Generosity). At this point, the questionnaire was modified so that the old results would be stored as well as new results.

Variable name | Description | Details
userid | Unique user identifier
SWL | Satisfaction With Life Score

Item-level data

Variable name | Description | Details
userid | Unique user identifier
q1 | In most ways, my life is close to ideal
q2 | The conditions of my life are excellent
q3 | I am satisfied with my life
q4 | So far I have gotten the important things I want in life
SWL taken | Date of the test
SWL | Score
PrimaryFirst | 1 = score retained as the main score for given user

Facebook Likes Reduced Data

LSI/SVD reduction

Given the original user-like binary matrix, with rows representing likes and columns representing users, we applied latent semantic indexing, such that as a result, each user is represented as a linear combination of “concepts”—or as a vector in concept space.

When applying LSI, each entry in the original matrix is usually given by the product of “local” information about a term in a document and “global” information about the same term in the whole collection, with the goal of reducing the impact of words that carry little information (noise words). The common problem is choosing these weights. We chose the weight functions that give the best prediction results for age, gender and Big Five personality traits:

  • Binary as local weight function
  • Sequentially (exactly in that order): Normal and Entropy as global weight functions

Please see:

LDA topical representation

Another dimensionality-reduction approach we provide is the result of Latent Dirichlet Allocation (LDA). We treat each user as a document containing words from a “dictionary of likes” and find a topical decomposition of users.

Each user is then represented as a weighted combination of topics. The difference between LSI and LDA is that topics discovered by LDA seem to be interpretable as a specific political view, sexual orientation, religion, movies/books, taste etc. Thus a user can be expressed as a weighted combination of certain nameable “aspects,” which gives an additional layer of possible interpretation.

We chose 600 topics because they seem to be interpretable and because the accuracy of the models predicting age, gender and Big Five traits is highest for that value of the parameter K.

We provide a set of users represented as mixtures of LDA topics, plus the set of topics themselves to help with interpreting the results. Have fun!

Schwartz's Values Survey

Here is the list of items.

Here is how to score the scale and use those scores in analyses.

Usually in the myPersonality data, 0 means that the participant did not submit a screen with that item on it (you'll see a whole row of 0s), and -1 means that they submitted a screen with that item on it but left that item blank (you'll see -1 anywhere in a dataset).

Since 0 and -1 are reserved, the rest of the SVS values have been increased so that -1 is scored as a 1, 0 = 2, … , 7 = 9. To clarify, 1 = “Opposed to my values” and 9 = “Of supreme importance.”
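
The coding described above can be undone with a small helper. This is a sketch under the stated convention only (0 = screen not submitted, -1 = item left blank, stored 1 to 9 = original -1 to 7); the function name and the status labels are hypothetical.

```python
def decode_svs(stored):
    """Map a stored SVS value to (status, original_value).

    original_value is on the original -1..7 response scale,
    or None when no answer was given.
    """
    if stored == 0:
        return ("not_shown", None)   # participant never submitted the screen
    if stored == -1:
        return ("blank", None)       # screen submitted, item left blank
    if 1 <= stored <= 9:
        return ("answered", stored - 2)
    raise ValueError(f"unexpected stored SVS value: {stored}")


print([decode_svs(value) for value in (0, -1, 1, 2, 9)])
```

Keeping the two kinds of missing data apart makes it possible to drop whole unsubmitted screens while treating isolated blanks differently.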

Job Self Efficacy Scale

Variable name | Description | Details
userid | Unique user identifier
q1 | I can successfully overcome obstacles at work
q2 | I can effectively handle difficult tasks at work
q3 | I have no problem meeting the expectations that my employer has for me
q4 | I can successfully organize and prioritize my duties at work
q5 | When at work, I am able to give full attention to my assignments
q6 | I am confident in my ability to meet most deadlines on my job
q7 | I am able to solve most work problems in a timely fashion
q8 | I am more capable at doing my job than most other employees

From Chen, G., Goddard, T. G., & Casper, W. J. (2004). Examination of the relationships among general and work-specific self-evaluations, work-related control beliefs, and job attitudes. Applied Psychology: An International Review, 53, 349-370

list_of_variables_available.txt · Last modified: 2017/12/08 14:47 by Michal Kosinski