The name of this technique can be misleading for someone unfamiliar with research methods. Contrary to its name, content analysis does not examine the meanings of individual messages. Rather, it analyzes the way the messages have been delivered. Content analysis is a systematic, objective, and quantitative analysis of message characteristics. It can be used to examine any piece of writing or occurrence of recorded communication. Utilizing the Meaning Extraction Method in combination with content analysis can surface the major themes occurring within the sample data, based on the co-occurrence of high-frequency content words. Analyzing vast amounts of data is both challenging and time consuming. It is important to note that the presentation of information is just as important as, if not more important than, the message itself. Furthermore, even subtle changes in presentation provide valuable insights into message trends that are vital to the overall analysis.
For example, intelligence analysts can apply content analysis to current political speeches, comparing those targeted at domestic audiences with those aimed at international ones. They can also analyze propaganda for hidden policy objectives, or even for fears within the leadership or bureaucracy. In the study Lying Words: Predicting Deception from Linguistic Style, the researchers examined numerous truthful statements and lies; the final data suggested that truth-tellers used more first-person singular words and fewer negative-emotion words. The study Do Bilinguals Have Two Personalities? A Special Case of Frame Switching suggested that when bilinguals switch languages, their personality changes. In the Content Analysis of Jihadi Extremist Groups' Videos, the researchers analyzed the data to see how terrorist groups' campaigns and modes of operation change over time, to help suggest counterintelligence strategies.
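Since the Meaning Extraction Method mentioned above rests on the co-occurrence of high-frequency content words, a toy sketch may help make the idea concrete. The tokenizer, stop-word list, and sample documents below are hypothetical simplifications, not the published method itself.

```python
# A toy illustration of the co-occurrence idea behind the Meaning
# Extraction Method: find high-frequency content words, then count how
# often pairs of them appear in the same document. The stop-word list
# and sample documents are hypothetical simplifications.
from collections import Counter
from itertools import combinations

STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "that"}

def top_content_words(docs, n=50):
    """The n most frequent words after dropping function words."""
    counts = Counter(word for doc in docs for word in doc.lower().split()
                     if word.isalpha() and word not in STOP_WORDS)
    return {word for word, _ in counts.most_common(n)}

def cooccurrence(docs, vocab):
    """Document-level co-occurrence counts for each pair of vocab words."""
    pairs = Counter()
    for doc in docs:
        present = sorted(vocab & set(doc.lower().split()))
        pairs.update(combinations(present, 2))
    return pairs

docs = ["energy policy drives the economy",
        "economy and energy reform",
        "energy reform and foreign policy"]
vocab = top_content_words(docs, n=10)
print(cooccurrence(docs, vocab).most_common(3))
```

In the full method, a word-by-document frequency matrix like this is typically factor-analyzed to extract the underlying themes.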
Strengths
Unobtrusive: The content analysis technique allows the analyst to examine social interactions based on texts or transcripts. The subject does not know that the analyst is examining his or her messages; therefore, there is no chance of the presentation being altered before review.
Historical outline: Changes in the approach to presenting information become clear as time elapses and more data is brought into play. The more data you collect, and the longer you collect it, the better your analysis becomes.
Easy to Update: Once the study is set up, it is fairly easy to add additional transcripts into the database.
Deception: Any drastic deviation from the established norm without a plausible explanation can serve as an indicator of hidden activities. Such indicators can help direct collectors to search for the specific information that fills those knowledge gaps.
Accountability/Oversight: A well-designed study can be an extremely helpful tool for the accountability and oversight of the analysis. It provides team leaders and decision makers with a comprehensive way to follow the logic and execution of the study, while allowing the analyst to justify his or her conclusions.
Weaknesses
Extremely time consuming: Designing and executing a study can take a substantial amount of time, which is unavailable in a crisis and inconvenient for quick analysis. Because of the training required, the technique is also unsuitable for inexperienced analysts.
Coding: Human coding is inconsistent and prone to discrepancies, while creating an initial dictionary for computer coding can take years even for a highly trained professional.
Vulnerable to bias: There is no manual or set of guidelines on how to evaluate the results; therefore, the analysis is vulnerable to subjective interpretation. An inexperienced analyst without an in-depth understanding of the target will not be able to draw useful conclusions from the data.
Not always comprehensive: In some cases, it is simply impossible to incorporate all the facts, or to know whether the analyst has included enough data in the study.
How-To
In order to be considered solid scientific research, content analysis should be conducted in accordance with thorough procedures and guidelines. Researchers should follow the nine essential steps below to successfully utilize the content analysis technique. The outline and description of those steps, with minor adjustments, are taken directly from Kimberly A. Neuendorf's The Content Analysis Guidebook:
1. Theory and Rationale: What content will be examined, and why? Are there certain theories or perspectives that indicate that this particular message content is important to study? (Is there a difference in the way Putin addresses the domestic and the international populace? Does the way he communicates change over time?) Library work is needed here to conduct a good literature review. Will you be using an integrative mode, linking content analysis with other data to show relationships with source or receiver characteristics? Do you have research questions? Hypotheses?
2. Conceptualization: What variables will be used in the study, and how do you define them conceptually (i.e., with dictionary-type definitions)? Remember, you are the boss! There are many ways to define a given construct, and there is no one right way. You may want to screen some examples of the content you’re going to analyze, to make sure you’ve covered everything you want.
3. Operationalizations (measures): Your measures should match your conceptualizations (this is called internal validity). What unit of data collection will you use? You may have more than one unit (e.g., a by-utterance coding scheme and a by-speaker coding scheme). Are the variables measured well (i.e., at a high level of measurement, with categories that are exhaustive and mutually exclusive)? An a priori coding scheme describing all measures must be created. Both face and content validity may also be assessed at this point.
4. Coding: You will need to select the type of coding you are going to use. Two options are available:
Human Coding
4a. Coding schemes: You need to create the following materials:
a. Codebook (with all variable measures fully explained)
b. Coding form
Computer Coding
4b. Coding schemes: With computer text content analysis, you still need a codebook of sorts: a full explanation of your dictionaries and the method of applying them. You may use standard dictionaries (e.g., those in Hart's program, Diction) or originally created dictionaries. When creating custom dictionaries, be sure to first generate a frequency list from your text sample and examine it for key words and phrases (a minimal sketch of this computer-coding path appears after this list).
5. Sampling: Is a census of the content possible? (If yes, go to step 6; if no, go to step 7b.) How will you randomly sample a subset of the content? This could be by time period, by issue, by page, by channel, and so forth.
6. Training and pilot reliability: During a training session in which coders work together, find out whether they can agree on the coding of variables. Then, in an independent coding test note the reliability on each variable. At each stage, revise the codebook or coding form as needed.
7. Coding: Based on the above sampling (step 5), select the type of coding you are going to use. Two options are available:
Human Coding
7a. Coding: Use at least two coders to establish inter-coder reliability. Coding should be done independently, with at least 10% overlap for the reliability test.
Computer Coding
7b. Coding: Apply dictionaries to the sample text to generate per-unit (e.g., per-news-story) frequencies for each dictionary. Do some spot checks for validation. Skip step 8.
8. Final reliability: Calculate a reliability figure (percent agreement, Scott’s pi, Spearman’s rho, or Pearson’s r, for example) for each variable.
9. Tabulation and reporting: See various examples of content analysis results to see the way in which results can be reported. Figures and statistics may be reported one variable at a time (univariate), or variables may be cross-tabulated in different ways (bivariate and multivariate techniques). Over-time trends are also a common reporting method. In the long run, relationships between content analysis variables and other measures may establish criterion and construct validity.
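To make the computer-coding path concrete, here is a minimal, self-contained sketch of steps 4b, 7b, and 8: generating a frequency list from a sample, applying category dictionaries to produce per-unit percentages, and computing a simple percent-agreement reliability figure. The dictionaries, the sample sentence, and the coder labels are hypothetical illustrations, not a validated coding scheme.

```python
# A minimal sketch of the computer-coding path above, assuming simple
# whitespace tokenization. The dictionaries, sample text, and coder
# labels are hypothetical.
from collections import Counter

def frequency_list(texts):
    """Step 4b: word frequencies across the whole sample."""
    return Counter(w for t in texts for w in t.lower().split())

def apply_dictionaries(text, dictionaries):
    """Step 7b: percent of a unit's words falling in each category."""
    words = text.lower().split()
    total = max(len(words), 1)
    return {cat: 100 * sum(w in vocab for w in words) / total
            for cat, vocab in dictionaries.items()}

def percent_agreement(coder_a, coder_b):
    """Step 8: share of units coded identically by two coders."""
    return sum(a == b for a, b in zip(coder_a, coder_b)) / len(coder_a)

dictionaries = {"positive": {"hope", "progress", "friend"},
                "negative": {"crisis", "threat", "enemy"}}
speech = "we face a crisis but there is hope and real progress"
print(frequency_list([speech]).most_common(3))
print(apply_dictionaries(speech, dictionaries))
print(percent_agreement(["pos", "neg", "pos"], ["pos", "neg", "neg"]))
```

Note that percent agreement overstates reliability when categories are unbalanced, which is why chance-corrected measures such as Scott's pi are preferred for the final figure in step 8.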
Personal Application
Under the rule of Vladimir Putin, the Russian Federation went through a unique transformation designed to negate the chaotic reforms of Boris Yeltsin while reestablishing Russia's role as a vital player within the international community. Since Putin's speeches are readily available on the Kremlin's website, I decided to use the content analysis technique to examine their texts and see whether any patterns or trends would emerge. As I did not have any previous research experience, I chose to use the research paper Computerized Text Analysis of Al-Qaeda Transcripts by James W. Pennebaker and Cindy K. Chung as a guide. On a personal note, I would like to thank Cindy K. Chung for her continued help and clarification of the process throughout my experiment. I would never have been able to complete it without her help.
While I was trying to educate myself about the content analysis technique and all of the available software, I realized that I would not be able to do so in such a short time without outside help. I was researching the leading authors of textbooks published over the last few years when I came across Klaus Krippendorff's book Content Analysis: An Introduction to Its Methodology. I e-mailed Professor Krippendorff asking for guidance on selecting appropriate software for my research, with little hope of hearing back. I was extremely surprised to get a reply within a day, which included not only a recommendation for the LIWC software but also a copy of an unpublished article utilizing that program for computerized content analysis. I was thrilled! Every person I contacted from that point on was extremely eager to help. I would encourage anyone trying to replicate a content analysis study to contact the original authors for clarification and guidance.
I used the Computerized Text Analysis of Al-Qaeda Transcripts paper by James W. Pennebaker and Cindy K. Chung that Professor Krippendorff provided as a guide for my own research. Unfortunately, since I only had 10 weeks to learn this technique, I was unable to replicate the entire study. I did, however, learn how to use both the LIWC and the Hermetic Word Frequency Counter software.
I tried to create my own dictionary for this study using the Hermetic Word Frequency Counter software; unfortunately, it proved to be extremely impractical and time consuming. Normally, creating a dictionary requires the analyst to generate a word frequency list and then code all of the words into the desired categories. This requires hours of human coding and is extremely impractical for quick analysis.
Due to the time constraints, I chose to conduct a computerized content analysis and used the standard dictionary that came with the software. The Linguistic Inquiry and Word Count software (LIWC2001) proved to be an inexpensive and valuable tool. LIWC was designed to analyze single files or groups of text files, matching each word to the appropriate word-category scale in its standard dictionary. The output file produced up to 84 output variables for each text file. The data included 17 standard linguistic dimensions (e.g., word count, percentage of pronouns, articles), 25 word categories tapping psychological constructs (e.g., affect, cognition), 10 dimensions related to "relativity" (time, space, motion), and 19 personal-concern categories (e.g., work, home, leisure activities). The LIWC dictionary was composed of 2,290 of the most commonly used words and word stems, each assigned to one or more word categories or sub-dictionaries in order to capture its essence.
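To illustrate the kind of matching described above, here is a minimal sketch of LIWC-style categorization. The toy dictionary entries (plain words plus stems marked with an asterisk) and the category names are loose imitations of LIWC conventions, not the actual LIWC2001 dictionary or code.

```python
# A minimal sketch of LIWC-style word-to-category matching. The
# dictionary entries and categories below are illustrative stand-ins.
from collections import defaultdict

TOY_DICT = {
    "i": ["pronoun", "self"],
    "we": ["pronoun"],
    "happ*": ["affect", "posemo"],
    "hostil*": ["affect", "negemo", "anger"],
}

def categorize(word):
    """Return every category whose entry matches the word exactly,
    or by stem prefix for entries ending in '*'."""
    cats = []
    for entry, entry_cats in TOY_DICT.items():
        if entry.endswith("*"):
            if word.startswith(entry[:-1]):
                cats.extend(entry_cats)
        elif word == entry:
            cats.extend(entry_cats)
    return cats

def category_percentages(text):
    """Percent of all words in the text that hit each category."""
    words = text.lower().split()
    total = max(len(words), 1)
    hits = defaultdict(int)
    for word in words:
        for cat in categorize(word):
            hits[cat] += 1
    return {cat: 100 * n / total for cat, n in hits.items()}

# "happy" matches the stem "happ*"; "hostile" matches "hostil*".
print(category_percentages("we are happy but they remain hostile"))
```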
Since the Kremlin's database is not designed to allow searches on specific subjects (e.g., the nuclear or energy industry), I decided to pull the samples based on two distinctly different audiences. My ultimate goal was to see whether there is a difference in the way Vladimir Putin addresses domestic and international audiences, and whether the way he communicates changes over time.
To analyze data regarding the domestic audience, I downloaded all of Putin's annual addresses to the Federal Assembly over the past eight years. To evaluate his speeches to the international audience, I took a sample of his press statements from the Kremlin database under the Diplomacy and External Affairs section. While my original goal was to download all of the speeches in that section, I realized that it was simply unrealistic for me to process such a large amount of data in a very short time.
The text files were cleaned to convert the spelling of words from British English to American English. The final sample consisted of 8 annual addresses and 26 press statements from the Diplomacy and External Affairs section. LIWC then calculated the percentage of words within each of the 84 categories for any given speech. After the program processed all of the data and saved the output in Excel format, I calculated the average for each category. I then compared all of the data to the established averages to see which variables fell outside the "norm".
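This comparison step can be sketched as follows, assuming the LIWC output has already been loaded as one dict of category percentages per speech. The example rows and the one-standard-deviation cutoff are illustrative assumptions; the post does not specify how "outside the norm" was judged.

```python
# A minimal sketch of flagging speeches that deviate from the average
# in a given category. Rows and the threshold are hypothetical.
from statistics import mean, stdev

def flag_outliers(rows, category, threshold=1.0):
    """Return (index, value) pairs for speeches whose score in
    `category` deviates from the average by more than
    threshold * standard deviation."""
    values = [row[category] for row in rows]
    avg, sd = mean(values), stdev(values)
    return [(i, v) for i, v in enumerate(values)
            if abs(v - avg) > threshold * sd]

rows = [{"posemo": 2.1}, {"posemo": 2.3}, {"posemo": 4.8}, {"posemo": 2.0}]
print(flag_outliers(rows, "posemo"))  # [(2, 4.8)]: the third speech stands out
```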
I was extremely frustrated that there were no set guidelines on how to evaluate the data, which became a substantial drawback to utilizing this technique. After reviewing all of the data, however, I was surprised by the consistency of his speeches over time within both the domestic and international data sets.
Even though I was unable to determine the exact meaning of the data, it was very interesting to see the differences between the domestic and international presentations. The most striking trend within his speeches was the use of pronouns. He used "I" almost twice as much when addressing international audiences as the domestic populace, which would portray his command of the subject matter. His average use of "we" was 1.5 times higher in his addresses to international audiences, which could be interpreted as an inclusive technique. However, the I/we ratio for the domestic and international audiences was approximately the same (.32 and .35, respectively), which facilitated a strong connection between him and both audiences.
Putin's use of positive-emotion and positive-feeling words was roughly two times higher when addressing international audiences, while negative-emotion and anger words were three times higher in his addresses to the domestic audience. He seems to reinforce his degree of closeness to international audiences while simultaneously appealing to Russian xenophobia. He also referred to "friends" on average five times more often when addressing the international community than the domestic one.
There was a consistent pattern throughout all of his speeches across all categories, which led me to believe that they were all carefully compiled to follow a certain formula. Any future speeches that do not follow the same patterns should be carefully examined for possible changes in policy.