At the transcription workshop last month, Jens Normann Jørgensen showed some graphics which mapped the development of a bilingual conversation over time, and they struck me as a very interesting way of trying to grasp the overall profile of the conversation. This inspired me to see if I could use R to produce something similar for the ESRC corpora.
Normann categorised the utterances in his conversations into 5 groups (Danish utterances with no loan, Danish-based utterances with loans, code-switching utterances and other utterances, Turkish-based utterances with loans, and Turkish utterances with no loan), to give profiles like this:
For further details, see pp320ff of:
J. N. Jørgensen (2008): Languaging: Nine years of poly-lingual development of young Turkish-Danish grade school students (2 volumes), Copenhagen Studies in Bilingualism, Volumes 15 and 16, Copenhagen.
The approach I have used is altogether more basic – I calculate the number or percentage of words in each utterance that are tagged in the transcript as belonging to a specific language, and then graph each utterance as a vertical line, with the numbers or percentages stacked.
In the profiles below, blue is Spanish or Welsh, yellow is English, and purple is undetermined (ie the item occurs in a printed dictionary for more than one language). First, the profiles based on numbers (with the scale on the y-axis):
These two conversations are very different: in sastre1 (from the Miami corpus), the speakers move into and out of both Spanish and English; in stammers4 (from the Welsh Siarad corpus), the conversation is almost monolingual, except for a few English words, and there is a greater use of indeterminate items.
The profiles based on percentages make the general “texture” of the conversation clearer, but overemphasise indeterminate items, since they give a one-word utterance the same prominence as a multiword utterance: