Size matters when it comes to corpora. At 220 million words of text, the corpus used to create the second edition of the COBUILD dictionary in 1995 was ten times larger than the one used for the first edition, and 220 times larger than the first electronic corpora developed in the 1960s and early 1970s. Yet it was small compared with the ones we use today, some of which run to billions, not millions, of words.
To give an idea of the amount of information involved: suppose you are compiling the entry for a medium-frequency verb, such as proclaim. The British National Corpus (BNC), which was fixed at about 100 million words in 1993 and has not been expanded since, has just over 1,000 hits (known as ‘citations’) for proclaim. Looking at each of these citations and giving a good account of the word's meaning and behavior in a dictionary entry (or elsewhere) is feasible, given the time and the necessary expertise.
But what about high-frequency words like take (174,000 citations in the BNC) or hand (just under 50,000)? In today’s giant corpora the numbers are much higher: the most commonly used words, function words such as and among them, have millions of citations. Even relatively unusual words can have thousands.
Fortunately there are software tools and other methods for efficiently extracting the information a corpus contains. Modern corpus search software gives an overall picture of a word by displaying it on screen in a way that shows how it combines with other words.
It shows the search term with its collocates, the words it combines with most often, and tells you how significant these combinations are. Each collocational or grammatical category displayed can be expanded, so you can examine it in more detail if necessary.
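As a rough illustration of how such a tool might rank collocates, here is a minimal sketch. It assumes a corpus that is already tokenized into a list of words, counts co-occurrences within a small window around the search term, and scores them with logDice, one commonly used association measure. The article does not say which measure any particular software uses, and the function name `collocates` and its parameters are hypothetical.

```python
import math
from collections import Counter


def collocates(tokens, node, window=2):
    """Rank words found within `window` tokens of `node` by logDice.

    logDice = 14 + log2(2 * f_xy / (f_x + f_y)), where f_xy is the
    window-based co-occurrence count and f_x, f_y are corpus frequencies.
    This is an illustrative choice of measure, not the one any
    specific corpus tool is known to use.
    """
    freq = Counter(tokens)   # frequency of every word in the corpus
    pair = Counter()         # co-occurrence counts with the node word

    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pair[tokens[j]] += 1

    scores = {
        word: 14 + math.log2(2 * f_xy / (freq[node] + freq[word]))
        for word, f_xy in pair.items()
    }
    # Highest-scoring (most strongly associated) collocates first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    toy = "take a hand take a look give a hand".split()
    for word, score in collocates(toy, "take"):
        print(f"{word:10s} {score:.2f}")
```

On a real corpus the input would be millions of tokens, usually lemmatized and tagged for part of speech, so that collocates can be grouped by grammatical relation rather than by simple proximity.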
The second essential tool in the lexicographer’s arsenal is the sample. It was one of the insights of Professor John Sinclair, the founder of COBUILD, that you can tell a lot about the meaning and behavior of a word from a small representative sample of corpus citations: in many cases one or two screenfuls is enough.
So a combination of an overview of the collocational and grammatical behavior of a word, along with a more detailed look at a small sample of lines, usually provides enough information to compile a new entry or revise an existing one.
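The sampling idea described above can be sketched in a few lines. This is only an illustration under stated assumptions: the corpus is a plain list of tokens, the sample is drawn uniformly at random, and the function name `kwic_sample` and its parameters are hypothetical rather than part of any real corpus tool.

```python
import random


def kwic_sample(tokens, node, size=20, context=5, seed=0):
    """Return up to `size` randomly chosen concordance lines for `node`.

    Each line shows the node word with `context` tokens on either side
    (keyword-in-context layout). A fixed seed makes the sample repeatable.
    """
    hits = [i for i, tok in enumerate(tokens) if tok == node]
    rng = random.Random(seed)
    # Sample without replacement, then restore corpus order for reading.
    chosen = sorted(rng.sample(hits, min(size, len(hits))))

    lines = []
    for i in chosen:
        left = " ".join(tokens[max(0, i - context):i])
        right = " ".join(tokens[i + 1:i + 1 + context])
        # Right-align the left context so the node words line up.
        lines.append(f"{left:>40}  {node}  {right}")
    return lines


if __name__ == "__main__":
    toy = ("they proclaim their innocence and we proclaim it too " * 30).split()
    for line in kwic_sample(toy, "proclaim", size=5):
        print(line)
```

Twenty or so lines laid out this way fit on a single screen, which matches the scale of sample the article describes.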
It is worth remembering that without fast computer connections, corpora could not have grown to their current gigantic size.
In my early days as a freelancer, when dial-up was the only type of connection widely available, you could literally start a corpus search for a frequent word, go away and make a cup of tea, and come back to find the search still running. Today, with high-speed broadband, even searches for frequent words return results in a matter of seconds.
Corpora are used today in many different ways and for different purposes on different dictionary projects. At its most basic, a corpus can provide authentic examples of how a word is used.
At the other end of the scale, detailed corpus analysis continues to reveal new and surprising information about the collocational and grammatical behavior of even the most familiar words.
As new ways of using language come into being, a regularly updated corpus allows us to keep an eye on them. While the way corpora are built and used has changed over the last thirty years, it has become more or less unthinkable to compile or revise a dictionary without reference to the evidence provided by a corpus.