
C (programming language)

From Wikipedia, the free encyclopedia
This is an old revision of this page, as edited by Dcoetzee (talk | contribs) at 04:43, 6 April 2005 (Types: Fix arrays changes; see talk page). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.
Figure: The C Programming Language, by Brian Kernighan and Dennis Ritchie, the original edition that served for many years as an informal specification of the language

The C programming language is a low-level standardized programming language developed in the early 1970s by Ken Thompson and Dennis Ritchie for use on the UNIX operating system. It has since spread to many other operating systems, and is one of the most widely used programming languages. C is prized for its efficiency, and is the most popular programming language for writing system software, though it is also used for writing applications. It is also commonly used in computer science education, despite not being designed for novices.

Features

Overview

C is a relatively minimalist programming language that operates close to the hardware, and is more similar to assembly language than most other programming languages. Indeed, C is sometimes referred to as "portable assembly," reflecting its important difference from assembly languages: C code can be compiled for and run on almost any machine, more than any other language in existence, while assembly languages run on at most a few very specific models of machines. C is typically called a low level or medium level language, indicating how closely it operates with the hardware.

This is no accident; C was created with one important goal in mind: to make it easier to write large programs with fewer errors in the procedural programming paradigm, but without putting a burden on the writer of the C compiler, who would otherwise be encumbered by complex language features. To this end, C has the following important features:

Some features that C lacks that are found in other languages include:

Although the list of useful features C lacks is long, this has not been important to its acceptance, because it allows new compilers to be written quickly for it on new platforms, and because it keeps the programmer in close control of what the program is doing. This is what often allows C code to run more efficiently than many other languages. Typically only hand-tuned assembly language code runs more quickly, since it has complete control of the machine, but advances in compilers along with new complexity in modern processors have quickly narrowed this gap.

One consequence of C's wide acceptance and efficiency is that the compilers, libraries, and interpreters of other higher-level languages are often implemented in C.

"Hello, World!" example

The following simple application appeared in the first edition of K&R, and has become a standard introductory program in most textbooks on C. The program prints out "Hello, World!" to standard output (which is usually the screen, but might be a file or some other hardware device or perhaps even the bit bucket depending on how standard output is mapped at the time the program is executed).


main()
{
    printf("Hello, World!\n");
}

Although the above program will compile correctly under most modern compilers when invoked in a non-conforming mode, it now produces several warning messages when compiled with a compiler that conforms to the ANSI C standard. (Additionally, the code will not compile if the compiler strictly conforms to the C99 standard, as a return value of type int will no longer be inferred if the source code has not specified otherwise.) These messages can be eliminated with a few minor modifications to the original program:


#include <stdio.h>

int main(void)
{
    printf("Hello, World!\n");

    return 0;
}

The first line of the program is an #include preprocessing directive, which causes the compiler to substitute for that line the entire text of the file (or other entity) it refers to; in this case the standard header stdio.h will replace that line. The angle brackets indicate that the stdio.h header is to be found in whatever place is designated for the compiler to find standard headers.

The next (non-blank) line indicates that a function named "main" is being defined; the main() function is special in C programs, as it is the function that is first run when the program starts (for hosted implementations of C, and leaving aside "housekeeping" code). The curly brackets delimit the extent of the function. The int defines "main" as a function that returns, or evaluates to, an integral number; the void indicates that no arguments or data must be given to function main by its caller.

The next line "calls", or executes, a function named printf; the included header, stdio.h, contains information describing how the printf function is to be called. In this call, the printf function is passed a single argument, the constant string "Hello, World!\n"; the sequence \n is translated to a "newline" character, which when displayed causes a line break. printf returns a value, an int, but since it is not used it is quietly discarded.

The return statement tells the program to exit the current function (in this case main), returning the value zero to the function that called the current function. Since the current function is main, the caller is whatever started our program. Finally, the close curly bracket indicates the end of the function main.

Comment text

Note that text surrounded by /* and */ (comment text) is ignored by the compiler. C99-compliant compilers also allow comments to be introduced with //, indicating that the comment extends to the end of the current line.

Types

C has a type system similar to that of other ALGOL descendants such as Pascal. There are types for integers of various sizes, both signed and unsigned, floating-point numbers, characters, enumerated types (type enum), and records (type struct). In addition, C offers the union type, which represents data in the same portion of memory as multiple data types.
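The record, enumeration, and union types described above can be combined, as in the following minimal sketch. The names shape_kind, rect, shape_data, shape, and shape_area are hypothetical and chosen only for illustration.

```c
/* An enumerated type: a set of named integer constants. */
enum shape_kind { SHAPE_CIRCLE, SHAPE_RECT };

/* A record type: its members are stored one after another. */
struct rect {
    double width;
    double height;
};

/* A union: all members occupy the same portion of memory,
   so only one of them holds a meaningful value at a time. */
union shape_data {
    double radius;      /* meaningful when kind == SHAPE_CIRCLE */
    struct rect rect;   /* meaningful when kind == SHAPE_RECT */
};

/* The enum member records which union member is currently in use. */
struct shape {
    enum shape_kind kind;
    union shape_data data;
};

double shape_area(const struct shape *s)
{
    if (s->kind == SHAPE_CIRCLE)
        return 3.14159265358979 * s->data.radius * s->data.radius;
    return s->data.rect.width * s->data.rect.height;
}
```

Storing the kind next to the union is a common convention rather than a language requirement; C itself does not track which union member was last written.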

C makes extensive use of pointers, a very simple type of reference that stores the address of a memory location. Pointers can be dereferenced, which causes them to refer to the data stored at their address, rather than the address itself. The address can be manipulated with regular assignment and pointer arithmetic. At runtime, a pointer represents a memory address. At compile-time, it is a complex type that represents both the address and the type of the data. This allows expressions including pointers to be type-checked. Pointers are used for many different purposes in C. Text strings are commonly represented with a pointer to an array of characters. Dynamic memory allocation, which is described below, is performed using pointers.
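Dereferencing, pointer arithmetic, and the use of a character pointer as a string might look as follows. This is an illustrative sketch; the function names sum_ints, string_length, and set_to_zero are assumptions made for the example.

```c
#include <stddef.h>

/* Add up n ints, starting at the address held in p. */
int sum_ints(const int *p, size_t n)
{
    const int *end = p + n;   /* pointer arithmetic: n elements past p */
    int total = 0;

    while (p != end) {
        total += *p;          /* dereference: read the int stored at p */
        p++;                  /* advance by one int, not by one byte */
    }
    return total;
}

/* Length of a text string, found by walking a char pointer
   up to the terminating '\0' character. */
size_t string_length(const char *s)
{
    const char *p = s;

    while (*p != '\0')
        p++;
    return (size_t)(p - s);   /* distance between two pointers */
}

/* Writing through a pointer changes the caller's object. */
void set_to_zero(int *target)
{
    *target = 0;
}
```

Because the pointer type records the element type, the compiler can reject a call such as sum_ints("text", 4) or at least warn about it.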

There also exist null pointers, which compare equal to the NULL constant. A null pointer points to no object, which makes it useful as a marker in data structures, for example to terminate a linked list. The behavior of dereferencing a null pointer is undefined. Pointers to void (void *) also exist; they are distinct from null pointers, since a void pointer normally holds the address of a real object. Unlike other pointers, void pointers do not have any data type associated with them, so they must be converted to another pointer type before being dereferenced.
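A null pointer serving as an end marker, and a void pointer being converted before use, can be sketched as follows. The names node, list_length, and read_int are hypothetical.

```c
#include <stddef.h>

/* A singly linked list node; next == NULL marks the end of the list. */
struct node {
    int value;
    struct node *next;
};

/* Count the nodes by following next pointers until NULL is reached.
   The pointer is tested before every dereference. */
size_t list_length(const struct node *head)
{
    size_t n = 0;

    while (head != NULL) {
        n++;
        head = head->next;
    }
    return n;
}

/* A void pointer carries an address but no element type. It is
   converted to an int pointer here before it can be dereferenced.
   The caller must guarantee that p really points to an int. */
int read_int(const void *p)
{
    const int *ip = p;

    return *ip;
}
```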

Array types in C have a fixed, static size known at compile-time; this is not much of a hindrance in practice, since one can allocate blocks of memory at runtime using the standard library and treat them like arrays. Specific elements of an array are referenced using the syntax array_name[index_value], where array_name is the name of the array and index_value is the index, or offset, of the sought element. Unlike many other languages, C represents an array in much the same way as a pointer: as a memory address and a data type, with no record of its length. As a result, nothing prevents an index value from exceeding the actual size of an array.
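The relationship between indexing and pointers, and the use of a block allocated at runtime as if it were an array, might be sketched like this. The helper names make_squares and element_at are assumptions made for the example.

```c
#include <stdlib.h>

/* Allocate n ints at runtime and fill them with squares.
   Returns NULL if the allocation fails; the caller must free()
   the result. */
int *make_squares(size_t n)
{
    int *a = malloc(n * sizeof *a);
    size_t i;

    if (a == NULL)
        return NULL;
    for (i = 0; i < n; i++)
        a[i] = (int)(i * i);   /* same syntax as for a declared array */
    return a;
}

/* a[i] is defined in terms of pointer arithmetic as *(a + i).
   No check is made that i lies within the array. */
int element_at(const int *a, size_t i)
{
    return *(a + i);
}
```

A caller might write int *s = make_squares(5); and then use s[4] exactly as it would use an element of an array declared with a fixed size.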

There also exist so-called multi-dimensional arrays. The elements of these arrays are laid out in row-major order. The proper syntax for referring to an element of a two-dimensional array is array_name[row][column]; although this looks like it is performing multiple pointer dereferences, in reality it performs an index calculation using multiplication, and pointers to each row are not stored anywhere in memory.
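The index calculation for a two-dimensional array can be checked with a short sketch. The dimensions ROWS and COLS and the function names are hypothetical; the second function measures where an element actually lies so that it can be compared with the row-major formula.

```c
#include <stddef.h>

#define ROWS 3
#define COLS 4

/* Row-major rule: each full row of COLS elements comes before
   the next row, so element [row][col] is this many elements
   from the start of the array. */
size_t row_major_offset(size_t row, size_t col)
{
    return row * COLS + col;
}

/* Measure the real distance, in elements, between the start of the
   array and grid[row][col] by comparing byte addresses. */
ptrdiff_t measured_offset(int grid[ROWS][COLS], size_t row, size_t col)
{
    const char *base = (const char *)grid;
    const char *elem = (const char *)&grid[row][col];

    return (elem - base) / (ptrdiff_t)sizeof(int);
}
```

For a 3 by 4 array, element [2][1] is therefore 2 * 4 + 1 = 9 elements from the start, and the whole array occupies ROWS * COLS * sizeof(int) bytes with no additional row pointers.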

Because C is often used in low-level systems programming, there are cases where it may actually be necessary to treat an integer as a memory address, a double-precision value as an integer, or one type of pointer as another. For these instances, C provides casting, which forces the explicit conversion of a value from one type to another. The use of casts sacrifices some of the safety normally provided by the type system. The syntax for casting some value to type int would be (int)value.
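A few typical casts are sketched below, using hypothetical function names. The first two convert between arithmetic types; the third reinterprets a pointer so that the first byte of any object can be examined.

```c
/* Converting a double to int discards the fractional part. */
int truncate_to_int(double value)
{
    return (int)value;
}

/* Without the cast, sum / count would be an integer division. */
double average(int sum, int count)
{
    return (double)sum / count;
}

/* Treat one type of pointer as another: view the first byte of any
   object through a pointer to unsigned char. */
unsigned char first_byte(const void *object)
{
    return *(const unsigned char *)object;
}
```

With these definitions, average(7, 2) yields 3.5, whereas the plain expression 7 / 2 yields 3.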

Data storage

One of the most important functions of a programming language is to provide facilities for managing memory and the objects that are stored in memory. C provides three distinct ways of allocating memory for objects:

  • Static memory allocation: space for the object is provided in the binary at compile-time; these objects have a lifetime as long as the binary which contains them exists
  • Automatic memory allocation: temporary objects can be stored on the stack, and this space is automatically freed and reusable after the block they are declared in is left
  • Dynamic memory allocation: blocks of memory of any desired size can be requested at run-time using the library functions malloc(), realloc(), and free() from a region of memory called the heap; these blocks are reused after free() is called on them

These three approaches are appropriate in different situations and have various tradeoffs. For example, static memory allocation has no allocation overhead, automatic allocation has a small amount of overhead during initialization, and dynamic memory allocation can potentially have a great deal of overhead for both allocation and deallocation. On the other hand, stack space is typically much more limited than either static memory or heap space, and only dynamic memory allocation allows allocation of objects whose size is only known at run-time. Most C programs make extensive use of all three.
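The three kinds of allocation can appear side by side in a small program, as in this sketch. All names used here (call_count, next_call_number, sum_of_first, make_counter) are hypothetical.

```c
#include <stdlib.h>

/* Static allocation: exists for the whole run of the program and
   starts out as zero. */
static int call_count;

int next_call_number(void)
{
    return ++call_count;
}

/* Automatic allocation: buffer exists only while the function runs. */
int sum_of_first(int n)
{
    int buffer[8];
    int i;
    int total = 0;

    if (n > 8)
        n = 8;                /* never write past the fixed size */
    for (i = 0; i < n; i++)
        buffer[i] = i + 1;
    for (i = 0; i < n; i++)
        total += buffer[i];
    return total;
}

/* Dynamic allocation: the object lives until the caller passes it
   to free(). Returns NULL if no memory is available. */
int *make_counter(int start)
{
    int *p = malloc(sizeof *p);

    if (p != NULL)
        *p = start;
    return p;
}
```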

Where possible, automatic or static allocation is usually preferred because the storage is managed by the compiler, freeing the programmer of the error-prone hassle of manually allocating and releasing storage. Unfortunately, many data structures can grow in size at runtime; since automatic and static allocations must have a fixed size at compile-time, there are many situations in which dynamic allocation must be used. Variable-sized arrays are a common example of this (see "malloc" for an example of dynamically allocated arrays).
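A data structure that grows at runtime is commonly built on realloc(). The following is one possible sketch of a growable array of int; the type int_vector, its functions, and the doubling policy are assumptions made for the example, not part of the standard library.

```c
#include <stdlib.h>

struct int_vector {
    int *items;        /* heap block, or NULL while empty */
    size_t length;     /* number of elements in use */
    size_t capacity;   /* number of elements the block can hold */
};

/* Append one value, growing the block when it is full.
   Returns 1 on success, or 0 if memory could not be obtained,
   in which case the vector is left unchanged. */
int vector_push(struct int_vector *v, int value)
{
    if (v->length == v->capacity) {
        size_t new_capacity = v->capacity ? v->capacity * 2 : 4;
        int *grown = realloc(v->items, new_capacity * sizeof *grown);

        if (grown == NULL)
            return 0;
        v->items = grown;
        v->capacity = new_capacity;
    }
    v->items[v->length++] = value;
    return 1;
}

/* Release the storage and return the vector to its empty state. */
void vector_free(struct int_vector *v)
{
    free(v->items);
    v->items = NULL;
    v->length = 0;
    v->capacity = 0;
}
```

A vector is started empty with struct int_vector v = { NULL, 0, 0 }; and must eventually be passed to vector_free().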

Syntax

See main article: C syntax

Problems

C permits many operations that are generally not desirable, and thus many simple errors made by a programmer are not detected by the compiler or even when they occur at runtime, leading to programs with unpredictable behavior and security holes. Part of the reason for this is to avoid compile and runtime checks that were too expensive when C was originally designed. Rather than placing these checks in the compiler, additional tools, such as lint, were used. Today many tools are available to allow a C programmer to detect or correct various common problems.

One problem is that automatically and dynamically allocated objects are not initialized; they initially have whatever value is present in the memory space they are assigned. This value is highly unpredictable, and can vary between two machines, two program runs, or even two calls to the same function. If the program attempts to use such an uninitialized value, the results are usually unpredictable. Most modern compilers detect and warn about this problem in some restricted cases.
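Two common ways of avoiding uninitialized values are sketched below with hypothetical function names: giving an automatic variable an explicit initial value, and requesting zero-filled dynamic memory with the standard calloc() function.

```c
#include <stdlib.h>

/* The counter is given an explicit starting value; without "= 0"
   its initial contents would be indeterminate. */
int count_positive(const int *values, size_t n)
{
    int count = 0;
    size_t i;

    for (i = 0; i < n; i++)
        if (values[i] > 0)
            count++;
    return count;
}

/* calloc(), unlike malloc(), sets every byte of the block to zero.
   Returns NULL on failure; the caller must free() the result. */
int *new_zeroed_ints(size_t n)
{
    return calloc(n, sizeof(int));
}
```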

Pointers are one primary source of danger; because they are unchecked, a pointer can be made to point to any object of any type, including code, and then written to, causing unpredictable effects. Although most pointers point to safe places, they can be moved to unsafe places using pointer arithmetic, the memory they point to may be deallocated and reused (dangling pointers), they may be uninitialized (wild pointers), or they may be directly assigned any value using a cast or through another corrupt pointer. Another problem with pointers is that C freely allows conversion between any two pointer types. Other languages attempt to address these problems by using more restrictive reference types.

Although C has native support for static arrays, it does not verify that array indexes are valid (bounds checking). For example, one can write to the sixth element of an array with five elements, yielding unpredictable results. This is called a buffer overflow. This has been notorious as the source of a number of security problems in C-based programs.
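Since the language performs no bounds checking, a program that wants it must do the check itself. A minimal sketch, assuming a hypothetical macro COUNT_OF and function checked_store:

```c
#include <stddef.h>

/* Number of elements in a true array. This does not work on a
   pointer, which carries no length information. */
#define COUNT_OF(a) (sizeof (a) / sizeof (a)[0])

/* Store value at array[index] only if index is within bounds.
   Returns 1 if the store was performed, 0 if it was refused. */
int checked_store(int *array, size_t length, size_t index, int value)
{
    if (index >= length)
        return 0;
    array[index] = value;
    return 1;
}
```

With int five[5]; a call such as checked_store(five, COUNT_OF(five), 5, 9) is refused, whereas the plain assignment five[5] = 9 would compile and overflow the buffer.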

Another common problem is that heap memory cannot be reused until it is explicitly released by the programmer with free(). The result is that if the programmer accidentally forgets to free memory, but continues to allocate it, more and more memory will be consumed over time. This is called a memory leak. Conversely, it is possible to release memory too soon, and then continue to use it. Because the allocation system can reuse the memory at any time for unrelated reasons, this results in insidiously unpredictable behavior. These issues in particular are ameliorated in languages with automatic garbage collection.
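One widely used convention for reducing both kinds of error is to pair every allocation with exactly one release and to clear the pointer afterwards, so that a stale address cannot be used by accident. The sketch below assumes hypothetical helpers duplicate_string and release_string.

```c
#include <stdlib.h>
#include <string.h>

/* Return a heap-allocated copy of text, or NULL on failure.
   The caller owns the copy and must release it exactly once. */
char *duplicate_string(const char *text)
{
    size_t size = strlen(text) + 1;   /* include the '\0' */
    char *copy = malloc(size);

    if (copy != NULL)
        memcpy(copy, text, size);
    return copy;
}

/* Free the string and set the caller's pointer to NULL. A second
   call is then harmless, because free(NULL) does nothing. */
void release_string(char **text)
{
    free(*text);
    *text = NULL;
}
```

This convention only protects the one pointer that is cleared; other copies of the same address remain dangling.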

Yet another common problem is variadic functions, which take a variable number of arguments. Unlike other prototyped C functions, checking the arguments of variadic functions at compile-time is not mandated by the standard. If the wrong type of data is passed, the effect is unpredictable, and often fatal. Variadic functions also handle null pointer constants in an unexpected way. For example, the printf family of functions supplied by the standard library, used to generate formatted text output, is notorious for its error-prone variadic interface, which relies on a format string to specify the number and type of trailing arguments. Type-checking of variadic functions from the standard library is a quality-of-implementation issue, however, and many modern compilers do type-check printf calls in particular, producing warnings if the argument list is inconsistent with the format string. Not all printf calls can be checked statically (this becomes difficult as soon as the format string itself comes from somewhere hard to trace), and other variadic functions typically remain unchecked.
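A variadic function is written with the macros from the standard header stdarg.h. The following sketch, using a hypothetical function sum_all, shows why such functions are fragile: the function trusts its first argument to describe the rest.

```c
#include <stdarg.h>

/* Add up count values of type int passed after count.
   The compiler cannot verify that the caller really passes
   exactly count arguments, or that each one is an int. */
int sum_all(int count, ...)
{
    va_list args;
    int total = 0;
    int i;

    va_start(args, count);
    for (i = 0; i < count; i++)
        total += va_arg(args, int);   /* type is taken on trust */
    va_end(args);
    return total;
}
```

A call such as sum_all(3, 1, 2, 3) behaves as expected, but sum_all(3, 1, 2) or sum_all(2, 1.5, 2.5) would typically compile without complaint and then have undefined behavior.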

Tools have been created to help C programmers avoid many of these errors. Automated source code checking and auditing is fruitful in any language, and for C many such tools exist, such as lint. A common practice is to use lint to detect questionable code when a program is first written. Once a program passes lint, it is then compiled using the C compiler. There are also libraries for performing array bounds checking and a limited form of automatic garbage collection, but they are not a standard part of C.

History

Early developments

The initial development of C occurred at AT&T Bell Labs between 1969 and 1973; according to Ritchie, the most creative period occurred in 1972. It was named "C" because many of its features were derived from an earlier language called "B". Accounts differ regarding the origins of the name "B": Ken Thompson credits the BCPL programming language, but he had also created a language called Bon in honor of his wife Bonnie.

There are many legends as to the origin of C and its related operating system, Unix, including:

  • The development of C was the result of the programmers' desire to play an Asteroids-like game. They had been playing it on their company's mainframe, but because the machine was underpowered and had to support about 100 users, Thompson and Ritchie found they didn't have sufficient control over the spaceship to avoid collisions with the wandering space rocks. Thus, they decided to port the game to an idle PDP-7 in the office. But it didn't have an operating system (OS), so they set about writing one. Eventually they decided to port the operating system to the office's PDP-11, but this was onerous since all the code was in assembly language. They decided to use a higher-level portable language so the OS could be ported easily from one computer to another. They looked at using B, but it lacked functionality to take advantage of some of the PDP-11's advanced features. So they set about creating the new language, C.
  • The justification for obtaining the original computer that was used to develop Unix was to create a system to automate the filing of patents. The original version of Unix was developed in assembly language. Later, the C language was developed in order to rewrite the operating system.

By 1973, the C language had become powerful enough that most of the UNIX kernel, originally written in PDP-11/20 assembly language, was rewritten in C. This was one of the first operating system kernels implemented in a language other than assembly, earlier instances being the Multics system (written in PL/I) and TRIPOS (written in BCPL).

K&R C

In 1978, Ritchie and Brian Kernighan published the first edition of The C Programming Language. This book, known to C programmers as "K&R", served for many years as an informal specification of the language. The version of C that it describes is commonly referred to as "K&R C." (The second edition of the book covers the later ANSI C standard, described below.)

K&R introduced the following features to the language:

  • struct data types
  • long int data type
  • unsigned int data type
  • The =+ operator was changed to +=, and so forth (=+ was confusing the C compiler's lexical analyzer; for example, i =+ 10 compared with i = +10).

K&R C is often considered the most basic part of the language that is necessary for a C compiler to support. For many years, even after the introduction of ANSI C, it was considered the "lowest common denominator" that C programmers stuck to when maximum portability was desired, since not all compilers were updated to fully support ANSI C, and reasonably well-written K&R C code is also legal ANSI C.

In the years following the publication of K&R C, several "unofficial" features were added to the language, supported by compilers from AT&T and some other vendors. These included:

  • void functions and void * data type
  • functions returning struct or union types
  • struct field names in a separate name space for each struct type
  • assignment for struct data types
  • const qualifier to make an object read-only
  • a standard library incorporating most of the functionality implemented by various vendors
  • enumerations
  • the single-precision float type

ANSI C and ISO C

During the late 1970s, C began to replace BASIC as the leading microcomputer programming language. During the 1980s, it was adopted for use with the IBM PC, and its popularity began to increase significantly. At the same time, Bjarne Stroustrup and others at Bell Labs began work on adding object-oriented programming language constructs to C. The language they produced, called C++, is now the most common application programming language on the Microsoft Windows operating system; C remains more popular in the Unix world. Another language developed around that time is Objective-C, which also adds object-oriented programming to C. Although it is now less popular than C++, it is used to develop Mac OS X's Cocoa applications.

In 1983, the American National Standards Institute (ANSI) formed a committee, X3J11, to establish a standard specification of C. After a long and arduous process, the standard was completed in 1989 and ratified as ANSI X3.159-1989 "Programming Language C". This version of the language is often referred to as ANSI C. In 1990, the ANSI C standard (with a few minor modifications) was adopted by the International Organization for Standardization (ISO) as ISO/IEC 9899:1990.

One of the aims of the ANSI C standardization process was to produce a superset of K&R C, incorporating many of the unofficial features subsequently introduced. However, the standards committee also included several new features, such as function prototypes (borrowed from C++), and a more capable preprocessor.

ANSI C is now supported by almost all the widely used compilers. Most of the C code being written nowadays is based on ANSI C. Any program written only in standard C is guaranteed to perform correctly on any platform with a conforming C implementation. However, many programs have been written that will only compile on a certain platform, or with a certain compiler, due to (i) the use of non-standard libraries, e.g. for graphical displays; (ii) some compilers not adhering to the ANSI C standard, or its successor, in their default mode; or (iii) reliance on the exact size of certain data types or on the endianness of the platform.

C99

After the ANSI standardization process, the C language specification remained relatively static for some time, whereas C++ continued to evolve. (Normative Amendment 1 created a new version of the C language in 1995, but this version is rarely acknowledged.) However, the standard underwent revision in the late 1990s, leading to the publication of ISO 9899:1999 in 1999. This standard is commonly referred to as "C99". It was adopted as an ANSI standard in March 2000.

The new features in C99 include:

  • inline functions
  • variables can be declared anywhere (as in C++), rather than only after another declaration or the start of a compound statement
  • several new data types, including long long int (to reduce the pain of the looming 32-bit to 64-bit transition), an explicit boolean data type, and a complex type representing complex numbers
  • variable-length arrays
  • support for one-line comments beginning with //, borrowed from C++, and which many C compilers have been supporting as an extension
  • several new library functions, such as snprintf()
  • several new header files, such as stdint.h
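Several of the features listed above can be seen together in a short sketch. It assumes a compiler running in C99 mode or later, and the function names square, format_value, and sum_squares are hypothetical.

```c
#include <stdbool.h>   // the boolean type, new in C99
#include <stdint.h>    // fixed-width integer types such as int64_t
#include <stdio.h>     // snprintf()

// One-line comments like this one are permitted in C99.

// An inline function.
static inline int64_t square(int64_t x)
{
    return x * x;
}

// Write value in decimal into buffer. snprintf() never writes more
// than size bytes; the result is true only if everything fitted.
bool format_value(char *buffer, size_t size, long long value)
{
    int written = snprintf(buffer, size, "%lld", value);

    return written >= 0 && (size_t)written < size;
}

// Sum of the squares 1*1 + 2*2 + ... + n*n.
long long sum_squares(int n)
{
    long long squares[n > 0 ? n : 1];   // variable-length array

    for (int i = 0; i < n; i++)         // declaration inside for
        squares[i] = square(i + 1);

    long long total = 0;                // declaration after a statement
    for (int i = 0; i < n; i++)
        total += squares[i];
    return total;
}
```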

Interest in supporting the new C99 features appears to be mixed. Whereas GCC and several other compilers now support most of the new features of C99, the compilers maintained by Microsoft and Borland do not, and these two companies do not seem to be interested in adding such support.

Relation to C++

The C++ programming language was originally derived from C. However, contrary to popular opinion, not every C program is a valid C++ program. As C and C++ have evolved independently, there has been an unfortunate growth in the number of incompatibilities between the two languages [1]. The latest revision of C, C99, created a number of additional conflicting features. The differences make it hard to write programs and libraries that are compiled and function correctly as either C or C++ code, and confuse those who program in both languages. The disparity also makes it hard for either language to adopt features from the other one.

Bjarne Stroustrup, the creator of C++, has repeatedly suggested [2] that the incompatibilities between C and C++ should be reduced as much as possible in order to maximize interoperability between the two languages. Others have argued that since C and C++ are two different languages, compatibility between them is useful but not vital; according to this camp, efforts to reduce incompatibility should not hinder attempts to improve each language in isolation.

Today, the primary differences between the two languages are:

  • inline functions are in the global scope in C++, and in the file (so-called "static") scope in C. In simple terms, this means that in C++, any definition of any inline function (but irrespective of C++ function overloading) must conform to C++'s "One Definition Rule" or ODR, requiring that either there be a single definition of any inline function or that all definitions be semantically equivalent; but that in C, the same inline function could be defined differently in different translation units (translation unit typically refers to a file).
  • The bool keyword in C99 is in its own header, <stdbool.h>. Previous C standards did not define a boolean type, and various (incompatible) methods were used to simulate a boolean type.
  • Character constants (enclosed in single quotes) have the size of an int in C and a char in C++. That is to say, in C, sizeof('a') == sizeof(int); in C++, sizeof('a') == sizeof(char). Nevertheless, even in C they will never exceed the values that a char can store, so (char)'a' is a safe conversion.
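The difference in the size of character constants can be observed directly. The sketch below, with hypothetical function names, gives the stated results only when compiled as C; compiled as C++, the first function would return 0 on platforms where int is larger than char.

```c
/* In C a character constant has type int, so its size is that of
   an int rather than that of a char. */
int char_constant_is_int_sized(void)
{
    return sizeof('a') == sizeof(int);
}

/* The value of the constant still fits in a char, so converting
   it to char and comparing it again loses nothing. */
int char_constant_round_trips(void)
{
    char c = (char)'a';

    return c == 'a';
}
```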

C99 adopted some features that first appeared in C++. Among them are:

  • Mandatory declaration of functions before they are called (implicit function declarations were removed)
  • Line comments, indicated by //; line comments end with a newline character
  • The inline keyword
  • The removal of the "implicit int" return value

References

  • Brian Kernighan, Dennis Ritchie: The C Programming Language. Also known as K&R — The original book on C.
    • 1st, Prentice Hall 1978; ISBN 0-131-10163-3. Pre-ANSI C.
    • 2nd, Prentice Hall 1988; ISBN 0-131-10362-8. ANSI C.
  • British Standards Institute: The C Standard, John Wiley & Sons, ISBN 0-470-84573-2. The official ISO standard (C99) in book form.
  • Samuel P. Harbison, Guy L. Steele: C: A Reference Manual. This book is excellent as a definitive reference manual, and for those working on C compilers and processors. The book contains a BNF grammar for C.
    • 4th, Prentice Hall 1994; ISBN 0-133-26224-3.
    • 5th, Prentice Hall 2002; ISBN 0-130-89592-X.
  • Robert Sedgewick: Algorithms in C, Addison-Wesley, ISBN 0-201-31452-5 (Part 1–4) and ISBN 0-201-31663-3 (Part 5)
