MAW 97 Theme Essay: Math and the Internet

Mathematics and the Internet
MAW 97 Theme Essay
(full version 2.1)

by Paul Davis
pwdavis@wpi.edu
Mathematical Sciences Department
Worcester Polytechnic Institute

Home
What's New?

Local Activities
National Events

Articles
Web Links
Visuals

Planning&Advertising
About This Site

Table of Contents: Introduction
Managing data on the Internet
Security on the Internet
Databases and searching
Routing and network configuration
Mathematics on the Web
Mathematics and the Internet

Introduction

he relationship between mathematics and the Internet is like that between language and the works of Shakespeare: his work could not have been conceived without language, while his poems and plays have enriched language as it evolved.

Computers were born in the language of mathematics. Binary numbers let computers represent words, music, images and more so that machines can now communicate across the Internet with an alphabet of 0's and 1's. The impartial rules of mathematical logic govern computer operations, Internet addressing, and even Web search engines.

Within the Internet, mathematics is at the heart of security for messages and financial transactions. It is the basic tool of data compression, coding, and error correction for transmitting large files. It is the foundation of databases for managing email addresses and for searching the World Wide Web, and it is the agent for routing messages and managing networks.

The Internet is also helping advance mathematical research and education. Groups of educators and researchers communicate through email, newsgroups, and special World Wide Web sites. The Internet also supports distributed computing such as the recent cooperative effort which linked computers across dozens of countries to crack a code once thought secure for 20 millennia.

The 1997 Mathematics Awareness Week theme poster uses visualizations developed by Bell Laboratories that depict world-wide Internet traffic over a two-hour period. The color and thickness of arcs between countries show inter-country traffic, with higher and redder arcs indicating larger traffic flows.

[Ed. note: as additional web links are proposed and verified they will be incorporated as hypertext in this essay.]

Managing data on the Internet

As most people know, Internet messages --- email, graphics, sound, the results of database searches --- are transmitted as strings of 0's and 1's. Mathematics is central to two parts of this digital translation and transmission:

accurately transmitting a text message, say, that has been translated into binary numbers requires codes for detecting and correcting errors (not to be confused with secret codes), and
reducing the volume of data in an image, for example, that must be transmitted and then reconstructed as a reasonable likeness of the original, uses the tools of data compression.

When massive strings of 0's and 1's are forced over computer networks, some errors are inevitable, and even small losses of data can be catastrophic. Error detecting codes introduce mathematical tools to detect many of those losses, much like counting the number of pages in a long letter as a way of determining if anything was lost in the mail.

A standard tool for detecting errors in Internet transmission are cyclic codes; a common choice is CRC-16, a cyclic redundancy code that can detect errors in as many as 16 consecutive bits in a message. CRC-16 can also catch about 99% of errors longer than 16 bits. When an error is detected, the receiving computer simply fails to acknowledge its arrival, and the sender knows to retransmit.

These codes perform a special kind of division on the numerical representation of the message. The sender can attach the remainder from that division to the message without adding much to its length, but the extra information enables the receiver to verify the message by repeating the division. Getting the wrong remainder means the message was corrupted.

The relevant coding ideas were first introduced in the early 1950s; R. W. Hamming and D. A. Huffman did some of the earliest work. The mathematical ideas of algebraic coding theory, which emerged in the 1960s and built on the older discipline of finite fields, enabled error detection and correction to blossom to its current effective state. For example, the Reed-Solomon error-correcting codes, which were introduced in the 1960s by Irving Reed and Gustave Solomon as an application of ideas in finite fields, provide error detection and correction so effective that they are used in devices ranging from satellites to compact disks.

Like error correcting codes, data compression ideas are also shared across a wide range of technologies, including the forthcoming digital television. (One second of high definition, uncompressed video would require more than seven hours to arrive over a conventional home modem!) The challenge of data compression is to reduce by many orders of magnitude the volume of data, and hence the transmission time, while preserving all the visually important parts of the image.

Good data compression schemes help World Wide Web graphics appear quickly and attractively on a computer screen. The same tools bring sound files that please the ear, even though selected parts have been removed or reconstructed. Some of the latest data compression ideas use wavelets, a kind of multiscale analysis tool.

Wavelets are a mathematical tool that has been developed in the last decade by A. Grossman, Stephen Mallet, Ingrid Daubechies and others to circumvent the limitations of classic Fourier analysis, which is restricted to analyzing fundamental frequencies. The Fourier approach can easily reproduce a long continuous tone by knowing just the amplitudes of its key harmonics. But short, bursty signals --- the kind heard in real music or seen in an image like a fingerprint --- require a tool that can work across different frequency scales and time windows.

Although Fourier analysis can be bent to this task, wavelets are much better suited to many signal-processing applications because they are built by reproducing a particular scaled view of a fundamental signal component. They naturally accommodate compact storage of an image like a fingerprint, for example, even though its pattern of ridges extends for only a finite distance across the page.

Security on the Internet

Security on the Internet is as important as the security of a bank vault. Security concerns encompass privacy of messages, integrity of computers connected to the Internet, and trust in financial transactions, among many other issues. The rapidly growing Internet marketplace, for example, depends heavily on secret codes that combine centuries-old number theory with discoveries of the past two decades.

Moreover, efforts to break such codes use the Internet to distribute the computing burden over a wide array of machines. That distributed computing in turn depends in a crucial way on modern extensions of an old idea of Fermat for methodically searching for prime factors of large numbers.

Internet security can be seen in two complementary parts. One is the problem of sending a message that only the recipient can read, insuring both confidentiality of the message and its fidelity. The other is verifying the identity of the sender of a message. The first amounts to finding a code which is hard to crack while still permitting rapid transmission and decoding. The second is the problem of digital signatures: how can an Internet merchant be sure that the signature on an electronic check is genuine? The solutions to both problems rest squarely on the shoulders of number theory, a deceptively deep branch of mathematics.

Conventional encryption schemes like the Data Encryption Standard (DES) function like a locked mailbox to which the sender and recipient each have the only two keys. The problems here are securely transmitting keys to new pairs of correspondents and managing the large collection of keys a busy correspondent needs. Public-key or RSA systems (named for R. L. Rivest, A. Shamir, and L. Adleman, who published the first workable scheme in 1978, based on ideas of W. Diffie and M. Hellman) have been likened to open mailboxes that can be slammed shut by any sender who wishes to deposit a message but opened only by the owner of the mailbox, the recipient. That is, anyone can code a message for a given recipient but only the recipient can decode it.

Coding a public-key message requires knowledge of two (large) numbers, the so-called public key; decoding it requires a third number, the private key known only to the recipient. The coding and decoding steps use modular arithmetic, a kind of clock face arithmetic (for example, dividing a journey time of many, many hours by 24 to find the remainder that determines the time of day of arrival).

When new members join a public-key encryption scheme, the first step in establishing their keys is their (random) choice of two large prime numbers. Then the keys are calculated from those prime numbers in a series of steps based on a two-century-old theorem of Euler. The public-key scheme is secure unless those two prime numbers can be recovered by factoring one of the public key numbers; the RSA system uses numbers of 129 digits because their prime factors are so difficult to find.

Indeed, Rivest, Shamir, and Adleman believed their scheme to be so secure that in 1977 they challenged the world to decode a message that had been encoded into 128 digits. At that time, they estimated that factoring would take 23,000 years! Yet in 1994, an informal group of 600 volunteers in more than two dozen countries, communicating through the Internet, collected idle CPU cycles on all sorts of computers --- even fax machines --- to put to work Carl Pomerance's 1981 quadratic sieve algorithm, a descendant of 350-year-old ideas of Fermat. After eight months of work, the 64 and 65 digit factors were discovered.

The RSA 128 digit challenge number was cracked using the Internet, albeit not very quickly. Distributed computing over the Internet cracked the wall protecting the confidentially of distributed communication! Increased security can be easily attained by using public keys with more digits.

The digital signature problem --- signing electronic checks, for example --- is solved by reversing the public key process. The sender transmits both the message and a "decoded" version of the message. If a recipient can recode the decoded version and recover the original message, then it is genuine. Once again, the burden of factoring large numbers provides the necessary security.

Meanwhile, mathematics is heavily involved in other assaults on a variety of security schemes. These involve a systems approach to teasing out the mathematical key to the code involved.

A recent report from the National Academy of Sciences - Cryptography's Role in Securing the Information Society - has more information.

Databases and searching

Powerful Web search engines like AltaVista and Yahoo! let Internet users find specialized nuggets of information hidden all over cyberspace. The heart of most of these search tools is an index of key words; each index entry lists the Web sites that contain that key word. (The entry for "mathematics" in one search index lists 332,966 sites!) Ideally, the search engine returns not just the intersection of all index entries for the given key words but also a priority score reflecting the potential relevance of each listed site to the searcher's needs.

Some of the latest thinking on balancing comprehensive search with relevance has led to a vector space model of the information in the index. The coordinates of the space are the terms in the index, the key word vocabulary through which one can search. Each Web site is a point in that space whose coordinates are determined by its "hits" on the key words, perhaps giving the most relevant key words the largest coordinate values. Sites with similar information are represented by points in this space that are near one another in some sense.

Searching is then a problem of finding nearest neighbors in a space of very high dimension, ideally with computational effort that grows no more rapidly than the dimension of the space. Imposing on these spaces probabilistic models of how information is distributed leads to nonstandard geometries; for example, the familiar triangle inequality --- two sides of a triangle are always longer than the third --- can fail, further complicating the challenge of finding efficient search algorithms.

From an algebraic perspective, the vectors of key word coordinates can be thought of as columns in a key word-to-web site incidence matrix --- across the row corresponding to a given key word are nonzero entries for each site containing the key word. The goal is finding sites with similar information by discovering relations among key words, such as between "mathematics" and "number", which might not have occurred to the searcher.

From this incidence matrix, one can extract certain preferred directions, known as eigenvectors, in the key word vector space. Corresponding to each eigenvector is a measure of its importance, its so-called singular value. Sites lying in the direction defined by an eigenvector share the common information it depicts. The larger singular values identify the most significant of these semantic overlaps. Computing eigenvectors and singular values is a formidable problem for such large matrices, but this technique shows promise for producing searches whose results are complete and highly relevant.

In reality, search engines do not explicitly manipulate matrices with hundreds of thousands of rows and columns. Instead, they rely upon clever computational implementations of databases.

Many databases are built around the mathematical object known as a tree. These trees are like family trees that record relations among parents and children and their ancestors and descendants. An index, for example, might consist of twenty-six family trees, one for each letter of the alphabet. The first level of children would be all legal two-letter combinations, and so on; for example, aardvark would be a distant descendant of aa.

Beyond the parent-child connection, relational databases define additional relationships among their entries. The power of a relational database comes from its ability to manipulate those relations; e.g., performing an intersection operation that can find a common string of letters appearing in two different words. The rules for those manipulations are mathematically defined in a relational algebra or relational calculus specific to that database structure. Mathematics is the framework for describing database constructs, and mathematical tools are the basis for improving their efficiency and reliability.

Routing and network configuration

A local area network of moderate size might have 10,000 pairs of nodes that communicate with one another. The messages they share are like trains running at the speed of light on the tracks of the network. Each car in the train carries part of one message, as if a long letter had been written on a series of postcards, one card per car. Typically, cards from many messages are mixed in one train.

The performance of the network depends on the length of the trains --- the size of the message packets --- and on the space between the trains. For example, a long message train that arrives at the wrong time can delay many other messages until it passes; short messages properly spaced can be slid in among one another.

The mathematical ideas of queuing theory can predict the behavior of message handling protocols based on information about the size and arrival patterns of these message packets. (The classic application of queuing theory is estimating the waiting time at a bank, given the arrival patterns of customers and the service time of the bank teller.)

But investigations of alternate message handling protocols are based on mathematical models of the message traffic. Good models assure that a new protocol will perform as well in practice as queuing theory predicts; bad models can lead a protocol developer to make performance promises that can't be fulfilled.

Scientists at Bellcore, AT&T Labs, and Boston University have recently discovered that network traffic has the fractal property of self-similarity over time, a finding that provides a sound physical basis for developing simpler, more accurate Internet traffic models with which to test proposed protocols. The self-similarity ideas are mathematics imported from a prior analysis of commodity markets by Benoit Mandelbrot. The central physical notion is that interactions of a computer with the network occur over a wide range of time scales, much like a human's interaction with a single computer.

With a good model of network traffic data in hand, the designers of network traffic management protocols then face a delicate balancing act between sending data over shortest paths and reducing congestion, much like a rush-hour commuter who has to decide whether to risk congestion on the direct highway into town or to take a more circuitous but less traveled back road.

When traffic is light, propagation time dominates, and message packets are best sent by the shortest path. Finding the shortest path through a network is a well studied problem in the area of mathematics known as graph theory. (R. E. Bellman, L. R. Ford, and E. W. Dijkstra are among the mathematicians who first developed shortest path algorithms in the late 1950s.) When traffic increases, the network router needs to find all paths between the sender and receiver, a task that can be handled using modern graph search techniques developed in the 1970s by R. E. Tarjan, J. E. Hopcroft and others. Knowing that both the shortest path and all available paths can be found, many network routing protocols then focus on different criteria for making the commuter's choice between the short but possibly congested road and the longer but freely flowing route.

Mathematics on the Web

Mathematicians take full advantage of the Internet and the World Wide Web. These tools let them share ideas, techniques, and resources across geographic and disciplinary boundaries to advance both teaching and research.

Central hubs for a wide range of mathematical activity, including considerations of the role of mathematics in society, are the Math Forum and the home pages of the three societies that sponsor Mathematics Awareness Week: the American Mathematical Society, the Mathematical Association of America, and the Society for Industrial and Applied Mathematics.

Examples of more specialized sites are the Math Archive ,which specializes in educational issues, and the Geometry Center ,whose focus is computation and visualization of geometric structures. Number theorists interested in the search for so-called Mersenne primes pool their resources through the Great Internet Mersenne Prime Search . For many years, computational scientists have shared problems, solutions, and methods through NA-Net, where the best public-domain numerical analysis software is available for downloading.

Mathematics and the Internet

Mathematics is the language of Internet operation, from the binary numbers that describe text and images to the complex data structures of search engines for the World Wide Web. Adroit combinations of old and new ideas from fields like number theory have enabled such key Internet technologies as data encryption for secure financial transactions. At the same time, the Internet has given birth to world-wide collaborations among mathematics teachers and researchers, collaborations that are advancing both education from kindergarten through university and our understanding of some of the most difficult problems of pure and applied mathematics.

Mathematics Awareness Month is sponsored each year by the Joint Policy Board for Mathematics to recognize the importance of mathematics through written materials and an accompanying poster that highlight mathematical developments and applications in a particular area.

Mathematics and the Internet MAW 97 Theme Essay (full version 2.1)