ComputerVirus Coevolution The battle to conquer computer viruses is far from won, but new and improved antidotes are controlling the field.
Carey Nachenberg S RECENTLY AS SIX YEARS AGO,COMPUTER viruses were considered an urban myth by many. At the time, only a handful of PC cworimtepruAreterpgoarmmdemroses.virushav viruses had been written and infection was relatively uncommon. Today the situation is very different. As of November 1996, virus e than 10,000 DOS-based In addition to the sheer increase in the number of viruses, the virus writers have also become more clever. Their newer creations are significantly more complex and difficult to detect and remove. These “improvements” can be at least partially attributed to the efforts of antivirus
January 1997/Vol. 40, No. 1COMMUNICATIONS OF THE ACM
“latest and greatest” viruses, the virus authors invent new and more devious ways to hide their progeny. This coevolution has led to the creation of the most complex class of virus to date: thepolymorphiccomputer virus. The polymorphic virus avoids detection by mutating itself each time it infects a new program; each mutated infection is capable of performing the same tasks as its par-ent, yet it may look entirely different. These cunning viruses simply cannot be detected cost-effectively using traditional antivirus scanning algorithms. Fortunately, the antivirus producers have responded, as they have in the past, with an equally creative solution to the polymorphic virus threat. Many antivirus programs are now starting to employ a technique known asgeneric decryp-tionto detect even the most complex polymorphic viruses quickly and cost effectively. - -
Antivirus gram that spreads by attaching itself to executable files orThis detection technique worked well for quite a while. system areas on diskettes. Recently , we have also encoun -However, as the number of viruses steadily increased, tered a new type of vir us that infects application data filesantivirus programs had to become mor e clever in or der to that contain macros. These viruses are constructed entirelyrun at r easonable speeds. After all, scanning for hundreds of application macros and use the macro language to prop-of virus signatures through hundreds of kilobytes of exe-agate themselves.cutable pr ograms was a very slow process, especially on In addition to their ability to r eplicate, some computer4.77 megahertz computers! These speed problems led to viruses also deliver apayloadtion of the vir us pr—a porof the great innovations in early antiviro- oneus programs. gram that is designed to damage the host machine, display a message, or do some other mischief without the computerOST COMPUTER VIRUSES,INCLUDING MANY operator’s consent. This article focuses primarily on howof today’s complex viruses, either append computer viruses replicate and obscure themselves.themselves onto the end or place them-M The vast majority of computer viruses have beenselves at the beginning of their host exe -designed specifically for IBM-based PCs running the DOScutable file. Furthermore, almost all and Windows operating systems. In terms of sophisticationcomputer vir uses ar e less than 4KB in and functionality, these DOS-based viruses are generationslength. Antivirus researchers recognized these patterns and ahead of vir uses written for other operating systems andoptimized their vir us scanners accor dingly. The r esult of platforms. Consequently , this article examines how thethis improvement was antivirus programs that could detect antivirus community has tackled these DOS-based viruses.the vast majority of file viruses by scanning less than 8KB Nonetheless, the concepts presented apply for viruses ofeach executable program. This technique dramatically found on all operating systems and computer platfor ms.increased the speed and efficiency of antivirus products and The first file-infecting viruses were simple machine lan-is still being used today in some pr oducts. guage programs that had the ability to attach and spread Whilethese impr ovements significantly impr oved the identical copies of themselves from program to pr ogram.speed of the vir us scanner, even mor e optimizations wer e When the user executed an infected program, the virus possible.For example, almost all file-infecting viruses would take contr ol of the computer and infect additionalinfect at the entry-point of their host program. The entry-files. After the virus completed its mischief, it would trans-point of a pr ogram is the location in the program of the fer control to the host program and allow it to functionfirst machine-language instr uction that is executed when normally. This type of virus is called a “parasitic” computerthe program is launched. virus, since it does not kill its host; instead, the host acts asAs an analogy, consider a notepad with a list of instruc-a carrier.tions on how to complete a cer tain task, such as sending a Since these vir uses r eplicated identical copies of theirfax. (See Diagram 1A.) In order to send a fax, a person can machine code and data each time they infected a new pr o-simply follow the instructions on the pad from start to fin-gram, they pr oved easy for antivirus pr oducts to detect.ish. Computers interpret pr ograms, which are basically Every time the antivir us researcher, received a new vir us,sequences of simple instructions, in much the same way. In they would analyze the virus and identify a sequence ofour notepad example, the entry-point of the notepad is the machine-language instructions that wer e both unique toline that contains the first instruction in the list. This the vir us and pr esent in ever y one of its infections; thiswould be the first instruction that a human would look at sequence of bytes is known as a virus signatur e. Thewhen they start on their task. researcher could then insert this signature into theWhen a virus infects a new pr ogram, it places itself at antivirus program so that it could search for the virus.the entry-point of the host program or modifies the The first antivirus programs were glorified string-machine-language instructions at the host’s entry-point to searching pr ograms. The antivirus pr ogram would opentransfer control to its virus body. In the latter case, the virus each executable file on the computer and scan its entire bodyis usually located at the end of the host program. (See contents for a cur rent set of vir us signatures. If any of theDiagram 1B.) signatures were found anywhere in the target program, theWhen the user executes a program, the operating sys -antivirus program would report the infection to the user. Iftem loads the program into the computer’s RAM and then none of the signatur es were found, the antivir us programinstructs the CPU to start executing the instructions at the would report that the file was uninfected.program’s entry-point. Viruses infect at the entr y-point of
47 COMMUNICATIONS OF THE ACMJanuary 1997/Vol. 40, No. 1
their host because it guarantees that when the user executescuting and in memory. a program, the vir us is immediately given control of theWhen the virus infects a new pr ogram, it re-encrypts a computer. copyof the virus body (using a complementary encryption Since most computer viruses wer e found to infect orprogram that is also carried along in the virus body) and modify the entry-point of a host program, antivirus authorsappends this onto the host program. It also appends the realized they could make their AV programs even faster .decryption routine onto the host program. Most encr ypt-They r edesigned their scanners to scrutinize only theing viruses use a dif ferent encryption key each time they instructions that can be reached directly from the entr y-infect a new program; consequently, the encr ypted vir us point in each executable program. This type of honedbody appears dif ferent in each infection. This means the search is known as entry-point scanning. The entr y-pointvirus decryption program constitutes the only consistent, scanner works in the following manner:visible sequence of instructions that are passed from infec-tion to infection. (See Diagram 2.) 1ypted virus posed new pr oblems for antivir usget pr ogram’sThe encr. Establish a variable E contains the tar entry-point location.researchers. In the past, we could detect vir uses by search-2.The entry-point scanner examines the machine-languageus signatur es extracted fring for virus body .om the vir instruction at location E in the suspected pr ogram.Clearly, this is not an option with the encrypted virus since 3.If this instruction transfers control to another instructionit doesn’t maintain a consistent vir us body fr om infection (as in Diagram 1B), then set E to the location of the des -to infection. However, the encrypted virus does propagate tination instruction and go back to step 2.an identical copy of its small decryption routine from infec-4.tion to infection. Could this be enough to detect the virus?Search the bytes at location E for vir us signatures. As it tur ns out, the answer is “yes” in most cases. The Therefore, if this algorithm were applied to the program indecryption r outines used by most encrypting vir uses ar e Diagram 1A, the entry-point scanner would finish with anoften unique. Most of them are made up of at least 10 to E value of 1, and sear ch for viruses starting at the very first15 distinct bytes. In fact, many antivirus researchers can instruction. If applied to the program in Diagram 1B, thelook at a disassembly of a virus decr yption r outine and entry-point scanner would finish with an E value of 6, andimmediately identify the vir us’ strain without decr ypting search starting for viruses at the first viral instruction. andexamining the rest of the virus! Clearly this technique greatly r educes the number ofBecause of this, we were able to use the same vir us sig-bytes that must be searched for viruses and dramaticallynature scanning technique to detect many of the encrypt-improves antivirus performance. This entry-point ingviruses. However, there was one drawback: most vir us scanning technique continues to be used by many BeforeA. popular antivirus programs. 1Insert document in fax machine. (Program entry-point) Unfortunately, these early advances were soon 2 Dialthe phone number. countered by the virus writers who realized they 3 Hitthe SEND button on the fax. could make their viruses more difficult to detect if4 Waitfor completion. If a problem occurs, go back to step1. the viruses did not spr ead exact copies of themselvestask.5 End from file to file. This led to the creation of the 6 7 encryptedvirus. 8 The encrypted virus consists of two par ts: a small 9 program known as the vir us decryption routine, and an encrypted virus body. The virus decryption routine AfterB. is composed of a sequence of machine-language 1Skip to step 6. (Virus modified entry-point, transfers control to instructions that can decrypt the encrypted vir usthe virus body on line 6.) body. The encrypted virus body is just an encryptedthe phone number.2 Dial version of the same vir us machine-language instr uc-the SEND button on the fax.3 Hit tions used by the simple viruses described earlier. 4 Waitfor completion. If a problem occurs, go back to step1. 5 Endtask. If a user executes an infected program, the vir us 6 VIRUSinstructions decryption routine immediately executes and 7 VIRUSinstructions decrypts the machine-language instructions that 8 VIRUSinstructions comprise the r est of the vir us. The decr yption rou-9 Insertdocument in fax machine. (Stored by the virus) tine then transfers control to the newly decrypted virus body and allows the virus to execute normallDiagram 1.Notepad example before and after infection can see, the virus body is visible only when the virat the entry-point
48January 1997/Vol. 40, No. 1COMMU NICATIONS OF THE AC
decryption r outines wer e compact machine-lan-guage programs. Consequently, our virus detection signatures for these vir uses were also shor t, and as signatures got smaller, the likelihood of false iden -tification and misidentification increased. False identification occurs when an antivirus program incor rectly identifies an uninfected pro-gram as being infected. Misidentification occurs when an antivirus program correctly identifies that a program is infected, yet improperly identifies the strain of the infection. Both of these results ar e undesirable. If an antivirus pr oduct mistakenly identifies a pr ogram as being infected, significant time and money must be spent to determine whether or not the infection is legitimate. In order to solve these problems, we made one additional impr ovement to our virus scanner. We modified it so that it could perform a secondary ver-ification when it located a questionable virus decryption r outine signatur e. The scanner would attempt to decrypt the contents of the would-be virus with the assumption the virus employed one of a number of simple encr yption techniques. As it turns out, a high percentage of encrypted viruses use easily decryptable encryption schemes, so this veri -fication was often successful. By performing this secondary verification, the an program could definitively identify whether or not t gram in question was infected by a virus. This tec has been named “x-raying,” since the antivirus pr looks through the encryption at the insides of the vi These encrypted viruses pose few pr oblems for antivirus programs; they can be detected quickly an Today, however , the polymorphic virus provides headaches for antivirus researchers.
The Polymorphic Virus With biological vir uses, mutations over whelmingl in nonviable offspring. However , because of their s numbers, at least some of the mutated offspring a cessful. Computer viruses cannot afford to play this Russian roulette. If a simple computer vir us were to propagate ra mutated copies of itself, the odds ar e the mutated c would fail to exhibit virus-like pr operties. In the likely scenario, a mutated child vir us would cause i program to crash any time it was executed. This wo immediately reveal the virus to the user and limit it ity to successfully reproduce. Because of these realities, cur rent polymorphic puter viruses do not mutate at all; instead, they hav cially designedmutation enginesthat simulate the pr mutation.
Encrypted with a key value of1(see line 6). Notice how lines 7, 8 and 9 are encrypted; compare with Figure1B. A. 1Skip to step 6. 2 Dialthe phone number. 3 Hitthe SEND button on the fax. 4 Waitfor completion. If a problem occurs, go back to step1. 5 Endtask. 6 Startingat line 7, shift back each letter by one. B goes to A, T goes to S, etc. (Virus decryption loop) 7 WJSVTjotusvdujnot (Encrypted "VIRUS instructions") 8 WJSVTjotusvdujnot (Encrypted "XIRUS instructions") 9 Jotfsuepdvnfou jo gby nbdijof. (Encrypted "Insert document in fax machine.")
Encrypted with a key value of 2 (see line 6). B. 1Skip to step 6. 2 Dialthe phone number. 3 Hitthe SEND button on the fax. 4 Waitfor completion. If a problem occurs, go back to step1. 5 Endtask. 6 Startingat line 7, shift back each letter back by two. C goes to A, U goes to S, etc. (Virus decryptionloop) 7 XKTWUkpuvtwevkopu (Encrypted "VIRUS instructions") 8 XKTWUkpuvtwevkopu (Encrypted "VIRUS instructions") 9 Kpugtvfqewogpv kp hcz ocejkpg. (Encrypted "Insert document in fax machine.")
Diagram 2.The encryption virus, two different infections
their accompanying vir us. The mutation engine must be able to produce seemingly random programs that can prop-erly perform decryption. Some of the more complex muta-tion engines can generate billions of billions of billions of different decr yption r outines, making simple signature detection impossible. While many polymorphic viruses can be detected using augmented versions of the entry-point and head-tail scan -ning algorithms, until recently, antivirus researchers had no efficient way to detect the more complex strains. In the past, researchers wrote specialized detection programs in C, assembly language, or special virus detection script lan-guages to catch the these viruses. These detection programs examined the machine code at the entry-point of each file. If they found sequences of instructions resembled those gen-erated by a given mutation engine, they would report that the file was infected. Many antivir us companies still write such specialized detection programs today. In fact, with some of the more complex viruses, it is not feasible to detect the vir us using any other technique. Unfortunately, the cr eation of these hand-coded, virus-specific detection programs requires thorough virus analysis and significant programming effort by a trained antivirus researcher. In addition, because these detection programs work based on different principles than the underlying scanner, they often significantly reduce the efficiency of the antivirus product. Worst of all, these programs ar e pr edisposed to false identification and misidentification. The decryption rou-tines cr eated by the more complex engines can take so many dif ferent for ms that it is often difficult to discer n between a virus decryptor and a legitimate program.
The Solution Over the past two years, several antivirus companies have incorporated generic decryption (GD) technology into their virus scanners. This technology enables the antivir us program to detect even the most complex polymorphic viruses with ease, while maintaining fast scanning speeds. Using GD, an antivirus researcher can update a virus scan-ner to detect new polymorphic viruses within minutes or hours instead of days or months. Recall that all cur rent polymorphic viruses have a body of machine code that is encrypted and copied verbatim from infection to infection. When a program infected with a polymorphic virus is launched by the user, the polymorphic decryption routine executes and decr ypts this unchanging virus body. Once the decryption routine finishes decrypting the body, it transfers contr ol to the vir us so it can spread. Thus, when a file containing a polymorphic virus is exe -cuted, the virus is guaranteed to decrypt itself and reveal its innards; otherwise, it would not be able to execute and con-stitute a viral threat. The GD technique relies on this
50 January 1997/Vol. 40, No. 1COMMUNICATIONS OF THE ACM
behavior in order to detect polymorphic viruses. The GD scanner is comprised of a CPU emulator, a virus signature scanner, and an emulation contr ol module (ECM). When the user requests a scan of an executable file, the GD antivir us loads the suspect program into a soft-ware-simulated, vir tual computer . The program is then allowed to execute in this virtual computer as if it were running on a r eal computer. During this execution, if the target file is infected with a virus, it can cause absolutely no damage to the actual PC since it is executing in a com-pletely contained environment. At the start of the simulation, the emulator begins exe-cuting the instructions at the infected program’s entr y-point. These instructions usually transfer control to, or directly constitute, the virus’ decryption routine. Just as on a r eal computer, the decryption r outine wastes no time deciphering the body of the virus. The emulation contr ol module regulates the emulation and continues it as long as its required. The GD constantly monitors the progress of the emula-tion session. At regular intervals, it calls upon its signature scanner to search through modified and potentially decrypted regions of vir tual memory for vir us signatures. Effectively, the vir us does all the decryption work for the antivirus program. As a result, even vir uses that employ arbitrarily complicated encryption schemes can be detected with ease. Furthermore, the GD scanner can exactly iden-tify the strain of the vir us since it has access to the entire, decrypted virus body. In essence, this is like injecting a mouse with a serum which may or may not contain a vir us, and then observing the mouse for any adverse effects. If the mouse becomes ill, (i.e., the vir us manifests itself) we can observe the visible symptoms and identify the virus. If the mouse stays healthy, then we can select another vial of serum and repeat the process. The benefit of GD is it pr ovides accurate identification of polymorphic viruses and dramatically reduces the possi-bility of false identification or misidentification. This behavior r esults because the scanner detects viruses by examining the static virus body instead of the polymorphic section of the virus, which can take numerous forms. Unfortunately, it is not such an easy chor e to create the perfect generic decryption scanner. Perhaps the most diffi-cult task in constructing a GD scanner is designing the emulation control module. If the GD scanner knew that it would always be called upon to scan infected files, it would be trivial to design a perfect ECM. Its one and only rule would be: emulate the target program until one of the viral strains is definitively identified, then quit. This situation would be ideal but it is not realistic. The majority of users do not have virus-infected files on their computer. Therefore, the ECM must decide how long to
emulate each pr ogram befor e giving up in its search fortine that used 80286-specific machine-language instruc-viruses. For example, if the ECM were to emulate each pro-tions, then the 80486 CPU emulator might be unable to gram for five minutes, it would allow the GD scanner toproperly emulate and decrypt the virus. Granted, the virus catch almost all polymorphic viruses. However , thewould be unable to function on a real 80486-based com -antivirus program would be too slow for the average user. putereither. However, it still might be a viable virus on Conversely, if the ECM only emulated the first 15many older machines. instructions of each program, unless it observed someHow might a GD antivir us program handle this pr ob-decryptor-like behavior, it might miss some viruses. Forlem? Well, the obvious answer is that it could be designed instance, some polymorphic viruses employ “do-nothing”to simulate ever y type of CPU for all scanned files. Each instructions within the polymorphic decr yption routine toand every scanned program would be emulated 20 different conceal their presence. If the ECM does not emulate enoughtimes, for each of the diferent major types of processors and instructions, it may be fooled by these do-nothing instr uc-in-line processor revisions. Unfortunately, this would dra -tions and terminate emulation prematurely.matically slow down the antivirus pr oduct to the point The difficulties in designing an ECM may remind somewhere it would be unusable. readers of the famous T uring Halting Pr oblem. The gr eatAlternatively, the virus could be detected using a more mathematician Alan Turing proved it is impossible to con-traditional technique, such as a hand-coded antivirus mod-The encrypted virus does propagate anidentical copy of its small decryption routine from infectionto infection. Could this be enough to detect the virus?