Discussion:
[clamav-users] Scanning very large files in chunks
sapientdust+
2016-08-04 23:40:49 UTC
I've recently run into the issue of clamd not being able to scan files
that are larger than a small number of GB, and I have seen the
warnings in `man clamd.conf` that say specified limits above 4GB are
ignored.

Could developers or other folks familiar with the clamd codebase
comment on the feasibility of scanning large files in multiple pieces
as a way of handling larger files?

For example, given a file that is 6GB, does using multiple INSTREAM
calls (that's how I'm interacting with clamd currently) to check the
full 6GB seem like it should work reliably?

INSTREAM: bytes 0-1000MB
INSTREAM: bytes 900MB-1.9GB
INSTREAM: bytes 1.8GB-2.8GB
INSTREAM: bytes 2.7GB-3.7GB
INSTREAM: bytes 3.6GB-4.6GB
INSTREAM: bytes 4.5GB-5.5GB
INSTREAM: bytes 5.4GB-6.0GB

There is overlap above, wherein the 100MB of data that starts at the
900MB position is scanned twice, once in the first call (as the last
100MB of that stream) and once in the second call (as the first 100MB
of that stream), to reduce the possibility of a virus being split into
two pieces and therefore not recognized.

If ClamAV needs the first bytes in order to know
what kind of file it is scanning and trigger filetype-specific heuristics, then
something like the above could be adapted so that the first N bytes of the
first chunk are prepended to each subsequent chunk that is checked for
that file.
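
For concreteness, a minimal sketch of this chunk-and-overlap loop over clamd's
INSTREAM protocol (each chunk framed as a 4-byte network-order length prefix
plus data, terminated by a zero-length chunk) might look like the Python below.
The clamd address and the chunk/overlap sizes are illustrative assumptions,
not recommendations, and clamd's StreamMaxLength would have to be raised to at
least the chunk size.

# Sketch only: overlapping-chunk scanning via clamd INSTREAM.
# Assumptions: clamd on TCP 127.0.0.1:3310; sizes match the illustrative
# figures in the list above.
import socket
import struct

CLAMD_ADDR = ("127.0.0.1", 3310)   # assumed clamd TCP address
CHUNK = 1000 * 1024 * 1024         # ~1000 MB per INSTREAM call
OVERLAP = 100 * 1024 * 1024        # ~100 MB re-scanned at each boundary
IO_BLOCK = 1 << 20                 # stream data to clamd in 1 MB pieces

def instream_scan(blocks):
    """Run one INSTREAM scan over the byte blocks yielded by `blocks`."""
    with socket.create_connection(CLAMD_ADDR) as s:
        s.sendall(b"zINSTREAM\0")
        for block in blocks:
            # Each chunk: 4-byte network-order length prefix, then the data.
            s.sendall(struct.pack("!L", len(block)) + block)
        s.sendall(struct.pack("!L", 0))   # zero-length chunk ends the stream
        return s.recv(4096).decode("ascii", "replace").strip("\0\n ")

def read_range(path, start, length):
    """Yield IO_BLOCK-sized pieces of the file's bytes [start, start+length)."""
    with open(path, "rb") as f:
        f.seek(start)
        remaining = length
        while remaining > 0:
            block = f.read(min(IO_BLOCK, remaining))
            if not block:
                break
            remaining -= len(block)
            yield block

def scan_in_overlapping_chunks(path, size):
    """Scan a large file as a series of overlapping INSTREAM calls."""
    results = []
    start = 0
    while start < size:
        length = min(CHUNK, size - start)
        results.append(instream_scan(read_range(path, start, length)))
        start += CHUNK - OVERLAP      # step back by OVERLAP before the next call
    return results

The header-prepending variant from the previous paragraph would just mean
yielding the first N bytes of the file before the bytes of each later range.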

Thanks for any guidance or feedback you can provide.
Al Varnell
2016-08-05 02:14:05 UTC
Does anybody have any evidence of malware that exceeds 4GB? Although I can certainly see the utility of the proposed capability as a hedge for the future, it would seem to be a waste of time and compute power to scan such large files today.

With the ever increasing malware issues we face today, it’s important to consider this:

Risk = threat x vulnerability x consequence

<http://fortune.com/2016/05/14/cybersecurity-risk-calculation/>

We all need to focus on fixing the high risk items first.

Sent from Janet's iPad

-Al-
Al Varnell
2016-08-05 03:08:59 UTC
Certainly agree that many, many disk images are known to contain malware, but the usual approach there is to use a hash value for the whole file, as there are other issues with attempting to scan inside the image without mounting it first. The most recent versions of ClamXav now do both a hash check and a scan after mounting.

-Al-
Disk images often contain whole file systems and thus many, many files.
The alternative is to scan the entire FS after it is "mounted". (Of
course disk images these days might be 6 TB rather than a mere 6 GB.)
sapientdust+
2016-08-08 16:56:33 UTC
Post by Al Varnell
...
Risk = threat x vulnerability x consequence
I agree. In my case, the consequence factor is very large, and I have
to scan even the large files somehow. Skipping large files would just
provide an easy attack vector for the system that ClamAV is
protecting. In addition to the file types mentioned elsewhere in this
thread that can be larger than a few GB, I've personally seen
Photoshop files and PDFs in the 3GB-7GB range.

Does anybody have any feedback on the proposed solution to scanning
large files in chunks? If I test a virus embedded in some large files
at various locations (just inserting the virus bytes into the file)
and verify that ClamAV does detect it reliably, are there any reasons
that the method wouldn't work for all file types, assuming that the
initial bytes of the file are prepended to each chunk so that ClamAV
knows what type of file it is?
G.W. Haywood
2016-08-09 16:40:11 UTC
Hi there,
... Risk = threat x vulnerability x consequence
I agree. In my case, the consequence factor is very large ...
Perhaps you can elucidate the consequences. If the consequence factor
is as you say very large, then you have a problem to solve.
I have to scan even the large files somehow.
This will not solve the problem. It can never and will never solve it.
You need to find another way of going about things.
Skipping large files would just provide an easy attack vector ...
Then you have to fix the system so that it wouldn't be easy.
Does anybody have any feedback on the proposed solution to scanning
large files in chunks?
Stop worrying about it, it's a waste of time and effort. The probability
that you will actually find what you're looking for is very small.
... are there any reasons that the method wouldn't work for all file
types, assuming that the initial bytes of the file are prepended to
each chunk so that ClamAV knows what type of file it is?
Yes. Because of what I wrote above. Forget prepended bytes and fancy
ways of doing things that won't solve the problem. Look at the problem
in a different way. I'm sure this isn't what you want to hear, but it's
the way things are.

I don't worry about viruses. The reason for that is that I don't use
Windows boxes. The main reason I use ClamAV is to stop spam and similar
junk which third-party databases do pretty well. Scanning for viruses
and similar is just a bonus as far as I'm concerned, it means that if
something is found then we might be able to alert somebody to a problem
that they might have, or we might be able to avoid passing something
on from one correspondent to another through our mail.

But of course we might not find it.

Like all virus scanners, ClamAV's detection performance is not 100% and it
never will be. I suspect it catches nearer 30% of the viruses that my servers
see, but that's just my personal experience in what are probably very
atypical systems -- for a start, 25% of the internet address space is
firewalled and if a packet gets past the firewalls it gets harder from
there; spammers and purveyors of malware get firewalled for a single
offence, permanently, and their entire network gets firewalled, not
just the one IP that tried it on. Very atypical. But the point is
that I still see *new* threats which will not usually be found by any
scanner and if the system is vulnerable it will succumb.

If you want to test ClamAV performance, set up a mail server and grab
all the ***@p it sees for a few months. Run all that past a couple of
dozen virus scanners such as you can find on jotti.org and then come
back and tell us what you've found.
--
73,
Ged.

Reindl Harald
2016-08-09 16:59:19 UTC
Post by G.W. Haywood
Post by sapientdust+
Does anybody have any feedback on the proposed solution to scanning
large files in chunks?
Stop worrying about it, it's a waste of time and effort. The probability
that you will actually find what you're looking for is very small.
Post by sapientdust+
... are there any reasons that the method wouldn't work for all file
types, assuming that the initial bytes of the file are prepended to
each chunk so that ClamAV knows what type of file it is?
Yes. Because of what I wrote above. Forget prepended bytes and fancy
ways of doing things that won't solve the problem. Look at the problem
in a different way. I'm sure this isn't what you want to hear, but it's
the way things are.
On the other hand, content-scanner limits are often around 260 KB, with the
justification that "scan time may be much higher and nobody sends such large
spam because of the send rate" - meanwhile it's common for junk mail to carry
a large attachment just to bypass the filter.

Back to ClamAV: even if you scan "only" 20 MB files rather than gigabytes,
why should it be a goal to allocate that memory all at once and cause trouble
in the case of *parallel scans*, as on a mail server?

That, and the fact that the memory usage of the signatures is already
terrible, shows that ClamAV *will have* to deal with these problems in the
future, since it is already the main memory consumer on an inbound mail server.
sapientdust+
2016-08-09 17:21:38 UTC
Post by G.W. Haywood
Hi there,
... Risk = threat x vulnerability x consequence
I agree. In my case, the consequence factor is very large ...
Perhaps you can elucidate the consequences. If the consequence factor
is as you say very large, then you have a problem to solve.
The specifics are not important to my question, which is about the
TECHNICAL feasibility of scanning in multiple pieces. If it won't work
reliably (relative to scanning files small enough to be scanned in
their entirety at once), that's fine, and I will have to switch to
another AV scanner, but I was hoping for some specific technical
reasons why it won't work before giving up on ClamAV.
Post by G.W. Haywood
I have to scan even the large files somehow.
This will not solve the problem. It can never and will never solve it.
You need to find another way of going about things.
What's the technical reason that it won't work?
Post by G.W. Haywood
Skipping large files would just provide an easy attack vector ...
Then you have to fix the system so that it wouldn't be easy.
Does anybody have any feedback on the proposed solution to scanning
large files in chunks?
Stop worrying about it, it's a waste of time and effort. The probability
that you will actually find what you're looking for is very small.
What are the technical reasons that the probability is very small
(compared to the probability of finding a virus if the file is small
enough to be scanned in one instream call)?
G.W. Haywood
2016-08-10 17:11:21 UTC
Hello again,
Post by sapientdust+
The specifics are not important to my question
Post by G.W. Haywood
Post by sapientdust+
In my case, the consequence factor is very large
...
Post by sapientdust+
Post by G.W. Haywood
Post by sapientdust+
Does anybody have any feedback on the proposed solution to scanning
large files in chunks?
Stop worrying about it, it's a waste of time and effort. The probability
that you will actually find what you're looking for is very small.
What are the technical reasons that the probability is very small
(compared to the probability of finding a virus if the file is small
enough to be scanned in one instream call)?
I didn't say anything about comparisons. You asked for feedback, I
gave you some, and I said you wouldn't like it. You're not going to
like it any better if you modify the question, because my feedback is
going to be the same. I've been using ClamAV for more than a decade
so I have a reasonable idea what it can achieve and what it can't.
Post by sapientdust+
If it won't work reliably (relative to scanning files small enough
to be scanned in their entirety at once), that's fine, and I will
have to switch to another AV scanner ...
You aren't making any sense. As I explained in the parts of my post
which you have conveniently snipped, no scanner can do what you want,
not least because there are such things as zero-day vulnerabilities.
--
73,
Ged.
sapientdust+
2016-08-10 18:21:18 UTC
Hello,

On Wed, Aug 10, 2016 at 10:11 AM, G.W. Haywood
Post by G.W. Haywood
Hello again,
Post by sapientdust+
The specifics are not important to my question
Post by G.W. Haywood
Post by sapientdust+
In my case, the consequence factor is very large
Those two statements are perfectly consistent. The consequences are
significant enough that I have to scan all files, but why the
consequences are large, or what the specific consequences are, doesn't
matter for my technical question.
Post by G.W. Haywood
Post by sapientdust+
Post by G.W. Haywood
Post by sapientdust+
Does anybody have any feedback on the proposed solution to scanning
large files in chunks?
Stop worrying about it, it's a waste of time and effort. The probability
that you will actually find what you're looking for is very small.
What are the technical reasons that the probability is very small
(compared to the probability of finding a virus if the file is small
enough to be scanned in one instream call)?
I didn't say anything about comparisons. You asked for feedback, I
gave you some, and I said you wouldn't like it. You're not going to
like it any better if you modify the question, because my feedback is
going to be the same. I've been using ClamAV for more than a decade
so I have a reasonable idea what it can achieve and what it can't.
I didn't say that you mentioned comparisons. I was making clear that
I'm not asking for 100% reliability and I'm not asking whether the
multi-scan idea is perfect in some general sense, but only whether
it's significantly worse than scanning a smaller file that doesn't
need to be broken into multiple pieces.

I'm interested in knowing if there are technical reasons why the
following two scenarios would work very differently:

scenario 1:

I scan a 2.5 GB file in one instream call

scenario 2:

I scan a 4.5 GB file in multiple instream calls, by scanning the first
3 GB in one call, and then making a second instream call that provides
the first N MB followed by the last 2 GB of the file.
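
Spelled out as byte ranges (the sizes here are only illustrative, and the
header length N is an arbitrary assumption, not something ClamAV prescribes),
scenario 2 amounts to something like:

# Sketch of the (start, length) ranges each INSTREAM payload would carry.
GB = 1024 ** 3
MB = 1024 ** 2
HEADER_BYTES = 16 * MB          # "the first N MB"; value chosen arbitrarily here
FILE_SIZE = int(4.5 * GB)

stream_1 = [(0, 3 * GB)]                                    # first 3 GB
stream_2 = [(0, HEADER_BYTES),                              # file-type header, prepended
            (int(2.5 * GB), FILE_SIZE - int(2.5 * GB))]     # last 2 GB, overlapping by 0.5 GB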

Would clamav be expected to work similarly in the two cases in terms
of identifying a virus, assuming the virus is the same in the two
scenarios and it's in ClamAV's database? Or are there technical
reasons why ClamAV wouldn't detect the virus in the second scenario
but would in the first, even though the virus bytes are identical?

This is a question for clamav developers or those who understand the
codebase sufficiently to know the impact of scanning a partial file.

Should I have asked this question on the developer list? I asked here
because it looked like the developer list gets very little use, and I
thought developers would probably be on this list too.
G.W. Haywood
2016-08-11 17:15:08 UTC
Hello once again,
Post by sapientdust+
I scan a 4.5 GB file in multiple instream calls, by scanning the first
3 GB in one call, and then making a second instream call that provides
the first N MB followed by the last 2 GB of the file.
Would clamav be expected to work similarly in the two cases in terms
of identifying a virus, assuming the virus is the same in the two
scenarios and it's in ClamAV's database? Or are there technical
reasons why ClamAV wouldn't detect the virus in the second scenario
but would in the first, even though the virus bytes are identical?
There's a possibility of failing to find it in the second scenario.
It's anybody's guess what the probability will be; my guess would be
that the probability of that failure would be small compared with the
relatively large probability of not finding it at all in both cases.
Post by sapientdust+
This is a question for clamav developers or those who understand the
codebase sufficiently to know the impact of scanning a partial file.
I don't think so. Just think about it a bit:

Much of ClamAV's operation is looking for pattern matches.
Suppose you scan a 4.5GB file in two chunks.
Suppose half this mysterious 'huge file virus' is in the first chunk.
Presumably the other half is in the second chunk.
What happens if the pattern is designed to match the entire virus?
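
As a toy illustration of that boundary problem (plain substring search in
Python, not ClamAV's matcher; the pattern and data are made up), a pattern
that straddles a chunk boundary is missed when the chunks are scanned
independently, unless the chunks overlap by at least the pattern length:

pattern = b"FAKE-SIGNATURE-BYTES"
data = b"A" * 100 + pattern + b"B" * 100      # pattern crosses offset 110

def found_in_chunks(data, chunk_size, overlap):
    step = chunk_size - overlap
    return any(pattern in data[i:i + chunk_size]
               for i in range(0, len(data), step))

print(found_in_chunks(data, chunk_size=110, overlap=0))                  # False: pattern is split
print(found_in_chunks(data, chunk_size=110, overlap=len(pattern) - 1))   # True: overlap covers it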
Post by sapientdust+
Should I have asked this question on the developer list?
No. You're a user, the developers' list is for working on ClamAV.
--
73,
Ged.
sapientdust+
2016-08-12 21:32:22 UTC
On Thu, Aug 11, 2016 at 10:15 AM, G.W. Haywood
Post by G.W. Haywood
Hello once again,
Post by sapientdust+
I scan a 4.5 GB file in multiple instream calls, by scanning the first
3 GB in one call, and then making a second instream call that provides
the first N MB followed by the last 2 GB of the file.
Would clamav be expected to work similarly in the two cases in terms
of identifying a virus, assuming the virus is the same in the two
scenarios and it's in ClamAV's database? Or are there technical
reasons why ClamAV wouldn't detect the virus in the second scenario
but would in the first, even though the virus bytes are identical?
There's a possibility of failing to find it in the second scenario.
It's anybody's guess what the probability will be; my guess would be
that the probability of that failure would be small compared with the
relatively large probability of not finding it at all in both cases.
I was hoping to hear from a developer, because non-developer "guesses"
don't help me very much. There is a definite answer to my question,
but only somebody familiar with the ClamAV code will know the answer.

From what I've been able to learn from some emails on the developer
list, ClamAV signatures are specified either as patterns anchored at a
particular offset in a file of a certain type (matched only at that
offset), or as patterns that can appear anywhere in the file. Any virus
inserted into a very large file (multiple GB) would most likely be of
the second kind, which will be recognized wherever those bytes appear
in the scanned data, as long as the correct file type is identified.
That means that, as long as the first N bytes are prepended to each
chunk so that ClamAV treats each chunk as a file of the same type, it
should identify a virus in a block cut from a large file as reliably as
it would identify the same virus in a file that is small enough to be
scanned in one call.
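
One way to check this empirically without real malware would be a custom
signature in ClamAV's extended .ndb format (Name:TargetType:Offset:HexPattern,
with offset `*` meaning "anywhere"), scanned with `clamscan -d`. The signature
name and marker bytes below are invented for illustration:

# Hypothetical self-test of the "pattern with a '*' offset" idea.
# Requires clamscan to be installed; Test.Chunk.Pattern and the marker are made up.
import os, subprocess, tempfile

marker = b"CHUNK-SCAN-TEST-PATTERN"            # stand-in for "virus bytes"
sig = "Test.Chunk.Pattern:0:*:" + marker.hex() # .ndb: Name:TargetType:Offset:HexPattern

with tempfile.TemporaryDirectory() as d:
    sig_path = os.path.join(d, "test.ndb")
    sample = os.path.join(d, "chunk.bin")
    with open(sig_path, "w") as f:
        f.write(sig + "\n")
    # A "chunk": arbitrary data with the marker embedded somewhere in the middle.
    with open(sample, "wb") as f:
        f.write(os.urandom(4 * 1024 * 1024) + marker + os.urandom(4 * 1024 * 1024))
    # Expect something like "Test.Chunk.Pattern.UNOFFICIAL FOUND" if the '*'
    # offset really does match anywhere in the scanned data.
    print(subprocess.run(["clamscan", "-d", sig_path, sample],
                         capture_output=True, text=True).stdout)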
Post by G.W. Haywood
Post by sapientdust+
This is a question for clamav developers or those who understand the
codebase sufficiently to know the impact of scanning a partial file.
Much of ClamAV's operation is looking for pattern matches.
Suppose you scan a 4.5GB file in two chunks.
Suppose half this mysterious 'huge file virus' is in the first chunk.
Presumably the other half is in the second chunk.
What happens if the pattern is designed to match the entire virus?
First, the virus itself would not be huge. It would be just a normal
virus embedded in a large file, where almost all of the size is
legitimate data. Every example I have given since my first email breaks
the chunks up so that your scenario cannot arise: data that would
otherwise be split across a boundary is repeated at the start of the
next chunk, so it is never split into two pieces. Note above, where I
said a 4.5 GB file would be scanned in two calls, the first providing
bytes 0-3GB and the second providing the first N bytes concatenated
with the bytes from 2.5GB-4.5GB. The 3GB boundary falls squarely inside
the second chunk rather than at its edge, so anything spanning it is
still seen in its entirety, as long as the virus isn't larger than
500MB, which as far as I can tell is always the case for the sorts of
things that ClamAV can identify.
TR Shaw
2016-08-12 21:51:39 UTC
Actually, there is always a probability that a detection will not occur if you break a file apart into pieces. This is due to the following:

1) MD5 signatures, which can apply to any file type, match against the MD5 hash of the whole file AND the file's size (see the sketch below). If you break apart a file, neither the hash nor the file size will match the signature.

2) Complex (logical) signatures, which combine the results of multiple other signature detections, are the other type that can break if you split a file into pieces.

This question of breaking apart files for scanning comes up regularly from folks who need to support high-data-rate inputs and still be NIST/FISMA compliant, and the answer is always no, you can't do that.
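
For point 1, a small sketch of why whole-file hash signatures cannot fire on
a chunk; it assumes ClamAV's .hdb hash-signature layout of
MD5:FileSize:MalwareName (the same shape `sigtool --md5` prints), and the
file name used in the example is hypothetical:

import hashlib

def hdb_entry(path, name):
    """Build a .hdb-style line (MD5:FileSize:MalwareName) for a whole file."""
    h = hashlib.md5()
    size = 0
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
            size += len(block)
    return f"{h.hexdigest()}:{size}:{name}"

# Both the hash and the size are computed over the exact bytes scanned, so the
# entry for the whole file can never equal the entry for any chunk of it (or
# for the file after a single byte changes).
# Example with a hypothetical file:
#   print(hdb_entry("sample.bin", "Test.Hash.Signature"))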

Tom
sapientdust+
2016-08-13 03:03:09 UTC
Post by TR Shaw
Actually, there is always a probability that a detection will not occur if you break a file apart into pieces. This is due to the following:
1) MD5 signatures, which can apply to any file type, match against the MD5 hash of the whole file AND the file's size. If you break apart a file, neither the hash nor the file size will match the signature.
Thanks for the info! I don't quite understand this part though. In
http://lists.clamav.net/pipermail/clamav-devel/2015-March/000145.html,
Andy Singer explained that the "WIN.Trojan.DarkKomet:1:*:..." sig
would match the bytes anywhere in the file, so that's definitely not
taking the whole hash of the file into account.

It seems extraordinarily brittle to take the whole file digest into
account, because then a single bit flip anywhere in the file is enough
to evade clamav altogether, because it would be very easy to make
every file unique if clamav takes into account the size and the digest
of the full file.
Post by TR Shaw
2) Complex (logical) signatures, which combine the results of multiple other signature detections, are the other type that can break if you split a file into pieces.
This question of breaking apart files for scanning comes up regularly from folks who need to support high-data-rate inputs and still be NIST/FISMA compliant, and the answer is always no, you can't do that.
Tom
I see. Is that something you would expect to make a difference in
practice if the chunks were large (say 1 GB each)?

Thanks for your thoughts.