Understanding your Data
Let's say you scanned your environment and came up with a report like this one for your top 100 file types. There's a lot you can glean from it, but you really need to bucket it into groups. Is the data collaborative, is it designed for the file system, or is it archived data that you wouldn't want to put in your highly optimized collaboration system? The "buckets" will help you understand your data: what should stay on file servers and what is a potential fit for SharePoint environments.
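If you don't have a scanning tool handy, a report of this shape can be approximated with a short script. The following is only a minimal sketch, not the tool that produced the numbers in this post: the function names `scan_extensions` and `top_extensions` are made up for illustration, and it assumes you can read the share from wherever the script runs.

```python
import os
from collections import defaultdict

def scan_extensions(root):
    """Walk `root` and tally file count and total bytes per lowercase extension."""
    stats = defaultdict(lambda: {"count": 0, "bytes": 0})
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            ext = os.path.splitext(name)[1].lower() or "(none)"
            try:
                size = os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                continue  # skip files we cannot stat (permissions, broken links)
            stats[ext]["count"] += 1
            stats[ext]["bytes"] += size
    return dict(stats)

def top_extensions(stats, limit=100):
    """Return (ext, count, total_kb, avg_kb) rows, largest total size first."""
    rows = []
    for ext, s in stats.items():
        total_kb = s["bytes"] // 1024
        rows.append((ext, s["count"], total_kb, total_kb // s["count"]))
    rows.sort(key=lambda r: r[2], reverse=True)
    return rows[:limit]
```

Calling `top_extensions(scan_extensions(path))` on a share root (the path is yours to supply) gives rows you can paste into a spreadsheet and then tag by hand with a bucket such as collab, code, media, database, or archive.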
The data in this table comes from a scan of 50 file servers of various ages. The data was as old as 1987 and as new as 2004. What you see in this table of 7 TB will help you conceptualize what is in 29 million files on file servers.
Someone looking at this data from a search perspective could line up the extensions with the out-of-box filtering to see what they would get. I would love to see someone do that work to see which types line up with which IFilters and which simply don't make sense to index. I think you'd find that more than 50% of the size here isn't indexable or worth indexing.
Someone who looks at this from an Excel Services perspective would think: wow, that's a lot of Excel files! More than 1 million. What's going on? They'd find there's a lot of automation going on. How much of that automation could be turned into automated real-time reports?
The document management person would look at this data and ask: how should I divide this up? Currently it's 3,000 shares with nearly a million folders. Would I try to consolidate this into a single document library? Craziness, right? 3,000 document libraries? No. 3,000 site collections? Probably not, but we're getting closer. There is no straight-across mapping here. The shares will need to be taken one by one, with a similar assessment of what each is doing and which of those uses map across and will find value on the platform.
If I were to guess, I'd say 5% of current shares will need to keep using the file server as they do today. 90% can be set to read-only and archived. The remaining 5% need to be moved over with a lot of hand-holding, and a lot of people need training. The 3,000 share admins need site collection admin training, the 15,000 folder-creator, designer-type people need advanced user training, and those who consume need the mass user training: one-day quick-start training with brown bags, and lots of FAQs and CBTs.
| Extension | Count | Total KB | Avg KB | Type |
|---|---|---|---|---|
| .xls | 1,348,039 | 889,815,861 | 660 | collab |
| .ppt | 634,476 | 695,605,118 | 1,096 | collab |
| .mdb | 40,055 | 517,919,309 | 12,930 | database |
| .pst | 5,402 | 452,226,788 | 83,715 | database |
| .exe | 1,003,154 | 443,317,354 | 442 | code |
| .a | 6,448,513 | 395,120,258 | 61 | code |
| .Doc | 1,597,277 | 353,493,932 | 221 | collab |
| .zip | 95,400 | 350,289,986 | 3,672 | archive |
| .dat | 389,277 | 285,793,707 | 734 | code |
| .bak | 935,739 | 279,483,027 | 299 | archive |
| .wmv | 10,212 | 192,115,586 | 18,813 | media |
| .mpg | 4,466 | 146,872,916 | 32,887 | media |
| .avi | 23,233 | 137,761,598 | 5,930 | media |
| .dll | 609,241 | 131,857,838 | 216 | code |
| .txt | 796,876 | 131,488,253 | 165 | archive |
| .tif | 199,706 | 102,649,503 | 514 | media |
| .bmp | 666,019 | 99,430,540 | 149 | media |
| .cab | 154,217 | 99,184,009 | 643 | archive |
| .jpg | 1,310,690 | 90,419,583 | 69 | media |
| .pdb | 119,795 | 90,310,440 | 754 | database |
| .asf | 12,243 | 84,531,461 | 6,904 | media |
| .rtf | 512,300 | 83,360,156 | 163 | collab |
| .eps | 105,302 | 81,575,934 | 775 | media |
| .psd | 68,747 | 74,008,030 | 1,077 | media |
| .bkf | 120 | 69,849,449 | 582,079 | archive |
| .img | 6,302 | 66,741,144 | 10,590 | archive |
| .log | 326,832 | 63,391,414 | 194 | archive |
| .pch | 14,392 | 59,582,768 | 4,140 | code |
| .pdf | 88,828 | 53,645,798 | 604 | database |
| .b | 1,165,018 | 53,643,747 | 46 | code |
| .lib | 167,543 | 49,961,933 | 298 | code |
| .wav | 204,197 | 37,541,945 | 184 | media |
| .Msi | 5,753 | 36,796,121 | 6,396 | code |
| .dbg | 90,874 | 21,881,280 | 241 | archive |
| .obd | 65,715 | 21,462,285 | 327 | collab |
| .sys | 28,624 | 21,434,357 | 749 | code |
| .htm | 3,090,583 | 21,205,880 | 7 | code |
| .gho | 33 | 20,643,567 | 625,563 | archive |
| .h | 1,559,539 | 20,619,691 | 13 | code |
| .msg | 151,634 | 20,613,532 | 136 | archive |
| .tmp | 78,542 | 19,654,767 | 250 | archive |
| .qic | 116 | 19,047,170 | 164,200 | archive |
| .mdf | 1,149 | 17,563,350 | 15,286 | database |
| .wma | 7,480 | 17,425,905 | 2,330 | media |
| .pqi | 60 | 17,351,949 | 289,199 | archive |
| .cpp | 790,570 | 16,886,762 | 21 | code |
| .csv | 39,737 | 15,347,065 | 386 | archive |
| .gif | 2,694,004 | 14,898,135 | 6 | media |
| .hlp | 76,309 | 13,883,602 | 182 | archive |
| .chm | 27,153 | 13,615,917 | 501 | archive |
| .c | 549,619 | 13,043,244 | 24 | code |
| .bcp | 8,142 | 11,675,017 | 1,434 | code |
| .mix | 19,644 | 11,535,236 | 587 | media |
| .png | 117,184 | 11,462,540 | 98 | media |
| .ocx | 30,898 | 10,132,518 | 328 | code |
| .wmf | 415,077 | 10,057,656 | 24 | media |
| .mov | 2,560 | 9,340,224 | 3,649 | media |
| .z | 4,185 | 9,101,792 | 2,175 | code |
| .ost | 254 | 8,782,743 | 34,578 | pst |
| .mmf | 642 | 8,352,272 | 13,010 | archive |
| .pub | 9,174 | 8,217,713 | 896 | collab |
| .rpt | 44,362 | 7,903,887 | 178 | database |
| .mpeg | 477 | 7,675,777 | 16,092 | media |
| .DL_ | 91,698 | 7,632,741 | 83 | code |
| .iso | 297 | 7,324,475 | 24,662 | collab |
| .blg | 280 | 7,258,503 | 25,923 | code |
| .lsg | 625 | 6,902,644 | 11,044 | code |
| .dcl | 5,316 | 6,846,094 | 1,288 | code |
| .map | 37,304 | 6,571,169 | 176 | media |
| .dir | 18,136 | 6,278,136 | 346 | archive |
| .obj | 141,572 | 6,191,208 | 44 | code |
| .bin | 23,215 | 5,936,244 | 256 | code |
| .cap | 4,997 | 5,859,952 | 1,173 | media |
| .bsc | 2,302 | 5,581,085 | 2,424 | code |
| .ttf | 33,653 | 5,178,857 | 154 | media |
| .sbr | 25,935 | 5,140,563 | 198 | code |
| .al | 11,968 | 4,772,571 | 399 | code |
| .trc | 1,007 | 4,680,717 | 4,648 | archive |
| .idf | 14,416 | 4,572,195 | 317 | media |
| .ldf | 782 | 4,205,407 | 5,378 | database |
| .mvb | 495 | 3,803,284 | 7,683 | media |
| .xlk | 3,488 | 3,612,669 | 1,036 | archive |
| .Ex_ | 31,551 | 3,412,071 | 108 | code |
| .dot | 34,600 | 3,194,595 | 92 | archive |
| .res | 72,394 | 2,419,928 | 33 | code |
| .ilk | 4,913 | 2,324,942 | 473 | code |
| .msm | 5,474 | 2,005,124 | 366 | archive |
| .trn | 2,034 | 1,716,905 | 844 | archive |
| .opt | 13,721 | 1,625,095 | 118 | code |
| .out | 30,087 | 1,539,555 | 51 | archive |
| .evt | 1,222 | 1,493,896 | 1,223 | archive |
| .inst | 1,396 | 1,468,698 | 1,052 | code |
| .ivt | 266 | 1,446,890 | 5,439 | database |
| .pps | 945 | 1,443,456 | 1,527 | collab |
| .oab | 153 | 1,298,709 | 8,488 | code |
| .fpx | 1,267 | 864,848 | 683 | media |
| .arc | 1,423 | 771,163 | 542 | archive |
| .pkg | 3,668 | 739,020 | 201 | code |
| .odb | 956 | 722,256 | 755 | database |
| .warn | 2 | 661,256 | 330,628 | code |
| Total | 29,597,262 | 7,278,098,260 | 246 KB | |
Breaking this data down into buckets, you get groupings like this...
(see attachment if you don't see image)
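For a rough idea of how such a grouping can be produced, here is a small sketch that rolls rows up by their Type column. It uses only a handful of rows copied from the table above, so the shares it reports describe that sample and not the full 7 TB. The names `SAMPLE_ROWS` and `summarize_buckets` are illustrative.

```python
from collections import defaultdict

# (extension, count, total_kb, bucket): a few rows copied from the table above
SAMPLE_ROWS = [
    (".xls", 1_348_039, 889_815_861, "collab"),
    (".ppt", 634_476, 695_605_118, "collab"),
    (".mdb", 40_055, 517_919_309, "database"),
    (".pst", 5_402, 452_226_788, "database"),
    (".exe", 1_003_154, 443_317_354, "code"),
    (".zip", 95_400, 350_289_986, "archive"),
    (".wmv", 10_212, 192_115_586, "media"),
]

def summarize_buckets(rows):
    """Sum file count and size per bucket, plus each bucket's share of total size."""
    totals = defaultdict(lambda: {"count": 0, "total_kb": 0})
    for _ext, count, total_kb, bucket in rows:
        totals[bucket]["count"] += count
        totals[bucket]["total_kb"] += total_kb
    grand_kb = sum(t["total_kb"] for t in totals.values())
    return {
        bucket: {**t, "share": t["total_kb"] / grand_kb}
        for bucket, t in totals.items()
    }

if __name__ == "__main__":
    for bucket, t in sorted(summarize_buckets(SAMPLE_ROWS).items()):
        print(f"{bucket:10} {t['count']:>12,} files {t['total_kb']:>15,} KB {t['share']:.1%}")
```

Feeding in all 100 rows instead of the sample would give the full breakdown by count and by size, which are worth comparing: a bucket can hold few files yet dominate the storage.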

Just because it's in the collab bucket doesn't mean that it's a good target. Looking across a file server, or whatever it is you plan to migrate from, you'll notice old files and simple junk that will never be useful. If you have a way of aging the content, changing the culture around what's new, and slowly weaning people off the old system, you'll be further ahead.
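One simple way to start aging content is to profile files by how long ago they were last modified. The sketch below is an assumption about how you might do that, not a description of any particular tool: `age_profile` and its thresholds are hypothetical, and last-modified time is only a proxy for whether anyone still uses a file.

```python
import os
import time
from collections import Counter

def age_profile(root, now=None, thresholds_years=(1, 3, 5)):
    """Count files under `root` by years since last modification."""
    now = time.time() if now is None else now
    year = 365.25 * 24 * 3600  # seconds in an average year
    profile = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                mtime = os.path.getmtime(os.path.join(dirpath, name))
            except OSError:
                continue  # skip files we cannot stat
            age_years = (now - mtime) / year
            for limit in thresholds_years:
                if age_years < limit:
                    profile[f"< {limit}y"] += 1
                    break
            else:
                # older than every threshold
                profile[f">= {thresholds_years[-1]}y"] += 1
    return profile
```

Run per share, a profile like this can help separate shares that are still active from ones that are candidates for read-only and archive.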
This post is more for you to see an example of classifying data so you can visualize it and understand "what's out there." It also gives you an idea of the sizes and file types you'll come across. You'll have to make some decisions about what makes sense.
More on File Servers:
Is the File Server Dead?