Sample Data for Capacity Planning
I've been through many conversations that ended with: "OK, if we only knew how large the files were in our company, then we could figure out this WAN thing, or figure out this indexing thing. What's the average Word doc or average document size in our company? Should we make something up?" Well, after looking at data from various samples, I put together some good ones that I use myself. Word docs, for example, ranged from 100KB on average to about 300KB. The average file size overall ranged from 100KB to 600KB. The samples below may help you figure out what to expect on a few different levels.
NOTE: The table below does not consider the new compressed sizes for Office 2007 file formats, so this may impact your planning (in a conservative way?).
Of course, leveraging the object model, you could construct your own real data. Hitting a file server should give you some data, but you'll find that it doesn't line up with what you *plan* to do anyway. This first table is SPS 2003 and WSS 2.0 data covering over 2 million files, primarily from sites with the "team" template.
(If you're viewing this table in your RSS reader and it looks ugly, go to the online copy.)
| File Type | Total # | Avg # Versions | Avg Size (KB) | Avg Doc Visits | % of All Docs | % of All Doc Visits |
|---|---|---|---|---|---|---|
| doc | 841561 | 1 | 363 | 11 | 32% | 29% |
| xls | 456885 | 2 | 820 | 12 | 18% | 17% |
| ppt | 346353 | 1 | 2021 | 14 | 13% | 15% |
| jpg | 126935 | 0 | 347 | 11 | 5% | 4% |
| pdf | 125726 | 0 | 745 | 15 | 5% | 6% |
| htm | 122857 | 2 | 46 | 12 | 5% | 5% |
| gif | 86426 | 0 | 13 | 10 | 3% | 3% |
| zip | 56763 | 1 | 3888 | 11 | 2% | 2% |
| msg | 43843 | 0 | 196 | 12 | 2% | 2% |
| vsd | 43565 | 1 | 470 | 14 | 2% | 2% |
| xml | 36205 | 2 | 71 | 7 | 1% | 1% |
| html | 30993 | 2 | 18 | 9 | 1% | 1% |
| txt | 26918 | 1 | 130 | 8 | 1% | 1% |
| mht | 25047 | 2 | 319 | 18 | 1% | 1% |
| aspx | 19576 | 3 | 10 | 55 | 1% | 3% |
| mpp | 15492 | 2 | 381 | 12 | 1% | 1% |
| tif | 13112 | 0 | 1572 | 7 | 1% | 0% |
| Total / Avg | 2418257 | 1.18 | 671.18 | 14 | 94% | 93% |
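A note on the summary row: 671.18 KB is the plain mean of the 17 per-type "Avg Size" values, which is not the same as the average size of a file picked at random. As a rough sketch, assuming the per-type averages are accurate and that "KB" means 1024 bytes (my assumption, the data doesn't say), you could compute the count-weighted average and a total volume estimate like this. The function names are just illustrative.

```python
# (file type, file count, average size in KB), copied from the table above
ROWS = [
    ("doc", 841561, 363), ("xls", 456885, 820), ("ppt", 346353, 2021),
    ("jpg", 126935, 347), ("pdf", 125726, 745), ("htm", 122857, 46),
    ("gif", 86426, 13), ("zip", 56763, 3888), ("msg", 43843, 196),
    ("vsd", 43565, 470), ("xml", 36205, 71), ("html", 30993, 18),
    ("txt", 26918, 130), ("mht", 25047, 319), ("aspx", 19576, 10),
    ("mpp", 15492, 381), ("tif", 13112, 1572),
]


def total_files(rows):
    """Number of files across all listed types."""
    return sum(count for _, count, _ in rows)


def unweighted_avg_kb(rows):
    """Mean of the per-type averages (every file type counts equally)."""
    return sum(size for _, _, size in rows) / len(rows)


def weighted_avg_kb(rows):
    """Average size per file (every file counts equally)."""
    total_kb = sum(count * size for _, count, size in rows)
    return total_kb / total_files(rows)


def total_volume_gb(rows):
    """Estimated volume in GB, ASSUMING 1 KB = 1024 bytes."""
    total_kb = sum(count * size for _, count, size in rows)
    return total_kb / (1024 * 1024)


if __name__ == "__main__":
    print(f"files:          {total_files(ROWS)}")
    print(f"unweighted avg: {unweighted_avg_kb(ROWS):.2f} KB")
    print(f"weighted avg:   {weighted_avg_kb(ROWS):.2f} KB")
    print(f"est. volume:    {total_volume_gb(ROWS):.0f} GB")
```

On these figures the weighted average comes out higher than the unweighted one (roughly 750 KB), mostly because PPT files are both large and numerous. For storage sizing, the weighted number is probably the one you want.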
Let me add a few other observations. Most templates do enable versions for newly provisioned sites. The document workspace, for example, enables versions by default. Is it interesting to note that some files are visited more frequently than others? Does knowing that PPT files average around 2-2.5 MB help you with WAN and latency planning? If 30% of all your files are Word docs and they are 300KB each, then optimizing for those files may help.
Maybe looking at this list, or your own list, will help you understand which IFilters you should consider. Looking at this environment, I'd think they need to consider indexing of TIF, ZIP (free beta x86 & x64), and PDF. Those IFilters and their impact on indexing performance are definitely something I'd be concerned about as well. The popularity of .MSG could imply a number of things, including the fact that people want to retain data from email. Jopx gets credit for his link to IFilters including CAD and Audio/Video (nice list!).
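To make the WAN question concrete, here is a back-of-the-envelope sketch of how long one file takes to cross a link. The 1.5 Mbps link speed and the `efficiency` factor are hypothetical values I picked for illustration, and I'm assuming 1 KB = 1024 bytes. It ignores latency, protocol chatter, and compression, so treat the result as a floor rather than a prediction.

```python
def transfer_seconds(size_kb, link_mbps, efficiency=1.0):
    """Best-case seconds to move size_kb over a link of link_mbps.

    efficiency is the (assumed) fraction of the link actually usable,
    e.g. 0.7 if other traffic and overhead consume 30% of it.
    """
    if link_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_mbps must be > 0 and efficiency in (0, 1]")
    bits = size_kb * 1024 * 8                      # ASSUMES 1 KB = 1024 bytes
    usable_bits_per_second = link_mbps * 1_000_000 * efficiency
    return bits / usable_bits_per_second


if __name__ == "__main__":
    # Average sizes taken from the table; the link speed is hypothetical.
    for file_type, avg_kb in [("doc", 363), ("xls", 820), ("ppt", 2021)]:
        seconds = transfer_seconds(avg_kb, link_mbps=1.5)
        print(f"{file_type}: {seconds:.1f} s")
```

Swap in your own link speeds and your own average sizes. The point is only that the file types your users open most often are the ones worth running through this kind of estimate.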
Additional Reading: Indexing pdf documents with Adobe Reader v.8 and MOSS 2007
I hope you find this reference useful. Have a WSS 2.0 or SPS 2003 environment you're considering upgrading and looking to get this type of data? Attached is a 25K compressed utility to gather information on WSS 2.0/SPS 2003. I've begged people to build one for 2007. I'll let you know when it's available. Don't extrapolate too much from this chart, but I hope you find it useful.
I was working with a team that was trying to figure out the typical site size. A year ago, I ran a report and came up with these results for "Team" sites at Microsoft across 1TB of data. Obviously the quota, and whether self-service site creation is enabled or not, will play into this.
Site Collection Sizes (with quota at 500MB) using airline booking:
| Percent Large (250MB+): 4% |
| Percent Medium (5-250MB): 34% |
| Percent Small (5MB-): 62% |
What about how many users per site?
Average Number of Users per Site: 47
Average Number of Domain Groups per Site: 1
Average Number of Readers (User) per Site: 24
Average Number of Contributors (User) per Site: 24
Average Number of Web Designers (User) per Site: 4
Average Number of Administrators (User) per Site: 4
List Size Distribution:
Total Number of Lists: 302156
Total Number Large (500+ items): 471
Total Number Medium (25-500 items): 15588
Total Number Small (25- items): 127965
Average Number of Items: 22
WSS Templates:
Team: 50% of all, 75% of usage
Doc workspace: 23%, 10% of usage
Meeting workspace(s): 18%, 3% of usage
Blank: 7%, 12% of usage
For quick reference, if you're trying to build fake sites for capacity planning reasons, you may find this tool (SharePoint data population tool - WSSDW.exe) helpful. http://www.codeplex.com/sptdatapop
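If you do build fake sites, the template mix above can be turned into per-template site counts. This is only a sketch: the function name and the 1,000-site target are mine, and it is not tied to any particular population tool. Note that the published shares add up to 98%, so a few sites are left unassigned for you to place however you like.

```python
# Share of all sites per template (percent), from the "WSS Templates" list
TEMPLATE_MIX = {
    "Team": 50,
    "Doc workspace": 23,
    "Meeting workspace": 18,
    "Blank": 7,
}


def plan_site_counts(total_sites, mix_percent=TEMPLATE_MIX):
    """Split total_sites across templates by percentage, rounding down."""
    return {name: total_sites * pct // 100 for name, pct in mix_percent.items()}


if __name__ == "__main__":
    target = 1000  # hypothetical size of the test environment
    counts = plan_site_counts(target)
    for name, count in counts.items():
        print(f"{name}: {count}")
    print(f"unassigned: {target - sum(counts.values())}")
```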