Sample Data for Capacity Planning

I've been through many conversations that ended with: "OK, if we only knew how large the files in our company were, we could figure out this WAN thing, or figure out this indexing thing. What's the average Word doc size, or the average document size, in our company? Should we make something up?" Well, after looking at data from various samples, I put together some good ones, samples that I use myself. Word docs, for example, ranged from 100KB to about 300KB on average. The average file size across all types ranged from 100KB to 600KB. The samples below may help you figure out what to expect on a few different levels.

NOTE: The table below does not consider the new compressed sizes for Office 2007 file formats, so this may impact your planning (in a conservative way?).

Of course, by leveraging the object model you could construct your own real data. Hitting a file server should give you some data, but you'll find that it doesn't line up with what you *plan* to do anyway. This first table is SPS 2003 and WSS 2.0 data covering 2 million files, primarily from sites with the "team" template.

(If you're viewing this table in your RSS reader and it looks ugly, go to the online copy.)

 

File Type  Total #  Avg Versions  Avg Size (KB)  Avg Doc Visits  % of All Docs  % of All Doc Visits
doc         841561             1            363              11            32%                  29%
xls         456885             2            820              12            18%                  17%
ppt         346353             1           2021              14            13%                  15%
jpg         126935             0            347              11             5%                   4%
pdf         125726             0            745              15             5%                   6%
htm         122857             2             46              12             5%                   5%
gif          86426             0             13              10             3%                   3%
zip          56763             1           3888              11             2%                   2%
msg          43843             0            196              12             2%                   2%
vsd          43565             1            470              14             2%                   2%
xml          36205             2             71               7             1%                   1%
html         30993             2             18               9             1%                   1%
txt          26918             1            130               8             1%                   1%
mht          25047             2            319              18             1%                   1%
aspx         19576             3             10              55             1%                   3%
mpp          15492             2            381              12             1%                   1%
tif          13112             0           1572               7             1%                   0%
Total/Avg  2418257          1.18         671.18              14            94%                  93%
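
For planning purposes it may help to weight the average size by document count. The 671.18 KB figure in the summary row matches a simple mean of the 17 per-type averages; weighting each type by how many documents it has gives a noticeably higher number, because the large PPT and ZIP files are also fairly common. The sketch below recomputes both from the table. It ignores versions and the roughly 6% of documents in file types not listed, and the function and variable names are illustrative only.

```python
# (file type, document count, average size in KB), copied from the table above.
FILE_TYPES = [
    ("doc", 841561, 363), ("xls", 456885, 820), ("ppt", 346353, 2021),
    ("jpg", 126935, 347), ("pdf", 125726, 745), ("htm", 122857, 46),
    ("gif", 86426, 13), ("zip", 56763, 3888), ("msg", 43843, 196),
    ("vsd", 43565, 470), ("xml", 36205, 71), ("html", 30993, 18),
    ("txt", 26918, 130), ("mht", 25047, 319), ("aspx", 19576, 10),
    ("mpp", 15492, 381), ("tif", 13112, 1572),
]


def summarize(rows):
    """Return document count, both averages (KB), and total size (GB)."""
    total_docs = sum(count for _, count, _ in rows)
    total_kb = sum(count * size_kb for _, count, size_kb in rows)
    return {
        "total_docs": total_docs,
        # Simple mean of the per-type averages (what the summary row shows).
        "unweighted_avg_kb": sum(size_kb for _, _, size_kb in rows) / len(rows),
        # Mean size of a document picked at random from the listed types.
        "weighted_avg_kb": total_kb / total_docs,
        # Current versions only; 1 GB = 1024 * 1024 KB.
        "total_gb": total_kb / (1024 * 1024),
    }


summary = summarize(FILE_TYPES)
print(f"Documents counted:        {summary['total_docs']}")
print(f"Unweighted avg size (KB): {summary['unweighted_avg_kb']:.2f}")
print(f"Weighted avg size (KB):   {summary['weighted_avg_kb']:.2f}")
print(f"Estimated storage (GB):   {summary['total_gb']:.0f}")
```

The weighted average comes out at roughly 750 KB per document, which is the more useful figure when multiplying by an expected document count.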

 

Let me add a few other observations. Most templates do enable versions for newly provisioned sites; the document workspace, for example, enables versions by default. Is it interesting to note that some files are visited more frequently than others? Does knowing that PPT files average around 2-2.5 MB help you with WAN and latency planning? If 30% of all your files are Word docs and they are 300KB each, then optimizing for those files may help.
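
As a rough illustration of the WAN question, the sketch below estimates how long an average file of each type would take to download over a few links. The link speeds and round-trip times are made-up examples, not measurements, and the formula is only a lower bound: it ignores protocol overhead, TCP ramp-up, and contention from other traffic.

```python
# Average sizes in KB, taken from the table in this post.
AVG_SIZE_KB = {"doc": 363, "xls": 820, "ppt": 2021}

# Hypothetical links: name -> (bandwidth in kilobits per second, round trip in ms).
LINKS = {
    "branch office 512 kbps": (512, 150),
    "T1 1544 kbps": (1544, 80),
    "LAN 100 Mbps": (100_000, 1),
}


def transfer_seconds(size_kb, link_kbps, round_trip_ms=0.0, round_trips=1):
    """Serialization time for the file plus a fixed number of round trips."""
    bits = size_kb * 1024 * 8
    return bits / (link_kbps * 1000) + round_trips * round_trip_ms / 1000


for link_name, (kbps, rtt_ms) in LINKS.items():
    for file_type, size_kb in AVG_SIZE_KB.items():
        seconds = transfer_seconds(size_kb, kbps, rtt_ms)
        print(f"{link_name:24} {file_type:4} {seconds:8.2f} s")
```

Even with these optimistic assumptions, an average PPT takes more than ten seconds on the T1 example, which is the kind of number that makes the file-type mix worth knowing.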

Maybe looking at this list, or your own list, will help you understand which IFilters you should consider. Looking at this environment, I'd think they need to consider indexing of TIF, ZIP (free beta, x86 & x64), and PDF. Those IFilters and their impact on indexing performance are definitely something I'd be concerned about as well. The popularity of .MSG could imply a number of things, including the fact that people want to retain data from email. Jopx gets a kick for his link to IFilters, including CAD and audio/video (nice list!).

Additional Reading: Indexing pdf documents with Adobe Reader v.8 and MOSS 2007

I hope you find this reference useful.  Have a WSS 2.0 or SPS 2003 environment you're considering upgrading and looking to get this type of data?  Attached is a 25K compressed utility to gather information on WSS 2.0/SPS 2003.  I've begged people to build one for 2007.  I'll let you know when it's available.  Don't extrapolate too much from this chart, but I hope you find it useful.

I was working with a team that was trying to figure out the typical site size. A year ago, I ran a report and came up with these results for "Team" sites at Microsoft across 1TB of data. Obviously, the quota and whether self-service is enabled will play a part.

Site Collection Sizes (with quota at 500MB) using airline booking:

Percent Large (250MB+): 4%
Percent Medium (5-250MB): 34%
Percent Small (under 5MB): 62%
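
One way to read this distribution is as a bound on how far actual usage falls below the quota. The sketch below assumes every site collection sits at the top of its bucket (5MB, 250MB, or the full 500MB quota), which overstates real usage, and compares that ceiling with the quota. Treating this as the "airline booking" idea of overcommitting quota is an interpretation, not something the numbers above state, and the names in the code are illustrative.

```python
QUOTA_MB = 500

# (share of site collections, upper bound of the size bucket in MB)
BUCKETS = [
    (0.04, QUOTA_MB),  # large: 250MB up to the quota
    (0.34, 250),       # medium: 5-250MB
    (0.62, 5),         # small: under 5MB
]


def expected_ceiling_mb(buckets):
    """Average size per site collection if each one filled its bucket."""
    return sum(share * ceiling_mb for share, ceiling_mb in buckets)


ceiling_mb = expected_ceiling_mb(BUCKETS)
overcommit_ratio = QUOTA_MB / ceiling_mb

print(f"Worst-case average per site collection: {ceiling_mb:.1f} MB")
print(f"Quota is at least {overcommit_ratio:.1f}x the average actually used")
```

Under these assumptions the average site collection uses at most about 108MB, so sizing storage as if every site collection will reach its 500MB quota would over-provision by more than a factor of four.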

What about how many users per site?

Average Number of Users per Site: 47

Average Number of Domain Groups per Site: 1

Average Number of Readers (User) per Site: 24

Average Number of Contributors (User) per Site: 24

Average Number of Web Designers (User) per Site: 4

Average Number of Administrators (User) per Site: 4

 

 

List Size Distribution: 

Total Number of Lists: 302156

Total Number Large (500+ items): 471

Total Number Medium (25-500 items): 15588

Total Number Small (fewer than 25 items): 127965

Average Number of Items: 22

WSS Templates:

Team: 50% of all, 75% of usage
Doc workspace: 23%, 10% of usage
Meeting workspace(s): 18%, 3% of usage
Blank: 7%, 12% of usage

For quick reference, if you're trying to build fake sites for capacity planning reasons, you may find this tool (SharePoint data population tool - WSSDW.exe) helpful. http://www.codeplex.com/sptdatapop

 

Published Saturday, April 21, 2007 12:47 AM by joelo
Attachment(s): SharePoint_Reports_7_13_04_release.zip

Comments

Saturday, April 21, 2007 7:32 PM by Bob Fox

# re: Sample Data for Capacity Planning

Excellent info Joel.  Thanks

Monday, April 23, 2007 7:36 AM by Tom

# re: Sample Data for Capacity Planning

The best solution is for Adobe to create a working IFilter for 64-bit. Thank you for writing this and all your other articles, they are all very helpful!!

Thursday, April 26, 2007 3:13 AM by Arno Nel 2.0 - the Strategic Architect

# Sharepoint Weekly 3

Joel does a great job once again with " Sample Data for Capacity Planning ". What is really interesting

Wednesday, September 12, 2007 11:08 AM by The Boiler Room - Mark Kruger, Microsoft SharePoint MVP

# 2007 MOSS Resource Links (Microsoft Office SharePoint Server)

2007 MOSS Resource Links (Microsoft Office SharePoint Server) Here is an assortment of various 2007 Microsoft
