View Source

For the first go-round of the Wide Finder project, the idea was to mimic the behavior of the following Ruby program:

{noformat}
counts = {}
counts.default = 0

ARGF.each_line do |line|
if line =~ %r{GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) }
counts[$1] += 1
end
end

keys_by_count = counts.keys.sort_by{ |key| -counts[key] }
keys_by_count[0 .. 9].each do |key|
puts "#{counts[key]}: #{key}"
end
{noformat}

It has been pointed out that this could be expressed as a Unix one-liner with awk and sort. Suggestions are welcomed to increase the degree of difficulty and increase the element of computation.

h3. WF2 Benchmark

Based on discussion on this page, I have proposed, and have heard no real objections, that we adopt a suggestion originally entitled "normal HTML report", as follows:

* Report the top 10 URIs, by hits, which are articles (much the same regex as before)
* Report the top 10 URIs of any kind by aggregate byte count downloaded
* Report the top 10 URIs of any kind that returned 404 Not Found
* Report the top 10 client addresses that fetched hit-counted articles
* Report the top 10 referrers for hit-counted articles

Here is a naive implementation in 78 lines of Ruby code, followed by discussion:

{noformat}
class Hash

def top(howmany)
count = 0
keys = []
min = 0
self.each do |key, val|
if val > min
keys << key
keys.sort! do |k1, k2|
diff = self[k2] - self[k1]
(diff != 0) ? diff : k1 <=> k2
end
keys.pop if keys.length > howmany
min = self[keys[-1]]
end
end
keys
end
end

@u_hits = {}
@u_bytes = {}
@s404s = {}
@clients = {}
@refs = {}
@count = 0

@u_hits.default = @u_bytes.default = @s404s.default =
@clients.default = @refs.default = 0

def record(client, u, bytes, ref)
@u_bytes[u] += bytes
@count += 1
if u =~ %r{^/ongoing/When/\d\d\dx/\d\d\d\d/\d\d/\d\d/[^ .]+$}
@u_hits[u] += 1
@clients[client] += 1
unless (ref == '"-"' || ref =~ %r{^\"http://www.tbray.org/ongoing/})
@refs[ref[1 .. -2]] += 1 # lose the quotes
end
end
end

def report(label, hash, shrink = false)
puts "Top #{label}:"
keys_by_count = hash.top(10)
fmt = (shrink) ? " %9.1fM: %s\n" : " %10d: %s\n"
keys_by_count.each do |key|
pkey = (key.length > 60) ? key[0 .. 59] + "..." : key
hash[key] = hash[key] / (1024.0 * 1024.0) if shrink
printf fmt, hash[key], pkey
end
puts
end

ARGF.each_line do |line|
f = line.split(/\s+/)
next unless f[5] == '"GET'
client, u, status, bytes, ref = f[0], f[6], f[8], f[9], f[10]

if status == '200'
record(client, u, bytes.to_i, ref)
elsif status == '304'
record(client, u, 0, ref)
elsif status == '404'
@s404s[u] += 1
end
end

print "#{@u_hits.size} resources, #{@s404s.size} 404s, #{@clients.size} clients\n\n"

report('URIs by hit', @u_hits)
report('URIs by bytes', @u_bytes, true)
report('404s', @s404s)
report('client addresses', @clients)
report('referrers', @refs)
{noformat}

Here's my thinking as to why this is appropriate. The Wide Finder challenge, as I see it, is to find a way to run crushingly ordinary, boring code on many-core machines with good performance and with minimal tax on the programmer. This sample strikes me as crushingly ordinary, but not so simple that you can do it with a one-liner; it should remain substantially but not exclusively I/O dominated. There is a wrinkle; I'm pretty sure some of the numbers will get out of 32-bit integer range.

h3. (Saved) Proposed-test Discussion:
* compute the average articles and bytes downloaded per hour-of-day or day-of-week
* classify fetches and bandwidth devoted to (a) RSS/Atom syndication fees, (b) articles, (c) CSS/Javascript, (d) images
* fit curves modeling days since post versus number of hits for an article; generate curves for each category; report top categories
* generate an interpage navigation graph where nodes are pages and the edges are labelled with the number of times a link was followed from one page to another
* calculate session statistics, e.g. duration, pages per session, time between return visits, ...
* A normal HTML report: Top 10 popular URLs, top 10 404s, top 10 agents, top 10 client IPs (by
hits or bytes), and the top 10 referrers.
* (your suggestion here)

Generally it would be good to have one problem that is easily partitioned across processes/machines vs one that has interdependencies between lines e.g. navigation graph above

h3. Performance Measures:
* Elapsed time to complete test
* CPU time consumed during test
* Peak memory consumption

h3. Complexity Measures:
* LOCS
* Number of required runtime parameters (e.g. specifying the number of threads/processes to spawn)
* Usage of non-idiomatic constructs (e.g. using a byte array instead of a built-in string class)
* Subjective assessment of complexity (poll?)

h3. Robustness Measures:
* Maturity of underlying language and runtime
* Maturity of underlying libraries (e.g. production library, experimental library, library developed just for Wide Finder)
* Handling of non-ASCII character sets
* Whether the solution is maintainable code e.g. parsing and iteration code can be refactored into a common module for all benchmarks.

The individuals who post here are part of the extended Sun Microsystems community and they might not be employed or in any way formally affiliated with Sun Microsystems. The opinions expressed here are their own, are not necessarily reviewed in advance by anyone but the individual authors, and neither Sun nor any other party necessarily agrees with them.

Copyright 1994-2009 Sun Microsystems, Inc.
Powered by Atlassian Confluence
Sun Guidelines on Public Discourse Privacy Policy Terms of Use Trademarks Site Map Employment Investor Relations Contact