For the first go-round of the Wide Finder project, the idea was to mimic the behavior of the following Ruby program:
{noformat}
counts = {}
counts.default = 0
ARGF.each_line do |line|
if line =~ %r{GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) }
counts[$1] += 1
end
end
keys_by_count = counts.keys.sort_by{ |key| -counts[key] }
keys_by_count[0 .. 9].each do |key|
puts "#{counts[key]}: #{key}"
end
{noformat}
It has been pointed out that this could be expressed as a Unix one-liner with awk and sort. Suggestions are welcomed to increase the degree of difficulty and increase the element of computation.
h3. WF2 Benchmark
Based on discussion on this page, I have proposed, and have heard no real objections, that we adopt a suggestion originally entitled "normal HTML report", as follows:
* Report the top 10 URIs, by hits, which are articles (much the same regex as before)
* Report the top 10 URIs of any kind by aggregate byte count downloaded
* Report the top 10 URIs of any kind that returned 404 Not Found
* Report the top 10 client addresses that fetched hit-counted articles
* Report the top 10 referrers for hit-counted articles
Here is a naive implementation in 78 lines of Ruby code, followed by discussion:
{noformat}
class Hash
def top(howmany)
count = 0
keys = []
min = 0
self.each do |key, val|
if val > min
keys << key
keys.sort! do |k1, k2|
diff = self[k2] - self[k1]
(diff != 0) ? diff : k1 <=> k2
end
keys.pop if keys.length > howmany
min = self[keys[-1]]
end
end
keys
end
end
@u_hits = {}
@u_bytes = {}
@s404s = {}
@clients = {}
@refs = {}
@count = 0
@u_hits.default = @u_bytes.default = @s404s.default =
@clients.default = @refs.default = 0
def record(client, u, bytes, ref)
@u_bytes[u] += bytes
@count += 1
if u =~ %r{^/ongoing/When/\d\d\dx/\d\d\d\d/\d\d/\d\d/[^ .]+$}
@u_hits[u] += 1
@clients[client] += 1
unless (ref == '"-"' || ref =~ %r{^\"http://www.tbray.org/ongoing/})
@refs[ref[1 .. -2]] += 1 # lose the quotes
end
end
end
def report(label, hash, shrink = false)
puts "Top #{label}:"
keys_by_count = hash.top(10)
fmt = (shrink) ? " %9.1fM: %s\n" : " %10d: %s\n"
keys_by_count.each do |key|
pkey = (key.length > 60) ? key[0 .. 59] + "..." : key
hash[key] = hash[key] / (1024.0 * 1024.0) if shrink
printf fmt, hash[key], pkey
end
puts
end
ARGF.each_line do |line|
f = line.split(/\s+/)
next unless f[5] == '"GET'
client, u, status, bytes, ref = f[0], f[6], f[8], f[9], f[10]
if status == '200'
record(client, u, bytes.to_i, ref)
elsif status == '304'
record(client, u, 0, ref)
elsif status == '404'
@s404s[u] += 1
end
end
print "#{@u_hits.size} resources, #{@s404s.size} 404s, #{@clients.size} clients\n\n"
report('URIs by hit', @u_hits)
report('URIs by bytes', @u_bytes, true)
report('404s', @s404s)
report('client addresses', @clients)
report('referrers', @refs)
{noformat}
Here's my thinking as to why this is appropriate. The Wide Finder challenge, as I see it, is to find a way to run crushingly ordinary, boring code on many-core machines with good performance and with minimal tax on the programmer. This sample strikes me as crushingly ordinary, but not so simple that you can do it with a one-liner; it should remain substantially but not exclusively I/O dominated. There is a wrinkle; I'm pretty sure some of the numbers will get out of 32-bit integer range.
h3. (Saved) Proposed-test Discussion:
* compute the average articles and bytes downloaded per hour-of-day or day-of-week
* classify fetches and bandwidth devoted to (a) RSS/Atom syndication fees, (b) articles, (c) CSS/Javascript, (d) images
* fit curves modeling days since post versus number of hits for an article; generate curves for each category; report top categories
* generate an interpage navigation graph where nodes are pages and the edges are labelled with the number of times a link was followed from one page to another
* calculate session statistics, e.g. duration, pages per session, time between return visits, ...
* A normal HTML report: Top 10 popular URLs, top 10 404s, top 10 agents, top 10 client IPs (by
hits or bytes), and the top 10 referrers.
* (your suggestion here)
Generally it would be good to have one problem that is easily partitioned across processes/machines vs one that has interdependencies between lines e.g. navigation graph above
h3. Performance Measures:
* Elapsed time to complete test
* CPU time consumed during test
* Peak memory consumption
h3. Complexity Measures:
* LOCS
* Number of required runtime parameters (e.g. specifying the number of threads/processes to spawn)
* Usage of non-idiomatic constructs (e.g. using a byte array instead of a built-in string class)
* Subjective assessment of complexity (poll?)
h3. Robustness Measures:
* Maturity of underlying language and runtime
* Maturity of underlying libraries (e.g. production library, experimental library, library developed just for Wide Finder)
* Handling of non-ASCII character sets
* Whether the solution is maintainable code e.g. parsing and iteration code can be refactored into a common module for all benchmarks.
{noformat}
counts = {}
counts.default = 0
ARGF.each_line do |line|
if line =~ %r{GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) }
counts[$1] += 1
end
end
keys_by_count = counts.keys.sort_by{ |key| -counts[key] }
keys_by_count[0 .. 9].each do |key|
puts "#{counts[key]}: #{key}"
end
{noformat}
It has been pointed out that this could be expressed as a Unix one-liner with awk and sort. Suggestions are welcomed to increase the degree of difficulty and increase the element of computation.
h3. WF2 Benchmark
Based on discussion on this page, I have proposed, and have heard no real objections, that we adopt a suggestion originally entitled "normal HTML report", as follows:
* Report the top 10 URIs, by hits, which are articles (much the same regex as before)
* Report the top 10 URIs of any kind by aggregate byte count downloaded
* Report the top 10 URIs of any kind that returned 404 Not Found
* Report the top 10 client addresses that fetched hit-counted articles
* Report the top 10 referrers for hit-counted articles
Here is a naive implementation in 78 lines of Ruby code, followed by discussion:
{noformat}
class Hash
def top(howmany)
count = 0
keys = []
min = 0
self.each do |key, val|
if val > min
keys << key
keys.sort! do |k1, k2|
diff = self[k2] - self[k1]
(diff != 0) ? diff : k1 <=> k2
end
keys.pop if keys.length > howmany
min = self[keys[-1]]
end
end
keys
end
end
@u_hits = {}
@u_bytes = {}
@s404s = {}
@clients = {}
@refs = {}
@count = 0
@u_hits.default = @u_bytes.default = @s404s.default =
@clients.default = @refs.default = 0
def record(client, u, bytes, ref)
@u_bytes[u] += bytes
@count += 1
if u =~ %r{^/ongoing/When/\d\d\dx/\d\d\d\d/\d\d/\d\d/[^ .]+$}
@u_hits[u] += 1
@clients[client] += 1
unless (ref == '"-"' || ref =~ %r{^\"http://www.tbray.org/ongoing/})
@refs[ref[1 .. -2]] += 1 # lose the quotes
end
end
end
def report(label, hash, shrink = false)
puts "Top #{label}:"
keys_by_count = hash.top(10)
fmt = (shrink) ? " %9.1fM: %s\n" : " %10d: %s\n"
keys_by_count.each do |key|
pkey = (key.length > 60) ? key[0 .. 59] + "..." : key
hash[key] = hash[key] / (1024.0 * 1024.0) if shrink
printf fmt, hash[key], pkey
end
puts
end
ARGF.each_line do |line|
f = line.split(/\s+/)
next unless f[5] == '"GET'
client, u, status, bytes, ref = f[0], f[6], f[8], f[9], f[10]
if status == '200'
record(client, u, bytes.to_i, ref)
elsif status == '304'
record(client, u, 0, ref)
elsif status == '404'
@s404s[u] += 1
end
end
print "#{@u_hits.size} resources, #{@s404s.size} 404s, #{@clients.size} clients\n\n"
report('URIs by hit', @u_hits)
report('URIs by bytes', @u_bytes, true)
report('404s', @s404s)
report('client addresses', @clients)
report('referrers', @refs)
{noformat}
Here's my thinking as to why this is appropriate. The Wide Finder challenge, as I see it, is to find a way to run crushingly ordinary, boring code on many-core machines with good performance and with minimal tax on the programmer. This sample strikes me as crushingly ordinary, but not so simple that you can do it with a one-liner; it should remain substantially but not exclusively I/O dominated. There is a wrinkle; I'm pretty sure some of the numbers will get out of 32-bit integer range.
h3. (Saved) Proposed-test Discussion:
* compute the average articles and bytes downloaded per hour-of-day or day-of-week
* classify fetches and bandwidth devoted to (a) RSS/Atom syndication fees, (b) articles, (c) CSS/Javascript, (d) images
* fit curves modeling days since post versus number of hits for an article; generate curves for each category; report top categories
* generate an interpage navigation graph where nodes are pages and the edges are labelled with the number of times a link was followed from one page to another
* calculate session statistics, e.g. duration, pages per session, time between return visits, ...
* A normal HTML report: Top 10 popular URLs, top 10 404s, top 10 agents, top 10 client IPs (by
hits or bytes), and the top 10 referrers.
* (your suggestion here)
Generally it would be good to have one problem that is easily partitioned across processes/machines vs one that has interdependencies between lines e.g. navigation graph above
h3. Performance Measures:
* Elapsed time to complete test
* CPU time consumed during test
* Peak memory consumption
h3. Complexity Measures:
* LOCS
* Number of required runtime parameters (e.g. specifying the number of threads/processes to spawn)
* Usage of non-idiomatic constructs (e.g. using a byte array instead of a built-in string class)
* Subjective assessment of complexity (poll?)
h3. Robustness Measures:
* Maturity of underlying language and runtime
* Maturity of underlying libraries (e.g. production library, experimental library, library developed just for Wide Finder)
* Handling of non-ASCII character sets
* Whether the solution is maintainable code e.g. parsing and iteration code can be refactored into a common module for all benchmarks.