... For the first go-round of the Wide Finder project, the idea was to mimic the behavior of the following Ruby program:
{noformat} counts = {} counts.default = 0
ARGF.each_line do |line| if line =~ %r{GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) } counts[$1] += 1 end end
keys_by_count = counts.keys.sort_by{ |key| -counts[key] } keys_by_count[0 .. 9].each do |key| puts "#{counts[key]}: #{key}" end {noformat}
It has been pointed out that this could be expressed as a Unix one-liner with awk and sort. Suggestions are welcomed to increase the degree of difficulty and increase the element of computation.
h3. WF2 Benchmark
Based on discussion on this page, I have proposed, and have heard no real objections, that we adopt a suggestion originally entitled "normal HTML report", as follows:
* Report the top 10 URIs, by hits, which are articles (much the same regex as before) * Report the top 10 URIs of any kind by aggregate byte count downloaded * Report the top 10 URIs of any kind that returned 404 Not Found * Report the top 10 client addresses that fetched hit-counted articles |
... class Hash
def top(howmany) count = 0 keys = [] min = 0 self.each do |key, val| if val > min keys << key keys.sort! do |k1, k2| diff = self[k2] - self[k1] (diff != 0) ? diff : k1 <=> k2 end keys.pop if keys.length > howmany min = self[keys[-1]] end end keys end end @u_hits = {} @u_bytes = {} @s404s = {} @clients = {} @refs = {} @count = 0
@u_hits.default = @u_bytes.default = @s404s.default = @clients.default = @refs.default = 0
def record(client, u, bytes, ref) @u_bytes[u] += bytes @count += 1 if u =~ %r{^/ongoing/When/\d\d\dx/\d\d\d\d/\d\d/\d\d/[^ .]+$} @u_hits[u] += 1 @clients[client] += 1 unless (ref == '"-"' || ref =~ %r{^\"http://www.tbray.org/ongoing/}) @refs[ref[1 .. -2]] += 1 # lose the quotes end end end
def report(label, hash, shrink = false) puts "Top #{label}:" keys_by_count = hash.top(10) fmt = (shrink) ? " %9.1fM: %s\n" : " %10d: %s\n" keys_by_count.each do |key| pkey = (key.length > 60) ? key[0 .. 59] + "..." : key hash[key] = hash[key] / (1024.0 * 1024.0) if shrink printf fmt, hash[key], pkey end puts end
ARGF.each_line do |line| f = line.split(/\s+/) next unless f[5] == '"GET' client, u, status, bytes, ref = f[0], f[6], f[8], f[9], f[10]
if status == '200' record(client, u, bytes.to_i, ref) elsif status == '304' record(client, u, 0, ref) elsif status == '404' @s404s[u] += 1 end end
print "#{@u_hits.size} resources, #{@s404s.size} 404s, #{@clients.size} clients\n\n"
report('URIs by hit', @u_hits) report('URIs by bytes', @u_bytes, true) report('404s', @s404s) report('client addresses', @clients) report('referrers', @refs) {noformat}
Here's my thinking as to why this is appropriate. The Wide Finder challenge, as I see it, is to find a way to run crushingly ordinary, boring code on many-core machines with good performance and with minimal tax on the programmer. This sample strikes me as crushingly ordinary, but not so simple that you can do it with a one-liner; it should remain substantially but not exclusively I/O dominated. There is a wrinkle; I'm pretty sure some of the numbers will get out of 32-bit integer range.
h3. (Saved) Proposed-test Discussion: * compute the average articles and bytes downloaded per hour-of-day or day-of-week * classify fetches and bandwidth devoted to (a) RSS/Atom syndication fees, (b) articles, (c) CSS/Javascript, (d) images * fit curves modeling days since post versus number of hits for an article; generate curves for each category; report top categories * generate an interpage navigation graph where nodes are pages and the edges are labelled with the number of times a link was followed from one page to another * calculate session statistics, e.g. duration, pages per session, time between return visits, ... * A normal HTML report: Top 10 popular URLs, top 10 404s, top 10 agents, top 10 client IPs (by hits or bytes), and the top 10 referrers. * (your suggestion here)
Generally it would be good to have one problem that is easily partitioned across processes/machines vs one that has interdependencies between lines e.g. navigation graph above
h3. Performance Measures: * Elapsed time to complete test * CPU time consumed during test * Peak memory consumption
h3. Complexity Measures: * LOCS * Number of required runtime parameters (e.g. specifying the number of threads/processes to spawn) * Usage of non-idiomatic constructs (e.g. using a byte array instead of a built-in string class) * Subjective assessment of complexity (poll?)
h3. Robustness Measures: * Maturity of underlying language and runtime * Maturity of underlying libraries (e.g. production library, experimental library, library developed just for Wide Finder) * Handling of non-ASCII character sets * Whether the solution is maintainable code e.g. parsing and iteration code can be refactored into a common module for all benchmarks. |