Suppose I'm writing a dating website, similar to OkCupid. There are profiles, and I need to compute the O(N^2) "match" table: for every pair of profiles, what is the match score between them?

I was thinking this could be done by creating a spout that listens on a "new/updated profiles" queue (say Kafka, it doesn't matter), but then how do I break the matching down to achieve any degree of parallelism?

If I have a single bolt that compares an incoming profile against the entire DB, that won't scale.

If I create another spout for "all profiles", it will run in a continuous loop and never stop (?).

Obviously the assumption is that the churn rate (the rate of new/updated profiles) is less of an issue than the sheer size of the database.

Any suggestions on how to design the topology would be very welcome.
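One way to get parallelism here (a sketch of the general idea, not actual Storm code) is to partition the profile database into K shards, run one comparison worker per shard, and broadcast each new/updated profile to all of them; each worker then compares the update only against its own slice. In Storm terms this would be a spout feeding a bolt with K parallel instances via an all-grouping, where each bolt instance holds or queries one shard. The shard count, the `match_score` function, and the in-memory "DB" below are all illustrative assumptions:

```python
# Sketch: broadcast an updated profile to K sharded comparers,
# each of which scores it only against its own slice of the DB.
from concurrent.futures import ThreadPoolExecutor

NUM_SHARDS = 4  # assumption: degree of parallelism


def match_score(a, b):
    """Placeholder scoring function: fraction of shared interests."""
    shared = a["interests"] & b["interests"]
    total = a["interests"] | b["interests"]
    return len(shared) / len(total) if total else 0.0


def shard_of(profile_id):
    """Assign a profile to a shard by hashing its id."""
    return hash(profile_id) % NUM_SHARDS


def compare_against_shard(updated, shard_profiles):
    """What one worker (bolt instance) does: score vs. its shard only."""
    return {p["id"]: match_score(updated, p)
            for p in shard_profiles if p["id"] != updated["id"]}


def on_profile_update(updated, all_profiles):
    """Broadcast the update to every shard; merge the partial results."""
    shards = [[] for _ in range(NUM_SHARDS)]
    for p in all_profiles:
        shards[shard_of(p["id"])].append(p)
    with ThreadPoolExecutor(max_workers=NUM_SHARDS) as pool:
        partials = pool.map(lambda s: compare_against_shard(updated, s),
                            shards)
    merged = {}
    for part in partials:
        merged.update(part)
    return merged
```

The point of the design is that each worker touches only ~N/K profiles per update, so you scale the per-update latency down by raising K, while the total work per update stays O(N), matching the assumption that churn is low relative to DB size.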