
Tuesday, October 19, 2010

Cloud era - Hello, Pro Hadoop

1. MapReduce Model


2. Web Crawler Samples
  1. Ingest the URLs and their associated metadata.
  2. Normalize the URLs.
  3. Eliminate duplicate URLs.
  4. Filter the URLs against a set of exclusion and inclusion filters.
  5. Filter the URLs against a do-not-fetch list.
  6. Filter the URLs against a recently-seen set.
  7. Fetch the URLs.
  8. Fingerprint the content items.
  9. Update the recently seen set.
  10. Prepare the work list for the next application.
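
Steps 2 through 6 above can be sketched on a single machine as follows. This is only a rough illustration, not the Hadoop version: the class `CrawlFilterSketch`, its methods, and the sample URLs are all hypothetical, and in a real job each step would run as a map or reduce phase over data in the distributed file system.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.Predicate;

// Toy single-machine sketch of steps 2-6; every name here is hypothetical.
class CrawlFilterSketch {

    // Step 2: lower-case scheme and host, resolve "." and "..", drop the fragment.
    // Returns null for input that is not an absolute URL with a host.
    static String normalize(String raw) {
        try {
            URI u = new URI(raw.trim()).normalize();
            if (u.getScheme() == null || u.getHost() == null) {
                return null;
            }
            String path = (u.getPath() == null || u.getPath().isEmpty()) ? "/" : u.getPath();
            // Rebuild without user info or fragment; keep the port and query string.
            return new URI(u.getScheme().toLowerCase(Locale.ROOT), null,
                    u.getHost().toLowerCase(Locale.ROOT), u.getPort(),
                    path, u.getQuery(), null).toString();
        } catch (URISyntaxException e) {
            return null;
        }
    }

    static List<String> filter(List<String> rawUrls,
                               Predicate<String> include,
                               Predicate<String> exclude,
                               Set<String> doNotFetch,
                               Set<String> recentlySeen) {
        // Step 3: a set eliminates duplicates; LinkedHashSet keeps first-seen order.
        Set<String> unique = new LinkedHashSet<>();
        for (String raw : rawUrls) {
            String url = normalize(raw);
            if (url != null) {
                unique.add(url);
            }
        }
        List<String> workList = new ArrayList<>();
        for (String url : unique) {
            if (!include.test(url) || exclude.test(url)) continue; // step 4
            if (doNotFetch.contains(url)) continue;                // step 5
            if (recentlySeen.contains(url)) continue;              // step 6
            workList.add(url);
        }
        return workList; // the URLs that step 7 would go on to fetch
    }

    public static void main(String[] args) {
        List<String> urls = List.of(
                "HTTP://Example.COM/a/../b#frag",
                "http://example.com/b",
                "http://example.com/private/x");
        System.out.println(filter(urls,
                u -> u.startsWith("http://example.com/"),
                u -> u.contains("/private/"),
                Set.of(), Set.of()));
    }
}
```

The in-memory sets stand in for the do-not-fetch list and the recently-seen set; at crawl scale these would be data sets of their own, joined against the URL list rather than held in memory.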
3. Web Crawler Samples in the Hadoop Architecture


4. What the MapReduce Framework Needs to Know
  • The location(s) in the distributed file system of the job input
  • The location(s) in the distributed file system for the job output
  • The input format
  • The output format
  • The class containing the map function
  • Optionally, the class containing the reduce function
  • The JAR file(s) containing the map and reduce functions and any support classes
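
To make the list above concrete, here is a minimal sketch of a job description that holds those seven items and reports which required ones are still unset. `JobSpecSketch` and its fields are hypothetical and are not Hadoop's API. In Hadoop's classic `org.apache.hadoop.mapred` API the same information is supplied through a `JobConf`, roughly via `FileInputFormat.setInputPaths`, `FileOutputFormat.setOutputPath`, `setInputFormat`, `setOutputFormat`, `setMapperClass`, `setReducerClass`, and `setJarByClass`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical holder for the seven items; a stand-in, not Hadoop's JobConf.
class JobSpecSketch {
    List<String> inputPaths = new ArrayList<>(); // input location(s) in the DFS
    String outputPath;                           // output location in the DFS
    String inputFormat;                          // how input is split into records
    String outputFormat;                         // how output records are written
    String mapperClass;                          // class containing the map function
    String reducerClass;                         // optional; null means no reduce function
    List<String> jarFiles = new ArrayList<>();   // JAR(s) with map/reduce and support classes

    // Returns the names of the required items that have not been set yet.
    List<String> missing() {
        List<String> m = new ArrayList<>();
        if (inputPaths.isEmpty()) m.add("input path");
        if (outputPath == null) m.add("output path");
        if (inputFormat == null) m.add("input format");
        if (outputFormat == null) m.add("output format");
        if (mapperClass == null) m.add("mapper class");
        if (jarFiles.isEmpty()) m.add("jar file");
        return m; // reducerClass is deliberately not checked: it is optional
    }

    public static void main(String[] args) {
        JobSpecSketch job = new JobSpecSketch();
        job.inputPaths.add("/user/demo/input"); // illustrative path only
        job.outputPath = "/user/demo/output";   // illustrative path only
        System.out.println("still missing: " + job.missing());
    }
}
```

The reduce class is the only item left out of the check, which mirrors the "optionally" in the list.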
5. Hadoop Examples
