Aug 29

God bless us everyone
We’re a broken people living under loaded gun
And it can’t be outfought
It can’t be outdone
It can’t be outmatched
It can’t be outrun
No

God bless us everyone
We’re a broken people living under loaded gun
And it can’t be outfought
It can’t be outdone
It can’t be outmatched
It can’t be outrun
No

And when I close my eyes tonight
To symphonies of blinding light
(God bless us everyone
We’re a broken people living under loaded gun
Oh)
Like memories in cold decay
Transmissions echoing away
Far from the world of you and I
Where oceans bleed into the sky

God save us everyone,
Will we burn inside the fires of a thousand suns?
For the sins of our hand
The sins of our tongue
The sins of our father
The sins of our young
No

God save us everyone,
Will we burn inside the fires of a thousand suns?
For the sins of our hand
The sins of our tongue
The sins of our father
The sins of our young
No

And when I close my eyes tonight
To symphonies of blinding light
(God save us everyone,
Will we burn inside the fires of a thousand suns?
Oh)
Like memories in cold decay
Transmissions echoing away
Far from the world of you and I
Where oceans bleed into the sky

Like memories in cold decay
Transmissions echoing away
Far from the world of you and I
Where oceans bleed into the sky

Lift me up
Let me go (x10)

God bless us everyone
We’re a broken people living under loaded gun
And it can’t be outfought
Can’t be outdone
It can’t be outmatched
It can’t be outrun
No

God bless us everyone
We’re a broken people living under loaded gun
And it can’t be outfought
Can’t be outdone
It can’t be outmatched
It can’t be outrun

Aug 25

Aug 06
  • A9.com – Amazon
    • We build Amazon’s product search indices using the streaming API and pre-existing C++, Perl, and Python tools.
    • We process millions of sessions daily for analytics, using both the Java and streaming APIs.
    • Our clusters vary from 1 to 100 nodes.
  • Accela Communications
    • We use a Hadoop cluster to rollup registration and view data each night.
    • Our cluster has 10 1U servers, with 4 cores, 4GB ram and 3 drives
    • Each night, we run 112 Hadoop jobs
    • It is roughly 4X faster to export the transaction tables from each of our reporting databases, transfer the data to the cluster, perform the rollups, then import back into the databases than to perform the same rollups in the database.
  • Adobe
    • We use Hadoop and HBase in several areas from social services to structured data storage and processing for internal use.
    • We currently have about 30 nodes running HDFS, Hadoop and HBase in clusters ranging from 5 to 14 nodes on both production and development. We plan a deployment on an 80-node cluster.
    • We constantly write data to HBase and run MapReduce jobs to process it, then store it back to HBase or external systems.
    • Our production cluster has been running since Oct 2008.
  • Able Grape – Vertical search engine for trustworthy wine information
    • We have one of the world’s smaller hadoop clusters (2 nodes @ 8 CPUs/node)
    • Hadoop and Nutch used to analyze and index textual information
  • Adknowledge – Ad network
    • Hadoop used to build the recommender system for behavioral targeting, plus other clickstream analytics
    • We handle 500MM clickstream events per day
    • Our clusters vary from 50 to 200 nodes, mostly on EC2.
    • Investigating use of R clusters atop Hadoop for statistical analysis and modeling at scale.
  • Alibaba
    • A 15-node cluster dedicated to processing all sorts of business data dumped out of databases and joining them together. The data is then fed into iSearch, our vertical search engine.
    • Each node has 8 cores, 16G RAM and 1.4T storage.
  • Amazon Web Services
    • We provide Amazon Elastic MapReduce. It’s a web service that provides a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).
    • Our customers can instantly provision as much or as little capacity as they like to perform data-intensive tasks for applications such as web indexing, data mining, log file analysis, machine learning, financial analysis, scientific simulation, and bioinformatics research.
  • AOL
    • We use Hadoop for a variety of things, ranging from ETL-style processing and statistics generation to running advanced algorithms for behavioral analysis and targeting.
    • Our cluster size is 50 machines, Intel Xeon, dual processors, dual core, each with 16GB Ram and 800 GB hard-disk giving us a total of 37 TB HDFS capacity.
  • Atbrox
    • We use hadoop for information extraction & search, and data analysis consulting
    • Cluster: we primarily use Amazon’s Elastic Mapreduce
  • BabaCar
    • 4 nodes cluster (32 cores, 1TB).
    • We use Hadoop for searching and analysis of millions of rental bookings.
  • backdocsearch.com – search engine for chiropractic information, local chiropractors, products and schools
  • Baidu – the leading Chinese language search engine
    • Hadoop is used to analyze search logs and do mining work on the web page database
    • We handle about 3000TB per week
    • Our clusters vary from 10 to 500 nodes
    • Hypertable is also supported by Baidu
  • Beebler
    • 14 node cluster (each node has: 2 dual core CPUs, 2TB storage, 8GB RAM)
    • We use hadoop for matching dating profiles
  • Benipal Technologies – Outsourcing, Consulting, Innovation
    • 35 Node Cluster (Core2Quad Q9400 Processor, 4-8 GB RAM, 500 GB HDD)
    • Largest Data Node with Xeon E5420*2 Processors, 64GB RAM, 3.5 TB HDD
    • Total Cluster capacity of around 20 TB on a gigabit network with failover and redundancy
    • Hadoop is used for internal data crunching, application development, testing and getting around I/O limitations
  • Bixo Labs – Elastic web mining
    • The Bixolabs elastic web mining platform uses Hadoop + Cascading to quickly build scalable web mining applications.
    • We’re doing a 200M page/5TB crawl as part of the public terabyte dataset project.
    • This runs as a 20 machine Elastic MapReduce cluster.
  • BrainPad – Data mining and analysis
    • We use Hadoop to summarize users’ tracking data.
    • We also use it for analysis.
  • Cascading – Cascading is a feature rich API for defining and executing complex and fault tolerant data processing workflows on a Hadoop cluster.
  • Cloudera, Inc – Cloudera provides commercial support and professional training for Hadoop.
  • Contextweb – ADSDAQ Ad Exchange
    • We use Hadoop to store ad serving log and use it as a source for Ad optimizations/Analytics/reporting/machine learning.
    • Currently we have a 23 machine cluster with 184 cores and about 35TB raw storage. Each (commodity) node has 8 cores, 8GB RAM and 1.7 TB of storage.
  • Cooliris – Cooliris transforms your browser into a lightning fast, cinematic way to browse photos and videos, both online and on your hard drive.
    • We have a 15-node Hadoop cluster where each machine has 8 cores, 8 GB ram, and 3-4 TB of storage.
    • We use Hadoop for all of our analytics, and we use Pig to allow PMs and non-engineers the freedom to query the data in an ad-hoc manner.
  • Cornell University Web Lab
    • Generating web graphs on 100 nodes (dual 2.4GHz Xeon Processor, 2 GB RAM, 72GB Hard Drive)
  • Datagraph
    • We use Hadoop for batch-processing large RDF datasets, in particular for indexing RDF data.
    • We also use Hadoop for executing long-running offline SPARQL queries for clients.
    • We use Amazon S3 and Cassandra to store input RDF datasets and output files.
    • We’ve developed RDFgrid, a Ruby framework for map/reduce-based processing of RDF data.
    • We primarily use Ruby, RDF.rb and RDFgrid to process RDF data with Hadoop Streaming.
    • We primarily run Hadoop jobs on Amazon Elastic MapReduce, with cluster sizes of 1 to 20 nodes depending on the size of the dataset (hundreds of millions to billions of RDF statements).
  • Datameer
    • Datameer Analytics Solution (DAS) is the first Hadoop-based solution for big data analytics that includes data source integration, storage, an analytics engine and visualization.
    • DAS Log File Aggregator is a plug-in to DAS that makes it easy to import large numbers of log files stored on disparate servers.
  • Deepdyve
    • Elastic cluster with 5-80 nodes
    • We use hadoop to create our indexes of deep web content and to provide a high availability and high bandwidth storage service for index shards for our search cluster.
  • Detikcom – Indonesia’s largest news portal
    • We use Hadoop, Pig and HBase to analyze search logs, generate the Most Viewed News list, generate top word clouds, and analyze all of our logs
    • Currently we use 9 nodes
  • DropFire
    • We generate Pig Latin scripts that describe structural and semantic conversions between data contexts
    • We use Hadoop to execute these scripts for production-level deployments
    • Eliminates the need for explicit data and schema mappings during database integration
  • EBay
    • 532 nodes cluster (8 * 532 cores, 5.3PB).
    • Heavy usage of Java MapReduce, Pig, Hive, HBase
    • Using it for Search optimization and Research.
  • Enormo
    • 4 nodes cluster (32 cores, 1TB).
    • We use Hadoop to filter and index our listings, removing exact duplicates and grouping similar ones.
    • We plan to use Pig very shortly to produce statistics.
  • ESPOL University (Escuela Superior Politécnica del Litoral) in Guayaquil, Ecuador
    • 4 nodes proof-of-concept cluster.
    • We use Hadoop in a Data-Intensive Computing capstone course. The course projects cover topics like information retrieval, machine learning, social network analysis, business intelligence, and network security.
    • The students use on-demand clusters launched using Amazon’s EC2 and EMR services, thanks to its AWS in Education program.
  • ETH Zurich Systems Group
    • We are using Hadoop in a course that we are currently teaching: “Massively Parallel Data Analysis with MapReduce“. The course projects are based on real use-cases from biological data analysis.
    • Cluster hardware: 16 x (Quad-core Intel Xeon, 8GB RAM, 1.5 TB Hard-Disk)
  • Eyealike – Visual Media Search Platform
    • Facial similarity and recognition across large datasets.
    • Image content based advertising and auto-tagging for social media.
    • Image based video copyright protection.
  • Facebook
    • We use Hadoop to store copies of internal log and dimension data sources and use it as a source for reporting/analytics and machine learning.
    • Currently we have 2 major clusters:
      • A 1100-machine cluster with 8800 cores and about 12 PB raw storage.
      • A 300-machine cluster with 2400 cores and about 3 PB raw storage.
      • Each (commodity) node has 8 cores and 12 TB of storage.
    • We are heavy users of both the streaming and the Java APIs. Using these features, we have built a higher-level data warehousing framework called Hive (see http://hadoop.apache.org/hive/). We have also developed a FUSE implementation over HDFS.
  • FOX Audience Network
    • 40 machine cluster (8 cores/machine, 2TB/machine storage)
    • 70 machine cluster (8 cores/machine, 3TB/machine storage)
    • 30 machine cluster (8 cores/machine, 4TB/machine storage)
    • Use for log analysis, data mining and machine learning
  • Forward3D
    • 5 machine cluster (8 cores/machine, 5TB/machine storage)
    • Existing 19 virtual machine cluster (2 cores/machine 30TB storage)
    • Predominantly Hive and Streaming API based jobs (~20,000 jobs a week) using our Ruby library, or see the canonical WordCount example.
    • Daily batch ETL with a slightly modified clojure-hadoop
    • Log analysis
    • Data mining
    • Machine learning
  • Freestylers – Image retrieval engine
    • We, the Japanese company Freestylers, have used Hadoop since April 2009 to build the image processing environment for our image-based product recommendation system, mainly on Amazon EC2.
    • Our Hadoop environment produces the original database for fast access from our web application.
    • We also use Hadoop to analyze similarities in users’ behavior.
  • Google
  • Gruter. Corp.
    • 30 machine cluster (4 cores, 1TB~2TB/machine storage)
    • storage for blog data and web documents
    • used for data indexing by MapReduce
    • link analyzing and Machine Learning by MapReduce
  • GumGum
  • Hadoop Korean User Group, a Korean Local Community Team Page.
    • 50 node cluster In the Korea university network environment.
      • Pentium 4 PC, HDFS 4TB Storage
    • Used for development projects
      • Retrieving and Analyzing Biomedical Knowledge
      • Latent Semantic Analysis, Collaborative Filtering
  • Hulu
    • 13 machine cluster (8 cores/machine, 4TB/machine)
    • Log storage and analysis
    • Hbase hosting
  • Hadoop Taiwan User Group
  • Hipotecas y euribor
    • Evolución del euribor y valor actual
    • Simulador de hipotecas en crisis económica
  • Hosting Habitat
    • We use a customised version of Hadoop and Nutch in a currently experimental 6 node/Dual Core cluster environment.
    • We crawl our clients’ websites and, from the information we gather, fingerprint old and outdated software packages in that shared hosting environment. After matching a signature against a database, we can inform our clients that they are running outdated software. With that information we know which sites require patching, which we offer as a free courtesy service to protect the majority of users. Without Nutch and Hadoop this would be far harder to accomplish.
  • IBM
  • ICCS
    • We are using Hadoop and Nutch to crawl Blog posts and later process them. Hadoop is also beginning to be used in our teaching and general research activities on natural language processing and machine learning.
  • IIIT, Hyderabad
    • We use hadoop for Information Retrieval and Extraction research projects. Also working on map-reduce scheduling research for multi-job environments.
    • Our cluster sizes vary from 10 to 30 nodes, depending on the jobs. Heterogenous nodes with most being Quad 6600s, 4GB RAM and 1TB disk per node. Also some nodes with dual core and single core configurations.
  • ImageShack
    • From TechCrunch:
      • Rather than put ads in or around the images it hosts, Levin is working on harnessing all the data his service generates about content consumption (perhaps to better target advertising on ImageShack or to syndicate that targeting data to ad networks). Like Google and Yahoo, he is deploying the open-source Hadoop software to create a massive distributed supercomputer, but he is using it to analyze all the data he is collecting.
  • Information Sciences Institute (ISI)
  • Infochimps
    • 30 node AWS EC2 cluster (varying instance size, currently EBS-backed) managed by Chef & Poolparty running Hadoop 0.20.2+228, Pig 0.5.0+30, Azkaban 0.04, Wukong
    • Used for ETL & data analysis on terascale datasets, especially social network data (on api.infochimps.com)
  • Iterend
    • using 10 node hdfs cluster to store and process retrieved data.
  • Joost
    • Session analysis and report generation
  • Journey Dynamics
    • Using Hadoop MapReduce to analyse billions of lines of GPS data to create TrafficSpeeds, our accurate traffic speed forecast product.
  • Karmasphere
    • Distributes Karmasphere Studio for Hadoop, which allows cross-version development and management of Hadoop jobs in a familiar integrated development environment.
  • Katta – Katta serves large Lucene indexes in a grid environment.
  • Koubei.com – Large local community and local search in China.
    • Using Hadoop to process Apache logs, analyzing users’ actions and click flow, the links clicked on any specified page of the site, and more. Also using Hadoop map/reduce to process all the price data that users input.
  • Krugle
    • Source code search engine uses Hadoop and Nutch.
  • Last.fm
    • 44 nodes
    • Dual quad-core Xeon L5520 (Nehalem) @ 2.27GHz, 16GB RAM, 4TB/node storage.
    • Used for charts calculation, log analysis, A/B testing
  • Lineberger Comprehensive Cancer Center – Bioinformatics Group This is the cancer center at UNC Chapel Hill. We are using Hadoop/HBase for databasing and analyzing Next Generation Sequencing (NGS) data produced for the Cancer Genome Atlas (TCGA) project and other groups. This development is based on the SeqWare open source project which includes SeqWare Query Engine, a database and web service built on top of HBase that stores sequence data types. Our prototype cluster includes:
    • 8 dual quad core nodes running CentOS
    • total of 48TB of HDFS storage
    • HBase & Hadoop version 0.20
  • LinkedIn
    • 2×50 Nehalem-based node grids, with 2×4 cores, 24GB RAM, 8x1TB storage using ZFS in a JBOD configuration.
    • We use Hadoop and Pig for discovering People You May Know and other fun facts.
  • Lookery
    • We use Hadoop to process clickstream and demographic data in order to create web analytic reports.
    • Our cluster runs across Amazon’s EC2 webservice and makes use of the streaming module to use Python for most operations.
  • Lotame
    • Using Hadoop and Hbase for storage, log analysis, and pattern discovery/analysis.
  • Markt24
    • We use Hadoop to filter user behaviour, recommendations and trends from externals sites
    • Using zkpython
    • Used EC2, now using many small machines (8GB RAM, 4 cores, 1TB)
  • MicroCode
    • 18 node cluster (Quad-Core Intel Xeon, 1TB/node storage)
    • Financial data for search and aggregation
    • Customer Relation Management data for search and aggregation
  • Media 6 Degrees
    • 20 node cluster (dual quad cores, 16GB, 6TB)
    • Used for log processing, data analysis and machine learning.
    • Focus is on social graph analysis and ad optimization.
    • Use a mix of Java, Pig and Hive.
  • MobileAnalytic.TV
    • We use Hadoop to develop MapReduce algorithms:
      • Information retrieval and analytics
      • Machine generated content – documents, text, audio, & video
      • Natural Language Processing
    • Project portfolio includes:
      • Natural Language Processing
      • Mobile Social Network Hacking
      • Web Crawlers/Page scraping
      • Text to Speech
      • Machine generated Audio & Video with remuxing
      • Automatic PDF creation & IR
    • 2 node cluster (Windows Vista/CYGWIN, & CentOS) for developing MapReduce programs.
  • MyLife
    • 18 node cluster (Quad-Core AMD Opteron 2347, 1TB/node storage)
    • Powers data for search and aggregation
  • Mahout
    • Another Apache project using Hadoop to build scalable machine learning algorithms like canopy clustering, k-means and many more to come (naive bayes classifiers, others)
  • MetrixCloud – provides commercial support, installation, and hosting of Hadoop Clusters.
  • Neptune
    • Another Bigtable cloning project using Hadoop to store large structured data set.
    • 200 nodes(each node has: 2 dual core CPUs, 2TB storage, 4GB RAM)
  • NetSeer
    • Up to 1000 instances on Amazon EC2
    • Data storage in Amazon S3
    • 50 node cluster in Coloc
    • Used for crawling, processing, serving and log analysis
  • The New York Times
  • Ning
    • We use Hadoop to store and process our log files
    • We rely on Apache Pig for reporting, analytics, Cascading for machine learning, and on a proprietary JavaScript API for ad-hoc queries
    • We use commodity hardware, with 8 cores and 16 GB of RAM per machine
  • Nutch – flexible web search engine software
  • PARC – Used Hadoop to analyze Wikipedia conflicts paper.
  • Pentaho – Open Source Business Intelligence
    • Pentaho provides the only complete, end-to-end open source BI alternative to proprietary offerings like Oracle, SAP and IBM
    • We provide an easy-to-use, graphical ETL tool that is integrated with Hadoop for managing data and coordinating Hadoop related tasks in the broader context of your ETL and Business Intelligence workflow
    • We also provide Reporting and Analysis capabilities against big data in Hadoop
    • Learn more at http://www.pentaho.com/hadoop
  • Pharm2Phork Project – Agricultural Traceability
    • Using Hadoop on EC2 to process observation messages generated by RFID/Barcode readers as items move through supply chain.
    • Analysis of BPEL generated log files for monitoring and tuning of workflow processes.
  • Powerset / Microsoft – Natural Language Search
  • Pressflip – Personalized Persistent Search
    • Using Hadoop on EC2 to process documents from a continuous web crawl and distributed training of support vector machines
    • Using HDFS for large archival data storage
  • Pronux
    • 4 nodes cluster (32 cores, 1TB).
    • We use Hadoop for searching and analysis of millions of bookkeeping postings
    • Also used as a proof of concept cluster for a cloud based ERP system
  • PSG Tech, Coimbatore, India
    • Multiple alignment of protein sequences helps to determine evolutionary linkages and to predict molecular structures. The dynamic nature of the algorithm coupled with data and compute parallelism of hadoop data grids improves the accuracy and speed of sequence alignment. Parallelism at the sequence and block level reduces the time complexity of MSA problems. Scalable nature of Hadoop makes it apt to solve large scale alignment problems.
    • Our cluster size varies from 5 to 10 nodes. Cluster nodes vary from 2950 Quad Core Rack Server, with 2x6MB Cache and 4 x 500 GB SATA Hard Drive to E7200 / E7400 processors with 4 GB RAM and 160 GB HDD.
  • Quantcast
    • 3000 cores, 3500TB. 1PB+ processing each day.
    • Hadoop scheduler with fully custom data path / sorter
    • Significant contributions to KFS filesystem
  • Rackspace
    • 30 node cluster (Dual-Core, 4-8GB RAM, 1.5TB/node storage)
  • Rakuten – Japan’s online shopping mall
    • 13 node cluster
    • We use Hadoop to analyze logs and mine data for recommender system and so on.
  • Rapleaf
    • 80 node cluster (each node has: 2 quad core CPUs, 4TB storage, 16GB RAM)
    • We use hadoop to process data relating to people on the web
    • We are also involved with Cascading to help simplify how our data flows through various processing stages
  • Redpoll
    • Hardware: 35 nodes (2*4cpu 10TB disk 16GB RAM each)
    • We intend to parallelize some traditional classification and clustering algorithms, such as Naive Bayes, K-Means and EM, so that they can deal with large-scale data sets.
  • Search Wikia
    • A project to help develop open source social search tools. We run a 125 node hadoop cluster.
  • SEDNS – Security Enhanced DNS Group
    • We are gathering worldwide DNS data in order to discover content distribution networks and configuration issues, utilizing Hadoop DFS and MapRed.
  • SLC Security Services LLC
    • 18 node cluster (each node has: 4 dual core CPUs, 1TB storage, 4GB RAM, RedHat OS)
    • We use Hadoop for our high speed data mining applications
  • Socialmedia.com
    • 14 node cluster (each node has: 2 dual core CPUs, 2TB storage, 8GB RAM)
    • We use hadoop to process log data and perform on-demand analytics
  • Spadac.com
    • We are developing the MrGeo (Map/Reduce Geospatial) application to allow our users to bring cloud computing to geospatial processing.
    • We use HDFS and MapReduce to store, process, and index geospatial imagery and vector data.
    • MrGeo is soon to be open sourced as well.
  • Stampede Data Solutions (Stampedehost.com)
    • Hosted Hadoop data warehouse solution provider
  • Taragana – Web 2.0 Product development and outsourcing services
    • We are using 16 consumer grade computers to create the cluster, connected by 100 Mbps network.
    • Used for testing ideas for blog and other data mining.
  • The Lydia News Analysis Project – Stony Brook University
    • We are using Hadoop on 17-node and 103-node clusters of dual-core nodes to process and extract statistics from over 1000 U.S. daily newspapers as well as historical archives of the New York Times and other sources.
  • Tailsweep – Ad network for blogs and social media
    • 8 node cluster (Xeon Quad Core 2.4GHz, 8GB RAM, 500GB/node Raid 1 storage)
    • Used as a proof of concept cluster
    • Handling i.e. data mining and blog crawling
  • Technical analysis and Stock Research
    • Generating stock analysis on 23 nodes (dual 2.4GHz Xeon, 2 GB RAM, 36GB Hard Drive)
  • Telefonica Research
    • We use Hadoop in our data mining and user modeling, multimedia, and internet research groups.
    • 6 node cluster with 96 total cores, 8GB RAM and 2 TB storage per machine.
  • Twitter
    • We use Hadoop to store and process tweets, log files, and many other types of data generated across Twitter. We use Cloudera’s CDH2 distribution of Hadoop, and store all data as compressed LZO files.
    • We use both Scala and Java to access Hadoop’s MapReduce APIs
    • We use Pig heavily for both scheduled and ad-hoc jobs, due to its ability to accomplish a lot with few statements.
    • We employ committers on Pig, Avro, Hive, and Cassandra, and contribute much of our internal Hadoop work to opensource (see hadoop-lzo)
    • For more on our use of hadoop, see the following presentations: Hadoop and Pig at Twitter and Protocol Buffers and Hadoop at Twitter
  • Tynt
    • We use Hadoop to assemble web publishers’ summaries of what users are copying from their websites, and to analyze user engagement on the web.
    • We use Pig and custom Java map-reduce code, as well as chukwa.
    • We have 94 nodes (752 cores) in our clusters, as of July 2010, but the number grows regularly.
  • University of Glasgow – Terrier Team
    • 30 nodes cluster (Xeon Quad Core 2.4GHz, 4GB RAM, 1TB/node storage).
    • We use Hadoop to facilitate information retrieval research & experimentation, particularly for TREC, using the Terrier IR platform. The open source release of Terrier includes large-scale distributed indexing using Hadoop Map Reduce.
  • University of Maryland
    • We are one of six universities participating in IBM/Google’s academic cloud computing initiative. Ongoing research and teaching efforts include projects in machine translation, language modeling, bioinformatics, email analysis, and image processing.
  • University of Nebraska Lincoln, Research Computing Facility
    • We currently run one medium-sized Hadoop cluster (200TB) to store and serve up physics data for the computing portion of the Compact Muon Solenoid (CMS) experiment. This requires a filesystem which can download data at multiple Gbps and process data at an even higher rate locally. Additionally, several of our students are involved in research projects on Hadoop.
  • Veoh
    • We use a small Hadoop cluster to reduce usage data for internal metrics, for search indexing and for recommendation data.
  • Visible Measures Corporation uses Hadoop as a component in our Scalable Data Pipeline, which ultimately powers VisibleSuite and other products. We use Hadoop to aggregate, store, and analyze data related to in-stream viewing behavior of Internet video audiences. Our current grid contains more than 128 CPU cores and in excess of 100 terabytes of storage, and we plan to grow that substantially during 2008.
  • VK Solutions
    • We use a small Hadoop cluster in the scope of our general research activities at VK Labs to get a faster data access from web applications.
    • We also use Hadoop for filtering and indexing listing, processing log analysis, and for recommendation data.
  • WorldLingo
    • Hardware: 44 servers (each server has: 2 dual core CPUs, 2TB storage, 8GB RAM)
    • Each server runs Xen with one Hadoop/HBase instance and another instance with web or application servers, giving us 88 usable virtual machines.
    • We run two separate Hadoop/HBase clusters with 22 nodes each.
    • Hadoop is primarily used to run HBase and Map/Reduce jobs scanning over the HBase tables to perform specific tasks.
    • HBase is used as a scalable and fast storage back end for millions of documents.
    • Currently we store 12 million documents with a target of 450 million in the near future.
  • Yahoo!
    • More than 100,000 CPUs in >36,000 computers running Hadoop
    • Our biggest cluster: 4000 nodes (2*4cpu boxes w 4*1TB disk & 16GB RAM)
      • Used to support research for Ad Systems and Web Search
      • Also used to do scaling tests to support development of Hadoop on larger clusters
    • Our Blog – Learn more about how we use Hadoop.
    • >60% of Hadoop Jobs within Yahoo are Pig jobs.
  • Zvents
    • 10 node cluster (Dual-Core AMD Opteron 2210, 4GB RAM, 1TB/node storage)
    • Run Naive Bayes classifiers in parallel over crawl data to discover event information

In that whole long list, Google is the strangest entry... it uses Hadoop for teaching...

Jun 29

Today I received a developer invitation for Google Storage:

Thanks for your interest in Google Storage for Developers.
Here is the invite link you requested:

####################################

Please note that this invitation is not transferable. In addition, Google storage is available for US developers only at this time.

During the preview period, you will receive up to 100GB of data storage and 300GB monthly bandwidth at no charge. To learn more about Google Storage, please visit our website for Developer’s Guide, API References, and pricing after the preview period (http://code.google.com/apis/storage/docs/overview.html).

We would love to hear your feedback at gs-discussion@googlegroups.com.

Thanks,
Google Storage for Developers Team

——

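Google Storage is more than an online drive; it is really an early form of cloud storage. It offers 100GB of space and 300GB of monthly traffic. In a word: generous.

I skimmed the API documentation. It provides detailed interfaces for uploading, downloading, and management, which means you can use these APIs in your own development.

The most astonishing part is this: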

A CNAME redirect is a special DNS record that lets you use a URL from your own domain to access a resource (bucket and object) in Google Storage without revealing the Google Storage URI. To do this, you must use the following URI in the host name portion of your CNAME record:

c.commondatastorage.googleapis.com

For example, let’s assume your domain is example.com and you want to make travel maps available to your customers. You could create a bucket in Google Storage called travel-maps.example.com, and then create a CNAME record in DNS that redirects requests from travel-maps.example.com to the Google Storage URI. To do this, you publish the following CNAME record in DNS:

travel-maps.example.com CNAME c.commondatastorage.googleapis.com.

By doing this, your customers can use the following URL to access a map of Paris:

http://travel-maps.example.com/paris.jpg

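You can access files on Storage through your own domain. Google provides all the space and bandwidth, yet everyone else believes the files are being downloaded from your own site.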
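
The CNAME mapping quoted above can be sketched in a few lines of Python. This is only an illustration: the helper names `cname_record` and `public_url` are hypothetical and not part of any Google API; only the CNAME target and the example bucket and object come from the quoted documentation.

```python
# CNAME target taken from the quoted Google Storage documentation.
GS_CNAME_TARGET = "c.commondatastorage.googleapis.com"


def cname_record(bucket: str) -> str:
    """Build the DNS record line for a bucket named after a host in your domain.

    Hypothetical helper: it only formats text and does not touch DNS.
    """
    return f"{bucket} CNAME {GS_CNAME_TARGET}."


def public_url(bucket: str, object_name: str) -> str:
    """Build the URL visitors would use once the CNAME record is published."""
    return f"http://{bucket}/{object_name}"


print(cname_record("travel-maps.example.com"))
print(public_url("travel-maps.example.com", "paris.jpg"))
```

The point of the design is that the bucket name doubles as the host name, so the visitor-facing URL never mentions Google Storage.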

Price list:

Google Storage for Developers pricing is based on usage.

  • Storage—$0.17/gigabyte/month
  • Network
    • Upload data to Google
      • $0.10/gigabyte
    • Download data from Google
      • $0.15/gigabyte for Americas and EMEA
      • $0.30/gigabyte for Asia-Pacific
  • Requests
    • PUT, POST, LIST—$0.01 per 1,000 requests
    • GET, HEAD—$0.01 per 10,000 requests
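
To see what these rates add up to, here is a rough monthly cost estimator in Python. It is a sketch under stated assumptions: the prices are copied from the list above, but the function name, the region keys, and the decision to ignore the free preview quota are all my own choices for illustration.

```python
# Prices copied from the quoted price list (USD); everything else is illustrative.
STORAGE_PER_GB_MONTH = 0.17
UPLOAD_PER_GB = 0.10
DOWNLOAD_PER_GB = {"americas_emea": 0.15, "asia_pacific": 0.30}  # hypothetical region keys
WRITE_PRICE_PER_1000 = 0.01   # PUT, POST, LIST
READ_PRICE_PER_10000 = 0.01   # GET, HEAD


def monthly_cost(stored_gb, uploaded_gb, downloaded_gb, region,
                 write_requests=0, read_requests=0):
    """Estimate one month's bill in USD, ignoring the preview-period free quota."""
    cost = stored_gb * STORAGE_PER_GB_MONTH
    cost += uploaded_gb * UPLOAD_PER_GB
    cost += downloaded_gb * DOWNLOAD_PER_GB[region]
    cost += write_requests / 1000 * WRITE_PRICE_PER_1000
    cost += read_requests / 10000 * READ_PRICE_PER_10000
    return round(cost, 2)


# Same usage, two download regions: only the download line differs.
for region in DOWNLOAD_PER_GB:
    print(region, monthly_cost(100, 10, 300, region))
```

With 300GB of downloads, the only difference between the two regions is the download charge: 300 × $0.15 = $45 more for Asia-Pacific.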

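Asia-Pacific users get a raw deal: downloads cost twice as much.

I uploaded 400MB of rock music with the official management tool. Upload and download speeds stayed maxed out the whole time, and I am quite satisfied with the result.

I just don't know how long this service will survive in China once it officially launches.

Developer Storage is all served from the single domain https://sandbox.google.com/storage/, so hopefully it will suffer less collateral damage.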

May 06

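Her dream was once to be a kindergarten teacher
What was the opportunity that launched your career?
Sora Aoi: I was discovered by a talent scout while waiting for a friend in Shibuya.
You were still in high school then, right?
Sora Aoi: Yes, third year of high school.
How did you feel when the scout approached you?
Sora Aoi: Ah, come to think of it, I was torn for a while. But I needed money...
Needing money surely didn't mean you were in debt? You were still a high school student.
Sora Aoi: No, it wasn't debt. Once I entered high school, my parents stopped giving me pocket money, so I had to earn money for clothes and going out through part-time jobs.
Ah, it really is tough without pocket money!
Sora Aoi: It is. High school students need money too. I often wanted to go to live shows and concerts. Tickets were expensive, but there were many shows I wanted to hear no matter what.
So what part-time jobs had you done before?
Sora Aoi: All kinds. An izakaya, a sushi restaurant, a pizza shop, and so on.
All in food service, then.
Sora Aoi: Because it saves on meals. (Note: in Japan, part-time work at a restaurant usually includes meals.) For a high school student, that is the key point when choosing a job.
When you were in high school, what did you want to do in the future?
Sora Aoi: I wanted to be a kindergarten teacher. But I never thought about it very seriously.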

那么现在的工作是意料之外的咯?
苍井空: 对啊。直到被星探发现之前,从来都没有想到过。发现之后,也考虑了将近一年。
那么怎么下定决心做这份工作呢?
苍井空: 当时抱着想尝试一下新的事情的想法……但是当时没有想过要拍AV。刚开始工作的时候,只是穿着泳衣或者裸体替杂志做模特而已。
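Taking the job seriously
You had no objection to nude photos either?
Sora Aoi: That's right. When I decided to become a model, I was already prepared to pose nude.
And AV?
Sora Aoi: Shooting for magazines turned out to be more fun than I had imagined the work would be. That was the most important reason. After that I gradually came to feel I could adapt to AV too. I thought, isn't this quite fun? Before that I had thought of the industry as very dark.
No resistance to doing it on camera, either?
Sora Aoi: Of course I resisted. But in the end it is work. It is different from my private life. Still, I put everything I had into it.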
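Once a "romance addict"
What was your impression of your first AV shoot?
Sora Aoi: It was over much faster than I had imagined. That's not to say it was unsatisfying. I had expected it to be very difficult, but it was no problem at all.
Did it feel good?
Sora Aoi: Well, the actors really are skilled. But feeling good in that way is different from feeling good in private.
Are you happy with your figure?
Sora Aoi: I think I'm too thin, and my bust is too large in comparison. It looks fine on camera, but every time I look in the mirror when I bathe I think "Wow! That's awful!"
So what is your ideal figure?
Sora Aoi: Akiho Yoshizawa, of course! She's so beautiful, her skin is great, and at a bit over 160 cm her legs look so long! Every time I see her work I'm green with envy!
Do you have a boyfriend now?
Sora Aoi: In high school I was a "romance addict", the type who always had to be in love. But I haven't had a boyfriend for a long time now. I'm still young and want to push hard in my career, so as for romance I'll leave it to fate!
How do you see your work?
Sora Aoi: I know a lot of people look down on us (AV actresses), but I have always kept my dignity and professionalism and taken my work seriously. On that basis, I do whatever the director tells me to.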

May 03

Jimey Google Reader Shared

Original: http://ria101.wordpress.com/2010/02/24/hbase-vs-cassandra-why-we-moved/
Original author: Dominic Williams
Originally published: February 24, 2010 at 7:27 pm
Translator: Wang Xu (http://wangxu.me/blog/ , @gnawux)
Translated: March 21-25, 2010

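My team has recently been busy with a brand-new product, the soon-to-launch online game www.FightMyMonster.com. This gives us the luxury of building on a brand-new NOSQL database, which means we can leave the horrors of MySQL sharding and expensive scalability behind. Lately a lot of people have been asking why we shifted our attention from HBase to Cassandra. I can confirm that the change is real, and in fact we have already ported most of our code to Cassandra. Here I will explain why.

For readers unfamiliar with NOSQL, later posts will explain why I think we will see a seismic shift from SQL to NOSQL over the next few years, one just as important as the move to cloud computing. Those posts will also try to explain why I think NOSQL may be the right choice for your company. This post only explains why we chose Cassandra as our NOSQL solution.

Disclaimer: if you are looking for a shortcut to decide your own system choice, you must understand that this is not an exhaustive, rigorous comparison. It only outlines the reasoning of another startup team working with limited time and resources.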

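Does Cassandra's lineage predict its future?

One of my favorite maxims for engineers hunting bugs is "breadth first, not depth first". This can be annoying to people deep in the technical details, because it implies the solution would be much simpler if they just looked around (a word of advice: only say it to colleagues who can forgive you). I coined the maxim because I have found that with software problems we save a great deal of time if we force ourselves to look at the high-level considerations before diving into the details of a particular line of code.

So before talking technology, I will apply my maxim to the choice between HBase and Cassandra. The technical conclusions behind our switch might have been predictable: HBase and Cassandra have very different lineage and genes, and I think this affects their suitability for our business.

Strictly speaking, HBase and its supporting systems derive from the famous Google BigTable and Google File System designs (the GFS paper was published in 2003, the BigTable paper in 2006). Cassandra, on the other hand, is a recent open source fork of Facebook's database system. It implements the BigTable data model while storing data with a system architecture based on Amazon's Dynamo (in fact, Cassandra's initial development was done by two Dynamo engineers who moved from Amazon to Facebook).

In my view, these different histories make HBase better suited to data warehousing and large-scale data processing and analysis (such as indexing web pages), while Cassandra is better suited to real-time transaction processing and serving interactive data. A systematic study to prove this point is beyond the scope of this post, but I believe you will keep noticing the difference as you evaluate the two databases.

Note: if you are looking for simple evidence, check the focus of the main committers. Most HBase committers work for Bing (M$ bought their search company last year and, after some months, allowed them to continue contributing open source code). By contrast, the main Cassandra committers come from Rackspace, which backs a freely available, advanced, general-purpose NOSQL solution to compete with the proprietary, locked-in NOSQL systems offered by Google, Yahoo, Amazon EC2 and others.

Malcolm Gladwell would say you could simply choose Cassandra on these differences in background alone. But then it is horses for courses. And of course, making a business choice with your eyes closed is rather difficult...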

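Which NOSQL database has more momentum?

Another reason we moved to Cassandra is the prevailing wind in our community. As you know, in the software platform business the big get bigger: platforms that are widely favored attract more people, so in the long run you get better ecosystem support (that is, most supporting software is available from the community, and there are more developers to hire).

When I started with HBase, my impression was that it had enormous community momentum behind it, but I now believe Cassandra's is stronger. My original impression came partly from two very persuasive and excellent presentations by the CTOs of StumbleUpon and Streamy, two web companies that were important HBase contributors before Cassandra became a viable option, and partly from a quick read of an article titled "HBase vs Cassandra: NoSQL Battle!" (most of whose content has been widely confirmed).

Momentum is hard to prove, and you will have to do your own research, but one important sign I can offer is developer activity on IRC. If you compare the #hbase and #cassandra developer channels on freenode.org, you will find that Cassandra has roughly twice as many developers online at almost any time.

If you spend as much time examining Cassandra as you did considering HBase, you will see that there is very clear accelerating momentum behind it. You may also notice the big names gradually appearing, such as Twitter, which also plans to use Cassandra widely (here).

Note: Cassandra's website looks a lot better than HBase's, but seriously, this may be more than a marketing trend. Let's move on.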

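Down to the technology: CAP and the myth of CA versus AP

There is a very important theorem for distributed systems (we are talking about distributed databases here, as I am sure you noticed). It is called the CAP theorem, and it was proposed by Dr. Eric Brewer, co-founder and Chief Scientist of Inktomi.

The theorem states that a distributed (or shared-data) system can be designed to provide at most two of three important properties: consistency, availability, and tolerance to network partitions. Put simply, consistency means that if one person writes a value to the database, other users can immediately read that value; availability means that if some nodes fail, the distributed system in the cluster can keep working; and partition tolerance means that if the nodes are split into two groups that cannot communicate with each other, the system can still keep working.

Professor Brewer is an eminent man, and many developers, including many in the HBase community, have taken his theorem to heart and applied it in their designs. Indeed, if you search online for comparisons of HBase and Cassandra, you will usually find the HBase community explaining that they chose CP while Cassandra chose AP, and no doubt most developers need some degree of consistency (C).

However, I need to point out that these claims rest on an incomplete inference. The CAP theorem applies only to a single distributed algorithm (and I hope Professor Brewer would agree). It does not say that you cannot design a system in which the underlying algorithm, and therefore the trade-off, is selected per operation. So while it is true that within a system a single operation can provide only two of these properties, the overlooked point is that a system design can let the caller choose which properties they need for a given operation. Moreover, the real world is not simply divided into black and white: each of these properties can be provided to some degree. This is Cassandra.

This point is so important that I will repeat it: the beauty of Cassandra is that you can choose the best trade-off case by case to meet the needs of a particular operation. Cassandra proves you can go beyond the usual reading of the CAP theorem and the world keeps turning.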

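Let's look at two extremes. Suppose I must read a value from the database that requires very high consistency, that is, I must be 100% sure of reading the latest content previously written. In that case I can read from Cassandra with the consistency level set to "ALL", which requires all nodes holding the data to have consistent replicas. Here we have no tolerance for any node failure or network partition. At the other extreme, if I do not particularly care about consistency, or simply want the best performance, I can access the data with consistency level "ONE". In that case, fetching the data from any one node that holds a replica is enough: even if the data has three replicas, it does not matter whether the other two replica nodes have failed or differ. Of course, in that case the data we read may not be the latest.

What's more, you are not forced to live in a black-and-white world. For example, in one particular application of ours, important reads and writes usually use the "QUORUM" consistency level, which means that the replicas on a majority of the nodes holding the data agree. This is only a brief description, so study the details carefully before writing your own Cassandra application. From our point of view, this provides reasonable tolerance to node failure and network partition while still delivering very high consistency. In the general case, we use the "ONE" consistency level mentioned above, which gives the highest performance. That's it.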
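The trade-off between the three levels can be made concrete with a little arithmetic. The sketch below is a simplified model, not Cassandra's client API: the function names are invented for illustration, and it only counts how many replicas must respond. It uses the usual majority rule (more than half of the replicas) for QUORUM, and the standard overlap argument that a read is guaranteed to see the latest write when the read and write replica counts together exceed the replication factor.

```javascript
// Number of replicas that must answer for each consistency level
// (simplified model; function names are hypothetical).
function requiredReplicas(level, replicationFactor) {
  switch (level) {
    case "ONE": return 1;
    case "QUORUM": return Math.floor(replicationFactor / 2) + 1; // strict majority
    case "ALL": return replicationFactor;
    default: throw new Error("unknown consistency level: " + level);
  }
}

// A read overlaps the latest write on at least one replica when R + W > N.
function readSeesLatestWrite(readLevel, writeLevel, replicationFactor) {
  const r = requiredReplicas(readLevel, replicationFactor);
  const w = requiredReplicas(writeLevel, replicationFactor);
  return r + w > replicationFactor;
}

// How many replica nodes can be down while the operation still succeeds.
function toleratedFailures(level, replicationFactor) {
  return replicationFactor - requiredReplicas(level, replicationFactor);
}

for (const level of ["ONE", "QUORUM", "ALL"]) {
  console.log(level, "needs", requiredReplicas(level, 3),
              "of 3 replicas, tolerates", toleratedFailures(level, 3), "failures");
}
console.log(readSeesLatestWrite("QUORUM", "QUORUM", 3)); // true
console.log(readSeesLatestWrite("ONE", "ONE", 3));       // false
```

With three replicas, QUORUM needs two responses and survives one failed node, which is the middle ground described above.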

For us, this is a huge plus for Cassandra. We can not only tune our system easily, we can also design it. For example, when a certain number of nodes fail or a network connection breaks, most of our services can keep working, and only those services that require consistent data fail. HBase is not this flexible. It single-mindedly pursues one aspect of the system (CP), which once again reminds me of the gulf between SQL developers and query optimizers: some things are best moved beyond, HBase!

In our project, then, Cassandra has proven by far the most flexible system, although you may find your brain at first loses consistency when considering your QUORUMs.

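When is monolithic better than modular?

An important difference between Cassandra and HBase is that Cassandra is a single Java process on each node, whereas a complete HBase solution is made up of several parts: the database process itself, which may run in several modes; a properly configured hadoop HDFS distributed file system; and a Zookeeper system to coordinate the different HBase processes. So does this mean HBase has the better modular structure?

Although this architecture may indeed balance the interests of different development teams, when it comes to system administration HBase's modularity cannot be counted as a plus. In fact, especially for small startups, modularity is a big negative.

The underpinnings of HBase are quite complex, and anyone who doubts it should read Google's GFS and BigTable papers. Setting up HBase is hard even in single-node pseudo-distributed mode. In fact, I once laboriously wrote a quick-start tutorial (see here if you want to try HBase). As that guide shows, getting HBase set up and running actually involves setting up two different systems by hand: first hadoop HDFS, and then HBase itself.

Next, the HBase configuration files are a monster in themselves, and your setup may differ greatly from the default network configuration (in the article I covered two different Ubuntu default network setups, as well as EC2's changeable Elastic IPs and internally assigned domain names). When the system misbehaves, you have a large number of logs to look through. All the information about what needs fixing is in the logs, and if you are an experienced administrator you will be able to find and handle the problem.

But what if the problem occurs in production and you have no time to patiently track it down? What if, like us, you have only a small development team with big ambitions and lack the resources to monitor and manage the system 24/7?

Seriously, if you are a senior DB administrator who wants to learn a NoSQL system, choose HBase. The system is super complex, and administrators with skilled hands will surely command high salaries.

But if you are a small team like us, trying hard to find the light at the end of the tunnel, wait until you have heard some more gossip...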

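Gossip wins!

Cassandra is a completely symmetric system. That is, there are no master nodes or anything like HBase's region servers: every node plays exactly the same role. No particular node or other entity acts as coordinator. Instead, the nodes in the cluster coordinate their behavior using a pure P2P communication protocol called "Gossip".

A detailed description of Gossip and of the models that use it is beyond the scope of this post, but the P2P communication model Cassandra uses is well proven. For example, the time for news of a node failure to propagate through the whole system, or for a client application's request to be routed to the node holding the data, is in every case unquestionably very short. I personally believe Cassandra represents one of the most exciting P2P technologies around today, though of course that has nothing to do with your choice of NOSQL database.

So what tangible benefits does this Gossip-based architecture bring to Cassandra users? First, continuing our system administration theme, administration becomes much simpler. For example, adding a new node to the system is as simple as starting a Cassandra process and pointing it at a seed node (a node already known to be in the cluster). When you consider that your distributed cluster might run on hundreds of nodes, adding new nodes this easily is simply incredible. Furthermore, when something goes wrong you do not need to consider which kind of node is at fault: all nodes are the same, which makes debugging a more tractable and repeatable process.

Second, I have come to the conclusion that Cassandra's P2P architecture gives it better performance and availability. In such a system, load can be spread evenly across the nodes to maximize potential parallelism, which lets the system ride through network partitions and node failures more seamlessly, and the symmetry of the nodes prevents the temporary performance jitter seen in HBase when nodes are added or removed (Cassandra starts very quickly, and its performance scales smoothly as nodes are added).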
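To see why gossip spreads news quickly, here is a toy simulation in JavaScript. It is not Cassandra's actual protocol (which exchanges node state between peers); it is a deliberately simplified push model assumed for illustration, in which every node that already knows a piece of news tells one randomly chosen node per round. The function name and its parameters are hypothetical.

```javascript
// Toy push-gossip model: each round, every informed node tells one random node.
// Returns how many rounds it took and how many nodes ended up informed.
function simulateGossip(nodeCount, random = Math.random, maxRounds = 10000) {
  const informed = new Array(nodeCount).fill(false);
  informed[0] = true; // node 0 is the first to learn the news
  let informedCount = 1;
  let rounds = 0;

  while (informedCount < nodeCount && rounds < maxRounds) {
    // Only nodes that were informed when the round started get to talk.
    const speakers = informed.slice();
    for (let i = 0; i < nodeCount; i++) {
      if (!speakers[i]) continue;
      const peer = Math.floor(random() * nodeCount);
      if (!informed[peer]) {
        informed[peer] = true;
        informedCount++;
      }
    }
    rounds++;
  }
  return { rounds, informedCount };
}

for (const n of [10, 100, 1000]) {
  const { rounds } = simulateGossip(n);
  console.log(n + " nodes informed after " + rounds + " rounds");
}
```

Because the number of informed nodes can at most double each round, the round count can never be below log2 of the cluster size, and in this model it typically stays within a small multiple of that, so it grows far more slowly than the cluster itself.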

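If you are looking for more evidence, you will be interested in a report from a group that has traditionally focused on hadoop (and so should be more partial to HBase)...

A report is worth a thousand words. I mean a graph

Yahoo! has carried out the first full benchmark of NOSQL systems. The research appears to confirm the performance advantage Cassandra enjoys, and the graphs lean heavily in Cassandra's favor.

These papers are still drafts for now. You can find them here:
http://www.brianfrankcooper.net/pubs/ycsb-v4.pdf
http://www.brianfrankcooper.net/pubs/ycsb.pdf

Note: in this report HBase beats Cassandra only on scanning a range of records. Although the Cassandra team believes it can quickly match HBase's times, it is still worth pointing out that in a typical Cassandra configuration range scans are almost impossible. I suggest you ignore this point, because in practice you should implement your own indexes on top of Cassandra rather than use range scans. If you are interested in range scans and in storing indexes in Cassandra, see this post of mine.

One last point: the Yahoo! research team behind this paper is trying to get their benchmark application through a legal department review so it can be released to the community. If they succeed, and I certainly hope they do, we will see an ongoing contest in which both HBase and Cassandra will no doubt further improve their performance.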

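Locking and useful modularity

No doubt you will hear from the HBase camp that HBase's complex structure lets it provide things that Cassandra's P2P architecture cannot. One example might be that HBase gives developers a row locking mechanism while Cassandra does not (in HBase, because data replication happens in the underlying hadoop layer, row locks can be controlled by a region server, whereas in Cassandra's P2P architecture all nodes are equal, so no node can act as a gateway responsible for locking replicated data).

However, I would turn this back into an argument about modularity, which actually works in Cassandra's favor. Cassandra implements the BigTable data model by storing data in a distributed way across symmetric nodes. It implements this completely, and in the most flexible and performant way. But if you need locks, transactions or other features, these can be added to your system as modules. For example, we found that we could use Zookeeper and its associated tools to provide scalable locking for our application quite simply (systems such as Hazelcast might also be able to do this, although we have not investigated them).

By minimizing its functionality to serve a narrow purpose, Cassandra's design seems to me to achieve its aim, as the configurable CAP trade-offs described above show. This modularity means you can build a system according to your needs: if you need locking, grab Zookeeper; if you need to store a full-text index, grab Lucandra; and so on. For developers like us, this means we do not have to deploy systems more complex than we actually need, which ultimately gives us a more flexible route to building the application we want.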

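MapReduce, don't mention MapReduce!

One thing Cassandra does not yet do well is MapReduce! For those not versed in the technology, a one-sentence explanation: it is a system for processing large amounts of data in parallel, for example extracting statistics from millions of pages crawled from the web. MapReduce and related systems such as Pig and Hive work well with HBase because it stores its data in HDFS, which is what these systems were designed to use. If you need to do this kind of data processing and analysis, HBase may be your best choice for now.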
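For readers who have never seen the pattern, here is a single-process sketch of the map and reduce steps in JavaScript, counting words across a handful of "pages". It only illustrates the shape of the computation: a real MapReduce system such as hadoop distributes both phases across many machines, and the function names here are made up for the example.

```javascript
// Map phase: turn one page into a list of [word, 1] pairs.
function mapPage(pageText) {
  return pageText
    .toLowerCase()
    .split(/\W+/)
    .filter(Boolean) // drop empty strings produced by leading/trailing separators
    .map((word) => [word, 1]);
}

// Reduce phase: sum the counts for each distinct word.
function reduceCounts(pairs) {
  const totals = new Map();
  for (const [word, count] of pairs) {
    totals.set(word, (totals.get(word) || 0) + count);
  }
  return totals;
}

// In a real cluster the map calls would run in parallel on different nodes.
function wordCount(pages) {
  return reduceCounts(pages.flatMap(mapPage));
}

const totals = wordCount([
  "HBase stores data in HDFS",
  "Cassandra stores data on symmetric nodes",
]);
console.log(totals.get("stores")); // 2
console.log(totals.get("hbase"));  // 1
```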

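Remember, it's horses for courses!

So I will stop singing Cassandra's praises here. In reality, HBase and Cassandra are not necessarily outright competitors. Although they can often be used for the same purposes, much like MySQL and PostgreSQL, I believe that in future they will become the preferred solutions for different applications. For example, as far as I know StumbleUpon uses HBase with hadoop MapReduce to process the large volumes of data in its business. Twitter now uses Cassandra to store real-time interactive community posts, so you may already be using it in some sense.

As a controversial parting shot, let's move on to the next topic.

Note: before moving on to the next section, I should point out that Cassandra will have hadoop support in version 0.6, so MapReduce integration should be better supported.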

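Dude, I can't lose data...

As a possible consequence of the earlier CAP arguments, there may be an impression that data is somehow safer in HBase than in Cassandra. This is the last secret about Cassandra that I want to reveal: when you write new data, it is in fact immediately written to the commit log on a quorum of the nodes that will store replicas, and also copied into those nodes' memory. This means that even if your whole cluster loses power, you are likely to lose very little data. Furthermore, the system uses Merkle trees to stop the data drifting too far out of sync (data entropy), which adds further to data safety :)

In truth, I am not entirely sure of the situation with HBase, and if more details come to light I will update this post as soon as possible. My current understanding is that because hadoop does not yet support append, HBase cannot efficiently flush modified block information to HDFS (where new changes to the data would be replicated into multiple copies and stored durably). This means there is a much larger window during which your latest changes are not visible (if I am wrong, which may well be the case, please tell me and I will correct this post).

So although the Cassandra of Greek mythology was very unlucky (translator's note: Cassandra is the name of the ill-fated Trojan prophetess in Greek mythology; see the wiki if you do not know the story), the data in your Cassandra is not in danger.

Note: Wade Amold points out that hadoop .21 will be released soon and will solve this problem for HBase.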

May 01

Browser Timeline

Language Timeline

May 01

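Sometimes I really can't tell who is copying whom. HTML5 now comes with a method for selecting elements: document.querySelectorAll

It is used in exactly the same way as jQuery's selector.

I guess the people revising the HTML5 spec were envious of how convenient jQuery is.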

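// Minimal jQuery-style selector helper built on querySelector/querySelectorAll.
function $(selector, context) {
  context = context || document;
  // A bare "#id" selector matches at most one element, so return it directly.
  if (/^#[\w-]+$/.test(selector)) return context.querySelector(selector);
  // Anything else returns a NodeList of every match.
  return context.querySelectorAll(selector);
}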

With the snippet above, HTML5 can pretty much replace jQuery's selector.

Examples:

var mydiv  = $("#mydiv");
var mylist = $("li");
var links  = $("a");
for (var i = 0; i < links.length; i++) console.log(links[i].href);
Apr 19

Apr 18

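This guy is an ace programmer who, I believe, once won a second prize in a Shanghai city-level competition.
After math class today he shouted: "XX, another Dodge! Ah, I lost, I lost..."
Then he told me he had played six rounds of Sha against his calculator during a single math class.
I asked him for details,
and he pulled out a slip of paper.

I don't really understand the specifics. He only said that it can represent suits, and that each card has a certain probability of turning up.
The calculator can play against you, and it supports up to six players.
There is an idle screen, a card-playing screen and so on. It's a lot of fun.
Then I asked him to give a demo.
He said he had to crack the calculator first,
then pressed mode and some digits (I forget exactly which) to enter the rectangular screen.

The picture in the next reply shows the battle screen: 888 means three health points, 8888 means four, and the fractional part seems to be the suit.
Next he used some function or other (too technical, I really couldn't remember it on the spot),
and by checking against that slip of paper you can work out judgments, card plays and so on.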
