I am importing a CSV file (using spark-csv) into a DataFrame which has empty String values. When I apply the OneHotEncoder, the application crashes with the error requirement failed: Cannot have an empty string for name.. Is there a way I can get around this?
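For context, here is a minimal sketch of how the DataFrame gets loaded; the file path and options are illustrative, not taken from my actual job:

// Hypothetical load via the spark-csv package; empty CSV cells arrive as
// empty strings rather than nulls, which is what later trips up the encoder.
val raw = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("data.csv")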
I can reproduce the error with the example provided on the Spark ML page:
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

val df = sqlContext.createDataFrame(Seq(
  (0, "a"),
  (1, "b"),
  (2, "c"),
  (3, ""),  //<- original example has "a" here
  (4, "a"),
  (5, "c")
)).toDF("id", "category")

val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .fit(df)
val indexed = indexer.transform(df)

val encoder = new OneHotEncoder()
  .setInputCol("categoryIndex")
  .setOutputCol("categoryVec")
val encoded = encoder.transform(indexed)
encoded.show()

It is annoying, since missing/empty values are a highly generic case.
Thanks in advance, Nikhil
Since OneHotEncoder does not accept an empty string for a name, you'll get the following error:
java.lang.IllegalArgumentException: requirement failed: Cannot have an empty string for name.
    at scala.Predef$.require(Predef.scala:233)
    at org.apache.spark.ml.attribute.Attribute$$anonfun$5.apply(attributes.scala:33)
    at org.apache.spark.ml.attribute.Attribute$$anonfun$5.apply(attributes.scala:32)
    [...]
Here is how I deal with it (there is another way to do it, cf. @Anthony's answer):
I'll create a UDF to process the empty category:
import org.apache.spark.sql.functions._

def processMissingCategory = udf[String, String] { s => if (s == "") "NA" else s }

Then, I'll apply the UDF on the column:
val df = sqlContext.createDataFrame(Seq(
  (0, "a"),
  (1, "b"),
  (2, "c"),
  (3, ""),  //<- original example has "a" here
  (4, "a"),
  (5, "c")
)).toDF("id", "category")
  .withColumn("category", processMissingCategory('category))

df.show
// +---+--------+
// | id|category|
// +---+--------+
// |  0|       a|
// |  1|       b|
// |  2|       c|
// |  3|      NA|
// |  4|       a|
// |  5|       c|
// +---+--------+
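As a side note, the same replacement can be done without a UDF using when/otherwise from org.apache.spark.sql.functions. A minimal sketch, assuming the column only ever contains non-null strings:

// Alternative sketch: map "" to "NA" with built-in column expressions.
import org.apache.spark.sql.functions.{when, col}
val dfNoUdf = df.withColumn("category",
  when(col("category") === "", "NA").otherwise(col("category")))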
Now, we can move on to the transformations:

val indexer = new StringIndexer().setInputCol("category").setOutputCol("categoryIndex").fit(df)
val indexed = indexer.transform(df)
indexed.show
// +---+--------+-------------+
// | id|category|categoryIndex|
// +---+--------+-------------+
// |  0|       a|          0.0|
// |  1|       b|          2.0|
// |  2|       c|          1.0|
// |  3|      NA|          3.0|
// |  4|       a|          0.0|
// |  5|       c|          1.0|
// +---+--------+-------------+

val encoder = new OneHotEncoder().setInputCol("categoryIndex").setOutputCol("categoryVec")
val encoded = encoder.transform(indexed)
encoded.show
// +---+--------+-------------+-------------+
// | id|category|categoryIndex|  categoryVec|
// +---+--------+-------------+-------------+
// |  0|       a|          0.0|(3,[0],[1.0])|
// |  1|       b|          2.0|(3,[2],[1.0])|
// |  2|       c|          1.0|(3,[1],[1.0])|
// |  3|      NA|          3.0|    (3,[],[])|
// |  4|       a|          0.0|(3,[0],[1.0])|
// |  5|       c|          1.0|(3,[1],[1.0])|
// +---+--------+-------------+-------------+
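Note that OneHotEncoder drops the last category by default, which is why the NA row (the highest index, 3.0) comes out as the all-zeros vector (3,[],[]). If you want NA to get its own slot instead, a sketch keeping the same column names as above:

// Keep all categories, including the one mapped to the highest index,
// so "NA" gets an explicit position in the vector.
import org.apache.spark.ml.feature.OneHotEncoder
val fullEncoder = new OneHotEncoder()
  .setInputCol("categoryIndex")
  .setOutputCol("categoryVec")
  .setDropLast(false)
val fullEncoded = fullEncoder.transform(indexed)
// categoryVec is now a 4-element vector, e.g. (4,[3],[1.0]) for the NA row.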
EDIT:

@Anthony's solution in Scala:
df.na.replace("category", Map("" -> "NA")).show
// +---+--------+
// | id|category|
// +---+--------+
// |  0|       a|
// |  1|       b|
// |  2|       c|
// |  3|      NA|
// |  4|       a|
// |  5|       c|
// +---+--------+

I hope this helps!
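P.S. If the CSV load also produces genuine nulls (not only empty strings), a sketch that handles both before indexing, assuming the same "category" column:

// Handle both nulls and empty strings in one pass (illustrative):
val cleaned = df
  .na.fill("NA", Seq("category"))            // nulls -> "NA"
  .na.replace("category", Map("" -> "NA"))   // ""    -> "NA"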