scala - Spark DataFrame handling empty String in OneHotEncoder


I am importing a CSV file (using spark-csv) into a DataFrame that has empty string values. When I apply the OneHotEncoder, the application crashes with the error "requirement failed: Cannot have an empty string for name." Is there a way I can get around this?

I can reproduce the error with the example provided on the Spark ML page:

    import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

    val df = sqlContext.createDataFrame(Seq(
      (0, "a"),
      (1, "b"),
      (2, "c"),
      (3, ""),         //<- original example has "a" here
      (4, "a"),
      (5, "c")
    )).toDF("id", "category")

    val indexer = new StringIndexer()
      .setInputCol("category")
      .setOutputCol("categoryIndex")
      .fit(df)
    val indexed = indexer.transform(df)

    val encoder = new OneHotEncoder()
      .setInputCol("categoryIndex")
      .setOutputCol("categoryVec")
    val encoded = encoder.transform(indexed)

    encoded.show()

This is annoying, since missing/empty values are a highly generic case.

Thanks in advance, Nikhil

The OneHotEncoder does not accept an empty string for a name; if you pass one, you'll get the following error:

    java.lang.IllegalArgumentException: requirement failed: Cannot have an empty string for name.
        at scala.Predef$.require(Predef.scala:233)
        at org.apache.spark.ml.attribute.Attribute$$anonfun$5.apply(attributes.scala:33)
        at org.apache.spark.ml.attribute.Attribute$$anonfun$5.apply(attributes.scala:32)
    [...]

This is how I would do it (there is another way to do it, cf. @Anthony's answer):

I'll create a UDF to process the empty category:

    import org.apache.spark.sql.functions._

    // Replace the empty string with a placeholder label; other values pass through.
    def processMissingCategory = udf[String, String] { s => if (s == "") "na" else s }
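The UDF above only catches the exact empty string. Depending on how the CSV was parsed, missing fields may also arrive as null or as whitespace-only strings. Below is a minimal plain-Scala sketch of a more defensive check; the name normalizeCategory and the placeholder parameter are hypothetical and not part of Spark.

```scala
// Hypothetical helper: `normalizeCategory` and the default "na" placeholder
// are illustrative choices, not part of any Spark API.
def normalizeCategory(s: String, placeholder: String = "na"): String =
  // Treat null, "" and whitespace-only strings as missing;
  // every other value is returned unchanged (not trimmed).
  if (s == null || s.trim.isEmpty) placeholder else s

// Quick check on plain values, no Spark required.
Seq("a", "", "   ", null).foreach { raw =>
  println(s"[$raw] -> ${normalizeCategory(raw)}")
}
```

A function like this could be passed to udf in place of the lambda above, assuming the column is of string type.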

Then I'll apply the UDF to the column:

    // 'category relies on sqlContext.implicits._ (imported automatically in spark-shell)
    val df = sqlContext.createDataFrame(Seq(
      (0, "a"),
      (1, "b"),
      (2, "c"),
      (3, ""),         //<- original example has "a" here
      (4, "a"),
      (5, "c")
    )).toDF("id", "category")
      .withColumn("category", processMissingCategory('category))

    df.show
    // +---+--------+
    // | id|category|
    // +---+--------+
    // |  0|       a|
    // |  1|       b|
    // |  2|       c|
    // |  3|      na|
    // |  4|       a|
    // |  5|       c|
    // +---+--------+

Now we can go on with the transformations:

    import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

    val indexer = new StringIndexer().setInputCol("category").setOutputCol("categoryIndex").fit(df)
    val indexed = indexer.transform(df)
    indexed.show
    // +---+--------+-------------+
    // | id|category|categoryIndex|
    // +---+--------+-------------+
    // |  0|       a|          0.0|
    // |  1|       b|          2.0|
    // |  2|       c|          1.0|
    // |  3|      na|          3.0|
    // |  4|       a|          0.0|
    // |  5|       c|          1.0|
    // +---+--------+-------------+

    val encoder = new OneHotEncoder().setInputCol("categoryIndex").setOutputCol("categoryVec")
    val encoded = encoder.transform(indexed)

    encoded.show
    // +---+--------+-------------+-------------+
    // | id|category|categoryIndex|  categoryVec|
    // +---+--------+-------------+-------------+
    // |  0|       a|          0.0|(3,[0],[1.0])|
    // |  1|       b|          2.0|(3,[2],[1.0])|
    // |  2|       c|          1.0|(3,[1],[1.0])|
    // |  3|      na|          3.0|    (3,[],[])|
    // |  4|       a|          0.0|(3,[0],[1.0])|
    // |  5|       c|          1.0|(3,[1],[1.0])|
    // +---+--------+-------------+-------------+
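The all-zero vector in row 3 can look like a lost category, but it is expected: StringIndexer gives the most frequent labels the lowest indices, and OneHotEncoder drops the last category by default (its dropLast parameter), so the highest index is encoded as all zeros. The plain-Scala sketch below mimics that behaviour on the same six values. It is an illustration only, not Spark's implementation: the helper names indexLabels and oneHot are hypothetical, and breaking frequency ties alphabetically is an assumption that happens to reproduce the indices in the table above.

```scala
// Plain-Scala sketch, not Spark's implementation.
// `indexLabels` and `oneHot` are hypothetical helper names.

// Assign indices by descending frequency. Ties are broken alphabetically
// here, which is an assumption made for this illustration.
def indexLabels(values: Seq[String]): Map[String, Int] =
  values
    .groupBy(identity)
    .toSeq
    .map { case (label, occurrences) => (label, occurrences.size) }
    .sortBy { case (label, count) => (-count, label) }
    .map(_._1)
    .zipWithIndex
    .toMap

// One-hot encode an index. With dropLast = true the vector has
// numCategories - 1 slots, so the last index becomes all zeros.
def oneHot(index: Int, numCategories: Int, dropLast: Boolean = true): Vector[Double] = {
  val size = if (dropLast) numCategories - 1 else numCategories
  Vector.tabulate(size)(i => if (i == index) 1.0 else 0.0)
}

val categories = Seq("a", "b", "c", "na", "a", "c")
val index = indexLabels(categories)

categories.foreach { c =>
  println(s"$c -> ${index(c)} -> ${oneHot(index(c), index.size)}")
}
```

If a dedicated slot for the placeholder category is wanted, calling setDropLast(false) on the encoder should keep all four positions.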

Edit:

@Anthony's solution in Scala:

    df.na.replace("category", Map("" -> "na")).show
    // +---+--------+
    // | id|category|
    // +---+--------+
    // |  0|       a|
    // |  1|       b|
    // |  2|       c|
    // |  3|      na|
    // |  4|       a|
    // |  5|       c|
    // +---+--------+

I hope this helps!

