I have a PySpark DataFrame with 3 columns. The DDL of the Hive table 'test1' declares all columns as string data types. df.printSchema() shows the string types below:
>>> df = spark.sql("select * from default.test1")
>>> df.printSchema()
root
 |-- c1: string (nullable = true)
 |-- c2: string (nullable = true)
 |-- c3: string (nullable = true)

+----------+--------------+-------------------+
|c1        |c2            |c3                 |
+----------+--------------+-------------------+
|april     |20132014      |4                  |
|may       |20132014      |5                  |
|june      |abcdefgh      |6                  |
+----------+--------------+-------------------+
Now I want to filter the records whose 'c2' column holds an integer value. I need the first 2 records, where c2 is the integer '20132014', and want to exclude the other records.
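For anyone without the Hive table handy, here is a minimal sketch that builds an equivalent DataFrame locally; the rows and column names are taken from the output above, and the SparkSession setup is assumed:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# All three columns are strings, matching the Hive DDL
df = spark.createDataFrame(
    [("april", "20132014", "4"),
     ("may",   "20132014", "5"),
     ("june",  "abcdefgh", "6")],
    ["c1", "c2", "c3"],
)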
In one line:

df.withColumn("c2", df["c2"].cast("integer")).na.drop(subset=["c2"])
If c2 is not a valid integer, the cast produces null, and that row is dropped in the subsequent na.drop step.
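Putting it together on the sample data above (a sketch; the show() output assumes those three rows):

# Cast c2 to integer; non-numeric values become null and are dropped
df_int = df.withColumn("c2", df["c2"].cast("integer")).na.drop(subset=["c2"])
df_int.show()
# +-----+--------+---+
# |   c1|      c2| c3|
# +-----+--------+---+
# |april|20132014|  4|
# |  may|20132014|  5|
# +-----+--------+---+

Note that c2 is now an integer column in df_int.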
Without changing the type:
valid = df.where(df["c2"].cast("integer").isNotNull())
invalid = df.where(df["c2"].cast("integer").isNull())
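Here df itself is untouched and c2 stays a string in both results; a quick check against the sample data:

valid.show()    # the april and may rows, c2 still the string "20132014"
invalid.show()  # only the june row, where c2 is "abcdefgh"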