Pandas DataFrame to Spark DataFrame: "Can not merge type" error

Author: 手机用户2402851155 | 2023-09-07 17:33

1> zero323:

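Long story short: don't depend on schema inference. In general it is both expensive and tricky. In particular, some columns in your data (for example event_dt_num) have missing values, which pushes Pandas to represent them as mixed types (strings where a value is present, NaN where it is missing).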
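
The mixed-type effect is easy to reproduce without Spark. The sketch below uses a made-up two-row sample (the values are illustrative, not taken from demo2016q1.csv) and assumes pandas is installed. It lists the Python types that end up stored in a text column with a gap, which is what schema inference then has to reconcile.

```python
import io

import pandas as pd

# Hypothetical two-row sample; the second row has no event_dt_num value.
sample = io.StringIO("primaryid,event_dt_num\n1,1/2/2016\n2,\n")
pdf = pd.read_csv(sample)

# Collect the names of the Python types actually stored in the column.
value_types = sorted({type(v).__name__ for v in pdf["event_dt_num"].tolist()})
print(value_types)
```

The present value is a str while the missing one is a float NaN, so the column has no single type that inference could settle on.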

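If in doubt, it is better to read all data as strings and cast afterwards. If you have access to a code book, you should always provide a schema to avoid problems and reduce the overall cost.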
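
On the pandas side, "read everything as strings and convert afterwards" could look like the following sketch. The sample data and the choice of the age column are hypothetical; `dtype=str` and `keep_default_na=False` are standard `pandas.read_csv` parameters, and `pd.to_numeric` performs the explicit conversion.

```python
import io

import pandas as pd

# Hypothetical sample with a missing age in the second row.
sample = io.StringIO("primaryid,age\n100,34\n101,\n")

# dtype=str keeps every field as text; keep_default_na=False stops pandas
# from turning empty fields into NaN, so each column holds only strings.
pdf = pd.read_csv(sample, dtype=str, keep_default_na=False)
all_strings = all(isinstance(v, str) for v in pdf["age"].tolist())

# Convert explicitly once the data is loaded; values that are not numbers
# (here the empty string) become missing instead of raising an error.
age = pd.to_numeric(pdf["age"], errors="coerce")
print(all_strings, age.isna().tolist())
```

The conversion step is then a deliberate, per-column decision rather than something inferred from whatever values happen to be present.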

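Finally, passing data from the driver is an anti-pattern. You should be able to read this data directly using the csv format (Spark 2.0.0+) or the spark-csv library (Spark 1.6 and below):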

df = (spark.read.format("csv").options(header="true")
    .load("/path/to/demo2016q1.csv"))

df.printSchema()

## root
##  |-- primaryid: string (nullable = true)
##  |-- caseid: string (nullable = true)
##  |-- caseversion: string (nullable = true)
##  |-- i_f_code: string (nullable = true)
##  |-- i_f_code_num: string (nullable = true)
##   ...
##  |-- to_mfr: string (nullable = true)
##  |-- occp_cod: string (nullable = true)
##  |-- reporter_country: string (nullable = true)
##  |-- occr_country: string (nullable = true)
##  |-- occp_cod_num: string (nullable = true)

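In this particular case adding the inferSchema="true" option should work as well, but it is still better to avoid it. You can also provide a schema as follows: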

from pyspark.sql.types import StructType

schema = StructType.fromJson({'fields': [{'metadata': {},
   'name': 'primaryid',
   'nullable': True,
   'type': 'integer'},
  {'metadata': {}, 'name': 'caseid', 'nullable': True, 'type': 'integer'},
  {'metadata': {}, 'name': 'caseversion', 'nullable': True, 'type': 'integer'},
  {'metadata': {}, 'name': 'i_f_code', 'nullable': True, 'type': 'string'},
  {'metadata': {},
   'name': 'i_f_code_num',
   'nullable': True,
   'type': 'integer'},
  {'metadata': {}, 'name': 'event_dt', 'nullable': True, 'type': 'integer'},
  {'metadata': {}, 'name': 'event_dt_num', 'nullable': True, 'type': 'string'},
  {'metadata': {}, 'name': 'mfr_dt', 'nullable': True, 'type': 'integer'},
  {'metadata': {}, 'name': 'mfr_dt_num', 'nullable': True, 'type': 'string'},
  {'metadata': {}, 'name': 'init_fda_dt', 'nullable': True, 'type': 'integer'},
  {'metadata': {},
   'name': 'init_fda_dt_num',
   'nullable': True,
   'type': 'string'},
  {'metadata': {}, 'name': 'fda_dt', 'nullable': True, 'type': 'integer'},
  {'metadata': {}, 'name': 'fda_dt_num', 'nullable': True, 'type': 'string'},
  {'metadata': {}, 'name': 'rept_cod', 'nullable': True, 'type': 'string'},
  {'metadata': {},
   'name': 'rept_cod_num',
   'nullable': True,
   'type': 'integer'},
  {'metadata': {}, 'name': 'auth_num', 'nullable': True, 'type': 'string'},
  {'metadata': {}, 'name': 'mfr_num', 'nullable': True, 'type': 'string'},
  {'metadata': {}, 'name': 'mfr_sndr', 'nullable': True, 'type': 'string'},
  {'metadata': {}, 'name': 'lit_ref', 'nullable': True, 'type': 'string'},
  {'metadata': {}, 'name': 'age', 'nullable': True, 'type': 'double'},
  {'metadata': {}, 'name': 'age_cod', 'nullable': True, 'type': 'string'},
  {'metadata': {}, 'name': 'age_grp', 'nullable': True, 'type': 'string'},
  {'metadata': {}, 'name': 'age_grp_num', 'nullable': True, 'type': 'string'},
  {'metadata': {}, 'name': 'sex', 'nullable': True, 'type': 'string'},
  {'metadata': {}, 'name': 'e_sub', 'nullable': True, 'type': 'string'},
  {'metadata': {}, 'name': 'wt', 'nullable': True, 'type': 'double'},
  {'metadata': {}, 'name': 'wt_cod', 'nullable': True, 'type': 'string'},
  {'metadata': {}, 'name': 'rept_dt', 'nullable': True, 'type': 'integer'},
  {'metadata': {}, 'name': 'rept_dt_num', 'nullable': True, 'type': 'string'},
  {'metadata': {}, 'name': 'to_mfr', 'nullable': True, 'type': 'string'},
  {'metadata': {}, 'name': 'occp_cod', 'nullable': True, 'type': 'string'},
  {'metadata': {},
   'name': 'reporter_country',
   'nullable': True,
   'type': 'string'},
  {'metadata': {}, 'name': 'occr_country', 'nullable': True, 'type': 'string'},
  {'metadata': {},
   'name': 'occp_cod_num',
   'nullable': True,
   'type': 'integer'}],
 'type': 'struct'})

and pass it directly to the reader:

(spark.read.schema(schema).format("csv").options(header="true")
    .load("/path/to/demo2016q1.csv"))
