Connection refused when reading a file from HDFS with pyspark

I installed Hadoop 2.7, set the paths, and configured core-site.xml and hdfs-site.xml as shown below:

core-site.xml

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://<ip_addr>:9000/</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/home/kavya/hdfs/data</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/home/kavya/hdfs/name</value>
  </property>
</configuration>
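
To double-check which NameNode address a config like this actually declares, the value can be parsed out of the XML. The following is only a minimal sketch: the sample string, the placeholder address 192.0.2.10, and the helper name namenode_endpoint are assumptions for illustration, not part of my setup. It looks for fs.default.name as well as fs.defaultFS, the name that replaced it in Hadoop 2.x.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Sample config for illustration; 192.0.2.10 is a placeholder address.
SAMPLE_CORE_SITE = """
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://192.0.2.10:9000/</value>
  </property>
</configuration>
"""


def namenode_endpoint(xml_text):
    """Return (host, port) from fs.defaultFS or the older fs.default.name."""
    root = ET.fromstring(xml_text)
    for prop in root.findall("property"):
        name = prop.findtext("name", default="").strip()
        if name in ("fs.defaultFS", "fs.default.name"):
            uri = urlparse(prop.findtext("value", default="").strip())
            return uri.hostname, uri.port
    # Neither property is present in this file.
    return None


print(namenode_endpoint(SAMPLE_CORE_SITE))
```

If the port printed here differs from the one in the error message, the client is probably not reading this file or is using a URI that overrides it.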

hdfs-site.xml:

<configuration>
  <property>
    <name>fs.default.name</name>

    <value>hdfs://<ip_addr>:9000/</value>
  </property>
  <property>
    <name>dfs.data.dir</name>

    <value>/home/kavya/hdfs/data</value>
  </property>
  <property>
    <name>dfs.name.dir</name>

    <value>/home/kavya/hdfs/name</value>
  </property>
</configuration>

I also started HDFS with start-dfs.sh. Even though the IP address is specified in the configuration, I get a connection refused error such as:

Call From spark/<ip_addr> to localhost:8020 failed on connection exception: java.net.ConnectException:Connection refused
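
The error names localhost:8020, while the configuration names port 9000. Port 8020 is the default NameNode RPC port that the HDFS client uses when a URI carries no port. A quick way to see which port is actually accepting connections is a plain TCP check. This is a rough sketch: the helper name port_is_open is hypothetical, and the two ports are simply the ones mentioned in this post.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unresolvable hosts.
        return False


# Example use on the machine running the Spark driver:
# for port in (9000, 8020):
#     print(port, port_is_open("localhost", port))
```

A refused connection on 8020 together with an open 9000 would suggest the client is falling back to the default port rather than using the configured one.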

I stored the file in HDFS on my VM with the following command:

hadoop fs -put /opt/TestLogs/traffic_log.log /usr/local/hadoop/TestLogs

This is the part of my pyspark code that reads the file from HDFS and then extracts the fields:

file = sc.textFile("hdfs://<ip_addr>/usr/local/hadoop/TestLogs/traffic_log.log")
result = file.filter(lambda x: len(x)>0)
result = result.map(lambda x: x.split("\n"))
print(result) # PythonRDD[2] at RDD at PythonRDD.scala

lines = result.map(func1).collect() #this is where I get the connection refused error.
print(lines)

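func1 is a function containing regular expressions that extracts fields from my logs. The result is then returned in lines. This program works fine when the text file is read directly from the VM.

Spark version: spark-2.0.2-bin-hadoop2.7. VM: CentOS.

How can I resolve this error? Am I missing something?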
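
One thing worth ruling out: the URI passed to sc.textFile above has no port, so the HDFS client would use its default (8020) instead of the 9000 given in the configuration. Below is a small sketch of building the URI with an explicit port. The helper build_hdfs_uri and the placeholder address 192.0.2.10 are assumptions for illustration, and it presumes the NameNode really listens on 9000 as configured.

```python
def build_hdfs_uri(host, path, port=9000):
    """Build an hdfs:// URI with an explicit NameNode port."""
    if not path.startswith("/"):
        path = "/" + path
    return "hdfs://{}:{}{}".format(host, port, path)


# 192.0.2.10 stands in for the real NameNode address.
uri = build_hdfs_uri("192.0.2.10", "/usr/local/hadoop/TestLogs/traffic_log.log")
print(uri)

# With a live SparkContext `sc`, the read would then be:
# file = sc.textFile(uri)
```

Because Spark evaluates lazily, sc.textFile and the filter/map steps do not contact HDFS; the connection is first attempted at collect(), which matches where the error appears.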
